class: right, top, my-title, title-slide # k Nearest Neighbors ### Jo Hardin ### October 26, 2021 --- # Agenda 10/26/21 1. Redux - model process 2. `\(k\)`-Nearest Neighbors 3. cross validation --- ## `tidymodels` syntax 1. partition the data 2. build a recipe 3. select a model 4. create a workflow 5. fit the model 6. validate the model --- ## All together .panelset[ .panel[.panel-name[recipe] ```r penguin_lm_recipe <- recipe(body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + flipper_length_mm + sex + year, data = penguin_train) %>% step_mutate(year = as.factor(year)) %>% step_unknown(sex, new_level = "unknown") %>% step_relevel(sex, ref_level = "female") %>% update_role(island, new_role = "id variable") summary(penguin_lm_recipe) ``` ``` ## # A tibble: 8 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 species nominal predictor original ## 2 island nominal id variable original ## 3 bill_length_mm numeric predictor original ## 4 bill_depth_mm numeric predictor original ## 5 flipper_length_mm numeric predictor original ## 6 sex nominal predictor original ## 7 year numeric predictor original ## 8 body_mass_g numeric outcome original ``` ] .panel[.panel-name[model] ```r penguin_lm <- linear_reg() %>% set_engine("lm") penguin_lm ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` ] .panel[.panel-name[workflow] ```r penguin_lm_wflow <- workflow() %>% add_model(penguin_lm) %>% add_recipe(penguin_lm_recipe) penguin_lm_wflow ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 3 Recipe Steps ## ## • step_mutate() ## • step_unknown() ## • step_relevel() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` 
]
.panel[.panel-name[fit]

```r
penguin_lm_fit <- penguin_lm_wflow %>%
  fit(data = penguin_train)

penguin_lm_fit %>% tidy()
```

```
## # A tibble: 10 × 5
##    term              estimate std.error statistic    p.value
##    <chr>                <dbl>     <dbl>     <dbl>      <dbl>
##  1 (Intercept)       -2417.5    664.73   -3.6368  3.3624e- 4
##  2 speciesChinstrap   -208.39    92.899  -2.2432  2.5776e- 2
##  3 speciesGentoo       984.90   152.04    6.4781  5.0203e-10
##  4 bill_length_mm       13.531    8.2871  1.6328  1.0378e- 1
##  5 bill_depth_mm        80.899   22.112   3.6586  3.1028e- 4
##  6 flipper_length_mm    20.771    3.6200  5.7378  2.8080e- 8
##  7 sexmale             350.57    52.597   6.6651  1.7239e-10
##  8 sexunknown           47.576  103.32    0.46049 6.4557e- 1
##  9 year2008            -24.774   47.511  -0.52145 6.0252e- 1
## 10 year2009            -61.895   46.008  -1.3453  1.7976e- 1
```

]
]

---

## model parameters

* Some model parameters are tuned from the data (some aren't).
  - linear model coefficients are optimized (not tuned)
  - the `\(k\)`-NN value of `\(k\)` is tuned
* If the model is tuned using the data, the same data **cannot** be used to assess the model.
* With Cross Validation, you iteratively put data in your pocket.
* For example, keep 1/5 of the data in your pocket, and build the model on the remaining 4/5 of the data.
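
The "pocket" idea can be sketched with `vfold_cv()` from **rsample** (a minimal sketch; `penguin_train` comes from the earlier partition):

```r
library(rsample)

# keep 1/5 of the data in your pocket, five different times
set.seed(47)
penguin_folds <- vfold_cv(penguin_train, v = 5)

# each split uses 4/5 of the rows for analysis (model building)
# and holds out the remaining 1/5 for assessment
penguin_folds
```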
--- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide2.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide3.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide4.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide5.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide6.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide7.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide8.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide9.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide10.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide11.png" 
alt="Image credit: Alison Hill" width="3999" />
<p class="caption">Image credit: Alison Hill</p>
</div>

---

## model parameters

* Some model parameters are tuned from the data (some aren't).
  - linear model coefficients are optimized (not tuned)
  - the `\(k\)`-NN value of `\(k\)` is tuned
* If the model is tuned using the data, the same data **cannot** be used to assess the model.
* With Cross Validation, you iteratively put data in your pocket.
* For example, keep 1/5 of the data in your pocket, and build the model on the remaining 4/5 of the data.

---

## `\(k\)`-Nearest Neighbors

The `\(k\)`-Nearest Neighbor algorithm does exactly what it sounds like it does.

* the user decides on the integer value for `\(k\)`
* the user decides on a distance metric (most `\(k\)`-NN implementations default to Euclidean distance)
* a point is classified into the same group as the majority of the `\(k\)` **closest** points in the training data.

---

## `\(k\)`-NN visually

Consider a population, a training set, and a decision boundary:

<div class="figure" style="text-align: center">
<img src="../images/knnmodel.jpg" alt="Population structure is shown as 2D concentric rings. The dataset is a sample of points from the rings. The decision boundary is a jagged region roughly approximating the concentric rings." width="80%" />
<p class="caption">image credit: Ricardo Gutierrez-Osuna</p>
</div>

---

## `\(k\)`-NN visually

Choosing `\(k\)` well is one of the most important aspects of the algorithm.

<div class="figure" style="text-align: center">
<img src="../images/knnK.jpg" alt="For each of k = 1, 5, 20, the resulting decision boundary is shown. With k = 20 the decision boundary clearly does not approximate the population."
width="80%" /> <p class="caption">image credit: Ricardo Gutierrez-Osuna</p> </div> --- ## `\(k\)`-NN to predict penguin species .panelset[ .panel[.panel-name[recipe] ```r penguin_knn_recipe <- recipe(species ~ body_mass_g + island + bill_length_mm + bill_depth_mm + flipper_length_mm, data = penguin_train) %>% update_role(island, new_role = "id variable") %>% step_normalize(all_predictors()) summary(penguin_knn_recipe) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 body_mass_g numeric predictor original ## 2 island nominal id variable original ## 3 bill_length_mm numeric predictor original ## 4 bill_depth_mm numeric predictor original ## 5 flipper_length_mm numeric predictor original ## 6 species nominal outcome original ``` ] .panel[.panel-name[model] ```r penguin_knn <- nearest_neighbor() %>% set_engine("kknn") %>% set_mode("classification") penguin_knn ``` ``` ## K-Nearest Neighbor Model Specification (classification) ## ## Computational engine: kknn ``` ] .panel[.panel-name[workflow] ```r penguin_knn_wflow <- workflow() %>% add_model(penguin_knn) %>% add_recipe(penguin_knn_recipe) penguin_knn_wflow ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 1 Recipe Step ## ## • step_normalize() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## K-Nearest Neighbor Model Specification (classification) ## ## Computational engine: kknn ``` ] .panel[.panel-name[fit] ```r penguin_knn_fit <- penguin_knn_wflow %>% fit(data = penguin_train) ``` ] .panel[.panel-name[predict] ```r penguin_knn_fit %>% predict(new_data = penguin_test) %>% cbind(penguin_test) %>% metrics(truth = species, estimate = .pred_class) %>% filter(.metric == "accuracy") ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> 
<dbl> ## 1 accuracy multiclass 0.98837 ``` ] ] --- ## what is `\(k\)` ??? It turns out that the default value for `\(k\)` in the **kknn** engine is 7. Is 7 best? #### Cross Validation!!! The red observations are used to fit the model, the black observations are used to assess the model. <div class="figure" style="text-align: center"> <img src="../images/CV/Slide11.png" alt="Image credit: Alison Hill" width="60%" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation Randomly split the training data into V distinct blocks of roughly equal size. * leave out the first block of analysis data and fit a model. * the model is used to predict the held-out block of assessment data. * continue the process until all V assessment blocks have been predicted. The tuned parameter is usually chosen to be the one which produces the best performance averaged across the V blocks. The final performance is usually based on the test data. --- ## Extending the modeling process .panelset[ .panel[.panel-name[creating folds] ```r set.seed(470) penguin_vfold <- vfold_cv(penguin_train, v = 3, strata = species) ``` ] .panel[.panel-name[k] ```r k_grid <- data.frame(neighbors = seq(1, 15, by = 4)) k_grid ``` ``` ## neighbors ## 1 1 ## 2 5 ## 3 9 ## 4 13 ``` ] .panel[.panel-name[tune wkflow] ```r penguin_knn_tune <- nearest_neighbor(neighbors = tune()) %>% set_engine("kknn") %>% set_mode("classification") penguin_knn_wflow_tune <- workflow() %>% add_model(penguin_knn_tune) %>% add_recipe(penguin_knn_recipe) ``` ] .panel[.panel-name[tuning] ```r penguin_knn_wflow_tune %>% tune_grid(resamples = penguin_vfold, grid = k_grid) %>% collect_metrics() %>% filter(.metric == "accuracy") ``` ``` ## # A tibble: 4 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 accuracy multiclass 0.97106 2 0.0059476 Preprocessor1_Model1 ## 2 5 accuracy multiclass 0.97688 2 0.00013365 Preprocessor1_Model2 ## 3 9 accuracy multiclass 
0.98844 2 0.000066827 Preprocessor1_Model3 ## 4 13 accuracy multiclass 0.98269 2 0.0056803 Preprocessor1_Model4 ``` ] ] --- ## We choose `\(k\)` = 9 ! ### 6. Validate the model .panelset[ .panel[.panel-name[recipe] ```r penguin_knn_recipe <- recipe(species ~ body_mass_g + island + bill_length_mm + bill_depth_mm + flipper_length_mm, data = penguin_train) %>% update_role(island, new_role = "id variable") %>% step_normalize(all_predictors()) summary(penguin_knn_recipe) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 body_mass_g numeric predictor original ## 2 island nominal id variable original ## 3 bill_length_mm numeric predictor original ## 4 bill_depth_mm numeric predictor original ## 5 flipper_length_mm numeric predictor original ## 6 species nominal outcome original ``` ] .panel[.panel-name[model] ```r penguin_knn_final <- nearest_neighbor(neighbors = 9) %>% set_engine("kknn") %>% set_mode("classification") penguin_knn_final ``` ``` ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 9 ## ## Computational engine: kknn ``` ] .panel[.panel-name[workflow] ```r penguin_knn_wflow_final <- workflow() %>% add_model(penguin_knn_final) %>% add_recipe(penguin_knn_recipe) penguin_knn_wflow_final ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 1 Recipe Step ## ## • step_normalize() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 9 ## ## Computational engine: kknn ``` ] .panel[.panel-name[fit] ```r penguin_knn_fit_final <- penguin_knn_wflow_final %>% fit(data = penguin_train) ``` ] .panel[.panel-name[predict] ```r penguin_knn_fit_final %>% predict(new_data = penguin_test) 
%>% cbind(penguin_test) %>% metrics(truth = species, estimate = .pred_class) %>% filter(.metric == "accuracy")
```

```
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass   0.97674
```

]
]

---

## We choose `\(k\)` = 9 !

### 6. Validate the model

Huh. Seems like `\(k = 9\)` didn't do as well as `\(k = 7\)` (the value we tried at the very beginning, before cross validating). Well, it turns out, that's the nature of variability, randomness, and model building. We don't know the truth, and we won't ever find a perfect model.

---

## Bias-Variance Tradeoff

<div class="figure" style="text-align: center">
<img src="../images/varbias.png" alt="Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR" width="90%" />
<p class="caption">Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild1.png" alt="Image credit: https://www.tmwr.org/" width="2176" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild2.png" alt="Image credit: https://www.tmwr.org/" width="2067" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure" style="text-align: center">
<img src="../images/modelbuild3.png" alt="Image credit: https://www.tmwr.org/" width="70%" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>
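
---

## `\(k\)`-NN by hand

The classification rule from earlier (Euclidean distance, majority vote) can be sketched in a few lines of base R. This is an illustration only; `knn_one` is a made-up helper, not what the **kknn** engine does internally:

```r
# classify one new point by majority vote of its k nearest training points
knn_one <- function(new_x, train_x, train_y, k = 7) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # labels of the k closest training points
  nbrs <- train_y[order(dists)[1:k]]
  # majority vote
  names(which.max(table(nbrs)))
}
```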