class: right, top, my-title, title-slide # k Nearest Neighbors ### Jo Hardin ### October 26, 2021 --- # Agenda 10/26/21 1. Redux - model process 2. `\(k\)`-Nearest Neighbors 3. cross validation --- ## `tidymodels` syntax 1. partition the data 2. build a recipe 3. select a model 4. create a workflow 5. fit the model 6. validate the model --- ## All together .panelset[ .panel[.panel-name[recipe] ```r penguin_lm_recipe <- recipe(body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + flipper_length_mm + sex + year, data = penguin_train) %>% step_mutate(year = as.factor(year)) %>% step_unknown(sex, new_level = "unknown") %>% step_relevel(sex, ref_level = "female") %>% update_role(island, new_role = "id variable") summary(penguin_lm_recipe) ``` ``` ## # A tibble: 8 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 species nominal predictor original ## 2 island nominal id variable original ## 3 bill_length_mm numeric predictor original ## 4 bill_depth_mm numeric predictor original ## 5 flipper_length_mm numeric predictor original ## 6 sex nominal predictor original ## 7 year numeric predictor original ## 8 body_mass_g numeric outcome original ``` ] .panel[.panel-name[model] ```r penguin_lm <- linear_reg() %>% set_engine("lm") penguin_lm ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` ] .panel[.panel-name[workflow] ```r penguin_lm_wflow <- workflow() %>% add_model(penguin_lm) %>% add_recipe(penguin_lm_recipe) penguin_lm_wflow ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 3 Recipe Steps ## ## • step_mutate() ## • step_unknown() ## • step_relevel() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` 
]
.panel[.panel-name[fit]

```r
penguin_lm_fit <- penguin_lm_wflow %>%
  fit(data = penguin_train)

penguin_lm_fit %>% tidy()
```

```
## # A tibble: 10 × 5
##    term              estimate std.error statistic    p.value
##    <chr>                <dbl>     <dbl>     <dbl>      <dbl>
##  1 (Intercept)       -2417.5    664.73   -3.6368  3.3624e- 4
##  2 speciesChinstrap   -208.39    92.899  -2.2432  2.5776e- 2
##  3 speciesGentoo       984.90   152.04    6.4781  5.0203e-10
##  4 bill_length_mm       13.531    8.2871  1.6328  1.0378e- 1
##  5 bill_depth_mm        80.899   22.112   3.6586  3.1028e- 4
##  6 flipper_length_mm    20.771    3.6200  5.7378  2.8080e- 8
##  7 sexmale             350.57    52.597   6.6651  1.7239e-10
##  8 sexunknown           47.576  103.32    0.46049 6.4557e- 1
##  9 year2008            -24.774   47.511  -0.52145 6.0252e- 1
## 10 year2009            -61.895   46.008  -1.3453  1.7976e- 1
```

]
]

---

## model parameters

* Some model parameters are tuned from the data (some aren't).
  - linear model coefficients are optimized (not tuned)
  - the `\(k\)`-NN value of `\(k\)` is tuned
* If the model is tuned using the data, the same data **cannot** be used to assess the model.
* With Cross Validation, you iteratively put data in your pocket.
* For example, keep 1/5 of the data in your pocket, and build the model on the remaining 4/5 of the data.
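
The "pocket" idea can be sketched with `vfold_cv()` from **rsample** (a minimal sketch; `penguin_train` comes from the earlier partition):

```r
library(rsample)

# keep 1/5 of the data in your pocket, five different times
set.seed(47)
penguin_folds <- vfold_cv(penguin_train, v = 5)

# each split uses 4/5 of the rows for analysis (model building)
# and holds out the remaining 1/5 for assessment
penguin_folds
```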
--- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide2.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide3.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide4.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide5.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide6.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide7.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide8.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide9.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide10.png" alt="Image credit: Alison Hill" width="3999" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation ### for tuning parameters <div class="figure"> <img src="../images/CV/Slide11.png" 
alt="Image credit: Alison Hill" width="3999" />
<p class="caption">Image credit: Alison Hill</p>
</div>

---

## model parameters

* Some model parameters are tuned from the data (some aren't).
  - linear model coefficients are optimized (not tuned)
  - the `\(k\)`-NN value of `\(k\)` is tuned
* If the model is tuned using the data, the same data **cannot** be used to assess the model.
* With Cross Validation, you iteratively put data in your pocket.
* For example, keep 1/5 of the data in your pocket, and build the model on the remaining 4/5 of the data.

---

## `\(k\)`-Nearest Neighbors

The `\(k\)`-Nearest Neighbor algorithm does exactly what it sounds like it does.

* the user decides on the integer value for `\(k\)`
* the user decides on a distance metric (most `\(k\)`-NN implementations default to Euclidean distance)
* a point is classified into the same group as the majority of the `\(k\)` **closest** points in the training data.

---

## `\(k\)`-NN visually

Consider a population, a training set, and a decision boundary:

<div class="figure" style="text-align: center">
<img src="../images/knnmodel.jpg" alt="Population structure is shown as 2D concentric rings. The dataset is a sample of points from the rings. The decision boundary is a jagged region roughly approximating the concentric rings." width="80%" />
<p class="caption">image credit: Ricardo Gutierrez-Osuna</p>
</div>

---

## `\(k\)`-NN visually

Choosing `\(k\)` well is one of the most important aspects of the algorithm.

<div class="figure" style="text-align: center">
<img src="../images/knnK.jpg" alt="For each of k = 1, 5, 20, the resulting decision boundary is shown. With k = 20 the decision boundary clearly does not approximate the population."
width="80%" /> <p class="caption">image credit: Ricardo Gutierrez-Osuna</p> </div> --- ## `\(k\)`-NN to predict penguin species .panelset[ .panel[.panel-name[recipe] ```r penguin_knn_recipe <- recipe(species ~ body_mass_g + island + bill_length_mm + bill_depth_mm + flipper_length_mm, data = penguin_train) %>% update_role(island, new_role = "id variable") %>% step_normalize(all_predictors()) summary(penguin_knn_recipe) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 body_mass_g numeric predictor original ## 2 island nominal id variable original ## 3 bill_length_mm numeric predictor original ## 4 bill_depth_mm numeric predictor original ## 5 flipper_length_mm numeric predictor original ## 6 species nominal outcome original ``` ] .panel[.panel-name[model] ```r penguin_knn <- nearest_neighbor() %>% set_engine("kknn") %>% set_mode("classification") penguin_knn ``` ``` ## K-Nearest Neighbor Model Specification (classification) ## ## Computational engine: kknn ``` ] .panel[.panel-name[workflow] ```r penguin_knn_wflow <- workflow() %>% add_model(penguin_knn) %>% add_recipe(penguin_knn_recipe) penguin_knn_wflow ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 1 Recipe Step ## ## • step_normalize() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## K-Nearest Neighbor Model Specification (classification) ## ## Computational engine: kknn ``` ] .panel[.panel-name[fit] ```r penguin_knn_fit <- penguin_knn_wflow %>% fit(data = penguin_train) ``` ] .panel[.panel-name[predict] ```r penguin_knn_fit %>% predict(new_data = penguin_test) %>% cbind(penguin_test) %>% metrics(truth = species, estimate = .pred_class) %>% filter(.metric == "accuracy") ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> 
<dbl> ## 1 accuracy multiclass 0.98837 ``` ] ] --- ## what is `\(k\)` ??? It turns out that the default value for `\(k\)` in the **kknn** engine is 7. Is 7 best? #### Cross Validation!!! The red observations are used to fit the model, the black observations are used to assess the model. <div class="figure" style="text-align: center"> <img src="../images/CV/Slide11.png" alt="Image credit: Alison Hill" width="60%" /> <p class="caption">Image credit: Alison Hill</p> </div> --- ## Cross validation Randomly split the training data into V distinct blocks of roughly equal size. * leave out the first block of analysis data and fit a model. * the model is used to predict the held-out block of assessment data. * continue the process until all V assessment blocks have been predicted. The tuned parameter is usually chosen to be the one which produces the best performance averaged across the V blocks. The final performance is usually based on the test data. --- ## Extending the modeling process .panelset[ .panel[.panel-name[creating folds] ```r set.seed(470) penguin_vfold <- vfold_cv(penguin_train, v = 3, strata = species) ``` ] .panel[.panel-name[k] ```r k_grid <- data.frame(neighbors = seq(1, 15, by = 4)) k_grid ``` ``` ## neighbors ## 1 1 ## 2 5 ## 3 9 ## 4 13 ``` ] .panel[.panel-name[tune wkflow] ```r penguin_knn_tune <- nearest_neighbor(neighbors = tune()) %>% set_engine("kknn") %>% set_mode("classification") penguin_knn_wflow_tune <- workflow() %>% add_model(penguin_knn_tune) %>% add_recipe(penguin_knn_recipe) ``` ] .panel[.panel-name[tuning] ```r penguin_knn_wflow_tune %>% tune_grid(resamples = penguin_vfold, grid = k_grid) %>% collect_metrics() %>% filter(.metric == "accuracy") ``` ``` ## # A tibble: 4 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 accuracy multiclass 0.97106 2 0.0059476 Preprocessor1_Model1 ## 2 5 accuracy multiclass 0.97688 2 0.00013365 Preprocessor1_Model2 ## 3 9 accuracy multiclass 
0.98844 2 0.000066827 Preprocessor1_Model3 ## 4 13 accuracy multiclass 0.98269 2 0.0056803 Preprocessor1_Model4 ``` ] ] --- ## We choose `\(k\)` = 9 ! ### 6. Validate the model .panelset[ .panel[.panel-name[recipe] ```r penguin_knn_recipe <- recipe(species ~ body_mass_g + island + bill_length_mm + bill_depth_mm + flipper_length_mm, data = penguin_train) %>% update_role(island, new_role = "id variable") %>% step_normalize(all_predictors()) summary(penguin_knn_recipe) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 body_mass_g numeric predictor original ## 2 island nominal id variable original ## 3 bill_length_mm numeric predictor original ## 4 bill_depth_mm numeric predictor original ## 5 flipper_length_mm numeric predictor original ## 6 species nominal outcome original ``` ] .panel[.panel-name[model] ```r penguin_knn_final <- nearest_neighbor(neighbors = 9) %>% set_engine("kknn") %>% set_mode("classification") penguin_knn_final ``` ``` ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 9 ## ## Computational engine: kknn ``` ] .panel[.panel-name[workflow] ```r penguin_knn_wflow_final <- workflow() %>% add_model(penguin_knn_final) %>% add_recipe(penguin_knn_recipe) penguin_knn_wflow_final ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 1 Recipe Step ## ## • step_normalize() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 9 ## ## Computational engine: kknn ``` ] .panel[.panel-name[fit] ```r penguin_knn_fit_final <- penguin_knn_wflow_final %>% fit(data = penguin_train) ``` ] .panel[.panel-name[predict] ```r penguin_knn_fit_final %>% predict(new_data = penguin_test) 
%>% cbind(penguin_test) %>% metrics(truth = species, estimate = .pred_class) %>% filter(.metric == "accuracy")
```

```
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass   0.97674
```

]
]

---

## We choose `\(k\)` = 9 !

### 6. Validate the model

Huh. Seems like `\(k = 9\)` didn't do as well as `\(k = 7\)` (the value we tried at the very beginning, before cross validating). Well, it turns out, that's the nature of variability, randomness, and model building. We don't know the truth, and we won't ever find a perfect model.

---

## Bias-Variance Tradeoff

<div class="figure" style="text-align: center">
<img src="../images/varbias.png" alt="Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR" width="90%" />
<p class="caption">Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild1.png" alt="Image credit: https://www.tmwr.org/" width="2176" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild2.png" alt="Image credit: https://www.tmwr.org/" width="2067" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure" style="text-align: center">
<img src="../images/modelbuild3.png" alt="Image credit: https://www.tmwr.org/" width="70%" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>
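
---

## `\(k\)`-NN by hand

The classification rule from earlier (Euclidean distance, majority vote) can be sketched in a few lines of base R. This is an illustration only; `knn_one` is a made-up helper, not what the **kknn** engine does internally:

```r
# classify one new point by majority vote of its k nearest training points
knn_one <- function(new_x, train_x, train_y, k = 7) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # labels of the k closest training points
  nbrs <- train_y[order(dists)[1:k]]
  # majority vote
  names(which.max(table(nbrs)))
}
```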