class: right, top, my-title, title-slide

# Bagging and Random Forests

### Jo Hardin

### November 2 & 4, 2021

---

# Agenda 11/02/21

1. Redux - CART
2. bagging process
3. bagging error rate (OOB error)

---

## `tidymodels` syntax

1. partition the data
2. build a recipe
3. select a model
4. create a workflow
5. fit the model
6. validate the model

---

## Random Forests

> Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees -- [Wikipedia](http://en.wikipedia.org/wiki/Random_forest)

---

## Why a forest instead of a tree?

* Trees suffer from very **high variance**
* Two datasets will produce extremely different trees
* To reduce variability, fit many trees and aggregate the predictions.
* No longer reporting the tree structure, now reporting the predicted values (with the hope that predictions from one dataset will be similar to predictions from another dataset).

---

## Bagging

Bagging = **B**ootstrap **Agg**regating.

* multiple models aggregated together produce a smoother model fit and a better balance between
   - bias in the fit
   - variance in the fit.

Bagging can be applied to any classifier to reduce variability.

<p style = "color:purple">Recall that the variance of the sample mean is the variance of the data divided by n. So we have already seen the idea that averaging an outcome reduces variability.</p>

---

## Bagging Models

<div class="figure" style="text-align: center">
<img src="../images/bagging.png" alt="Image credit: https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting" width="70%" />
<p class="caption">Image credit: https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting</p>
</div>

---

## Bagging Trees

<div class="figure" style="text-align: center">
<img src="../images/baggedtree.png" alt="Image credit: https://www.analyticsvidhya.com/blog/2021/05/bagging-25-questions-to-test-your-skills-on-random-forest-algorithm/" width="90%" />
<p class="caption">Image credit: https://www.analyticsvidhya.com/blog/2021/05/bagging-25-questions-to-test-your-skills-on-random-forest-algorithm/</p>
</div>

---

******
**Algorithm**: Bagging Forest

******
1. Resample (bootstrap) *cases* (observational units, not variables).
2. Build a tree on each new set of (bootstrapped) training observations.
3. Average (regression) or take a majority vote (classification) across the trees.
4. Note that for every bootstrap sample, approximately 2/3 of the observations will be chosen and 1/3 of them will not be chosen.

******

---

### Observations not in a given tree

P(observation `\(i\)` is not in the bootstrap sample) = `\(\bigg(1 - \frac{1}{n} \bigg)^n\)`

`$$\lim_{n \rightarrow \infty} \bigg(1 - \frac{1}{n} \bigg)^n = \frac{1}{e} \approx \frac{1}{3}$$`

---

## OOB Error

**Out of Bag** error: with bagging, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, **for free**!

* Each tree is constructed using a different bootstrap sample.
* Put each case left out of the construction of the `\(b^{th}\)` tree down the `\(b^{th}\)` tree to get a classification.
* At the end of the run,
   - take `\(j\)` to be the class that got most of the votes every time case `\(i\)` was oob.
   - The proportion of times that `\(j\)` is not equal to the true class of case `\(i\)`, averaged over all cases, is the **oob error**.
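---

## Bagging & OOB error: a code sketch

To make the algorithm concrete, here is a minimal bagging sketch in base R using **rpart** (not the `tidymodels` code used later in these slides). It assumes the `penguin_train` data from the upcoming example; the number of trees, the seed, and the variable names are illustrative only, and `rpart`'s default stopping rules are used rather than fully grown trees.

```r
library(rpart)

set.seed(47)
B <- 100                              # number of bootstrap samples / trees
n <- nrow(penguin_train)
pred_sum   <- rep(0, n)               # running sum of OOB predictions per case
pred_count <- rep(0, n)               # number of times each case was OOB

for (b in 1:B) {
  boot_id <- sample(1:n, size = n, replace = TRUE)   # bootstrap the cases
  oob_id  <- setdiff(1:n, boot_id)                   # ~1/3 of cases left out
  tree_b  <- rpart(body_mass_g ~ ., data = penguin_train[boot_id, ])
  pred_b  <- predict(tree_b, newdata = penguin_train[oob_id, ])
  pred_sum[oob_id]   <- pred_sum[oob_id] + pred_b
  pred_count[oob_id] <- pred_count[oob_id] + 1
}

# average OOB prediction per case; na.rm guards against the (rare) case
# that an observation was never OOB or has a missing response
oob_pred <- pred_sum / pred_count
mean((penguin_train$body_mass_g - oob_pred)^2, na.rm = TRUE)   # OOB MSE
```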
---

Let the OOB prediction for the `\(i^{th}\)` observation be `\(\hat{y}_{(-i)}\)`

`\begin{eqnarray*}
\mbox{OOB}_{\mbox{error}} &=& \frac{1}{n} \sum_{i=1}^n \textrm{I} (y_i \ne \hat{y}_{(-i)}) \ \ \ \ \ \ \ \ \mbox{classification}\\
\mbox{OOB}_{\mbox{error}} &=& \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_{(-i)})^2 \ \ \ \ \ \ \ \ \mbox{regression}\\
\end{eqnarray*}`

| obs | tree1 | tree2 | tree3 | tree4 | `\(\cdots\)` | tree100 | average |
|:---:|:-----:|:-----:|:-----:|:-----:|:--------:|:-------:|:---------------:|
| 1 |   | X | X |   |   |   | `\(\sum(pred)/38\)` |
| 2 | X |   |   |   |   |   | `\(\sum(pred)/30\)` |
| 3 |   |   |   | X |   | X | `\(\sum(pred)/33\)` |
| 4 | X |   |   |   |   |   | `\(\sum(pred)/32\)` |
| 5 | X |   |   |   |   |   | `\(\sum(pred)/39\)` |
| 6 |   |   | X |   |   | X | `\(\sum(pred)/29\)` |
| 7 |   |   |   |   |   | X | `\(\sum(pred)/29\)` |
| 8 |   |   | X | X |   | X | `\(\sum(pred)/31\)` |
| 9 |   |   |   | X |   |   | `\(\sum(pred)/36\)` |

An X marks a tree for which that observation was out of bag; the average column sums that observation's OOB predictions and divides by the number of trees (out of 100) for which it was OOB.

---

## Bagging example w defaults

.panelset[

.panel[.panel-name[recipe]

```r
penguin_bag_recipe <-
  recipe(body_mass_g ~ . ,
         data = penguin_train) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_mutate(year = as.factor(year)) %>%
  step_naomit(all_predictors(), body_mass_g)

summary(penguin_bag_recipe)
```

```
## # A tibble: 8 × 4
##   variable          type    role      source  
##   <chr>             <chr>   <chr>     <chr>   
## 1 species           nominal predictor original
## 2 island            nominal predictor original
## 3 bill_length_mm    numeric predictor original
## 4 bill_depth_mm     numeric predictor original
## 5 flipper_length_mm numeric predictor original
## 6 sex               nominal predictor original
## 7 year              numeric predictor original
## 8 body_mass_g       numeric outcome   original
```
]

.panel[.panel-name[model]

```r
num_trees <- 500

penguin_bag <-
  rand_forest(mtry = 7,
              trees = num_trees) %>%
  set_engine("randomForest", oob.error = TRUE) %>%
  set_mode("regression")

penguin_bag
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 7
##   trees = num_trees
## 
## Engine-Specific Arguments:
##   oob.error = TRUE
## 
## Computational engine: randomForest
```
]

.panel[.panel-name[workflow]

```r
penguin_bag_wflow <-
  workflow() %>%
  add_model(penguin_bag) %>%
  add_recipe(penguin_bag_recipe)

penguin_bag_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## • step_naomit()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 7
##   trees = num_trees
## 
## Engine-Specific Arguments:
##   oob.error = TRUE
## 
## Computational engine: randomForest
```
]

.panel[.panel-name[fit]

.pull-left[

```r
penguin_bag_fit <-
  penguin_bag_wflow %>%
  fit(data = penguin_train)

penguin_bag_fit
```
]

.pull-right[

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## • step_naomit()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~num_trees, mtry = min_cols(~7, x), oob.error = ~TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 92247
##                     % Var explained: 85.06
```
]
]
]

---

#### Fit again

Refitting the same workflow draws new bootstrap samples, so the OOB error changes slightly:

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## • step_naomit()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~num_trees, mtry = min_cols(~7, x), oob.error = ~TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 92406
##                     % Var explained: 85.03
```

---

## Plotting the OOB (not tidy)

.panelset[

.panel[.panel-name[plot 1]

.pull-left[

```r
rsq <- penguin_bag_fit %>%
  extract_fit_parsnip() %>%
  pluck("fit") %>%
  pluck("rsq")

data.frame(trees = 1:num_trees, rsq = rsq) %>%
  ggplot() +
  geom_line(aes(x = trees, y = rsq)) +
  ylab("") + xlab("Number of Trees") +
  ggtitle("OOB R-squared")
```
]

.pull-right[
![](2021-11-02-rf_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]
]

.panel[.panel-name[plot 2]

.pull-left[

```r
MSE <- penguin_bag_fit %>%
  extract_fit_parsnip() %>%
  pluck("fit") %>%
  pluck("mse")

data.frame(trees = 1:num_trees, MSE = MSE) %>%
  ggplot() +
  geom_line(aes(x = trees, y = MSE)) +
  ylab("") + xlab("Number of Trees") +
  ggtitle("OOB MSE")
```
]

.pull-right[
![](2021-11-02-rf_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]
]
]

---

# Agenda 11/04/21

1. Random Forests
2. Example

---

## Random Forests

* Many trees
   - bootstrap the data before building each tree
   - for the best split, look only at a random set of `\(m\)` variables (a different random set of `\(m\)` variables at each split)
* Tuning: look at the OOB predictions -- either average or majority vote
* Testing: predictions on test data are the average (regression) or majority vote (classification) across all the trees.

---

******
**Algorithm**: Random Forest

******
1. Bootstrap sample from the training set.
2. Grow an un-pruned tree on the bootstrap sample.
   * *At each split*, select `\(m\)` variables and determine the best split using only those `\(m\)` predictors.
   * Do *not* prune the tree. Save the tree as is!
3. Repeat steps 1-2 for many, many trees.
4. For each tree grown on a bootstrap sample, predict the OOB samples. For each tree grown, `\(\sim 1/3\)` of the training samples won't be in the bootstrap sample -- those are called the out-of-bag (OOB) samples. The OOB samples can be used as *test* data to estimate the error rate of the tree.
5. Combine the OOB predictions to create the "out-of-bag" error rate (either majority vote or average of predictions / class probabilities).
6. All of the trees together represent the *model* that is used for new predictions (either majority vote or average).

******

---

## Variable Importance

> importance = decrease in node impurity resulting from splits over that variable, averaged over all trees.

A second, related measure -- **permutation importance** -- is computed as follows:

* For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression).
* **Permute** the `\(j^{th}\)` variable and recalculate the OOB prediction error.
* The difference between the two OOB errors is averaged over all trees (for the `\(j^{th}\)` variable) and normalized to give the importance of the `\(j^{th}\)` variable.
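---

## Permutation importance: a sketch

As a rough illustration of the permutation idea (not the exact OOB computation the engines use), the sketch below permutes one predictor in a held-out set and compares the RMSE before and after. It assumes `penguin_train` / `penguin_test` from these slides and calls the **ranger** package directly; the permuted variable (`flipper_length_mm`), the seed, and the forest size are arbitrary choices.

```r
library(ranger)

set.seed(4747)
# fit a forest directly with ranger; drop rows with missing values for simplicity
peng   <- na.omit(penguin_train)
rf_fit <- ranger(body_mass_g ~ ., data = peng, num.trees = 500,
                 importance = "permutation")

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

test      <- na.omit(penguin_test)
base_rmse <- rmse(test$body_mass_g, predict(rf_fit, data = test)$predictions)

# permute one predictor and re-predict: the bigger the jump in RMSE,
# the more the forest relied on that variable
test_perm <- test
test_perm$flipper_length_mm <- sample(test_perm$flipper_length_mm)
perm_rmse <- rmse(test$body_mass_g, predict(rf_fit, data = test_perm)$predictions)

perm_rmse - base_rmse            # importance-style score for flipper_length_mm

# ranger's built-in OOB-based version (all variables at once)
sort(importance(rf_fit), decreasing = TRUE)
```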
---

## RF technical details

* unfortunately, **tidymodels** currently does not have an OOB option for model tuning (e.g., to find `mtry`)
* unlike with `num_trees`, the effect of `mtry` is not computed "as you go along" -- each value of `mtry` requires a new fit
* cross-validation will be used to find the optimal value of `mtry`

---

## RF example w CV tuning

.panelset[

.panel[.panel-name[recipe]

```r
penguin_rf_recipe <-
  recipe(body_mass_g ~ . ,
         data = penguin_train) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_mutate(year = as.factor(year))

summary(penguin_rf_recipe)
```

```
## # A tibble: 8 × 4
##   variable          type    role      source  
##   <chr>             <chr>   <chr>     <chr>   
## 1 species           nominal predictor original
## 2 island            nominal predictor original
## 3 bill_length_mm    numeric predictor original
## 4 bill_depth_mm     numeric predictor original
## 5 flipper_length_mm numeric predictor original
## 6 sex               nominal predictor original
## 7 year              numeric predictor original
## 8 body_mass_g       numeric outcome   original
```
]

.panel[.panel-name[model]

```r
penguin_rf <-
  rand_forest(mtry = tune(),
              trees = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")

penguin_rf
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger
```
]

.panel[.panel-name[workflow]

```r
penguin_rf_wflow <-
  workflow() %>%
  add_model(penguin_rf) %>%
  add_recipe(penguin_rf_recipe)

penguin_rf_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger
```
]

.panel[.panel-name[CV]

```r
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 4)
```
]

.panel[.panel-name[param]

```r
penguin_grid <- grid_regular(mtry(range = c(2, 7)),
                             trees(range = c(1, 500)),
                             levels = 5)

penguin_grid
```

```
## # A tibble: 25 × 2
##     mtry trees
##    <int> <int>
##  1     2     1
##  2     3     1
##  3     4     1
##  4     5     1
##  5     7     1
##  6     2   125
##  7     3   125
##  8     4   125
##  9     5   125
## 10     7   125
## # … with 15 more rows
```
]

.panel[.panel-name[tune]

```r
penguin_rf_tune <-
  penguin_rf_wflow %>%
  tune_grid(resamples = penguin_folds,
            grid = penguin_grid)

penguin_rf_tune
```

```
## # Tuning results
## # 4-fold cross-validation 
## # A tibble: 4 × 4
##   splits           id    .metrics          .notes          
##   <list>           <chr> <list>            <list>          
## 1 <split [186/63]> Fold1 <tibble [50 × 6]> <tibble [0 × 1]>
## 2 <split [187/62]> Fold2 <tibble [50 × 6]> <tibble [0 × 1]>
## 3 <split [187/62]> Fold3 <tibble [50 × 6]> <tibble [0 × 1]>
## 4 <split [187/62]> Fold4 <tibble [50 × 6]> <tibble [0 × 1]>
```
]
]

---

## RF model output

```r
penguin_rf_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot() +
  geom_line(aes(x = trees, y = mean, color = as.factor(mtry)))
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-23-1.png)<!-- -->

---

## RF model output - take two

```r
penguin_rf_tune %>%
  autoplot()
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

## RF Final model

```r
penguin_rf_best <- finalize_model(
  penguin_rf,
  select_best(penguin_rf_tune, "rmse"))

penguin_rf_best
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 2
##   trees = 375
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger
```

```r
penguin_rf_final <-
  workflow() %>%
  add_model(penguin_rf_best) %>%
  add_recipe(penguin_rf_recipe) %>%
  fit(data = penguin_train)
```

---

## RF Final model

```r
penguin_rf_final
```

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), num.trees = ~375L, importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  375 
## Sample size:                      249 
## Number of independent variables:  7 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       84149 
## R squared (OOB):                  0.86346
```

---

# VIP

```r
library(vip)

penguin_rf_final %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

## Test predictions

```r
penguin_rf_final %>%
  predict(new_data = penguin_test) %>%
  cbind(penguin_test) %>%
  ggplot() +
  geom_point(aes(x = body_mass_g, y = .pred)) +
  geom_abline(intercept = 0, slope = 1)
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-28-1.png)<!-- -->
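---

## Test metrics

To put numbers on the plot above, the predictions can be passed to `yardstick::metrics()`. This is a short sketch, not part of the original example; it reuses `penguin_rf_final` and `penguin_test` from the previous slides and assumes **tidymodels** has been loaded, as elsewhere in this deck.

```r
# test-set performance of the finalized random forest
penguin_rf_final %>%
  predict(new_data = penguin_test) %>%
  bind_cols(penguin_test) %>%
  metrics(truth = body_mass_g, estimate = .pred)
# returns a tibble with rmse, rsq, and mae on the test data
```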
---

## Model choices

| `\(\mbox{ }\)` | `\(\mbox{ }\)` |
|--------------------------------------------- |------------------------------- |
| * explanatory variable choice | * `\(k\)` ($k$-NN) |
| * number of explanatory variables | * distance measure |
| * functions/transformation of explanatory | * V (CV) |
| * transformation of response | * CV `set.seed` |
| * response: continuous vs. categorical | * `\(\alpha\)` prune |
| * how missing data is dealt with | * maxdepth prune |
| * train/test split (`set.seed`) | * prune or not |
| * train/test proportion | * gini / entropy (split) |
| * type of classification model | * \# trees / \# BS samples |
| * use of cost complexity / parameter | * grid search etc. for tuning |
| * majority / average prob (tree error rate) | * value(s) of `mtry` |
| * accuracy vs sensitivity vs specificity | * OOB vs CV for tuning |

---

## Bias-Variance Tradeoff

<div class="figure" style="text-align: center">
<img src="../images/varbias.png" alt="Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR" width="90%" />
<p class="caption">Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild1.png" alt="Image credit: https://www.tmwr.org/" width="2176" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild2.png" alt="Image credit: https://www.tmwr.org/" width="2067" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure" style="text-align: center">
<img src="../images/modelbuild3.png" alt="Image credit: https://www.tmwr.org/" width="70%" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>