class: right, top, my-title, title-slide

# Bagging and Random Forests

### Jo Hardin

### November 2 & 4, 2021

---

# Agenda 11/02/21

1. Redux - CART
2. bagging process
3. bagging error rate (OOB error)

---

## `tidymodels` syntax

1. partition the data
2. build a recipe
3. select a model
4. create a workflow
5. fit the model
6. validate the model

---

## Random Forests

> Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees -- [Wikipedia](http://en.wikipedia.org/wiki/Random_forest)

---

## Why a forest instead of a tree?

* Trees suffer from very **high variance**
* Two datasets will produce extremely different trees
* To reduce variability, fit many trees and aggregate the predictions.
* No longer reporting the tree structure, now reporting the predicted values (with the hope that predictions from one dataset will be similar to predictions from another dataset).

---

## Bagging

Bagging = **B**ootstrap **Agg**regating.

* multiple models aggregated together produce a smoother model fit and a better balance between
   - bias in the fit
   - variance in the fit.

Bagging can be applied to any classifier to reduce variability.

<p style = "color:purple">Recall that the variance of the sample mean is the variance of the data divided by n. So we have already seen the idea that averaging an outcome reduces variability.</p>

---

## Bagging Models

<div class="figure" style="text-align: center">
<img src="../images/bagging.png" alt="Image credit: https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting" width="70%" />
<p class="caption">Image credit: https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting</p>
</div>

---

## Bagging Trees

<div class="figure" style="text-align: center">
<img src="../images/baggedtree.png" alt="Image credit: https://www.analyticsvidhya.com/blog/2021/05/bagging-25-questions-to-test-your-skills-on-random-forest-algorithm/" width="90%" />
<p class="caption">Image credit: https://www.analyticsvidhya.com/blog/2021/05/bagging-25-questions-to-test-your-skills-on-random-forest-algorithm/</p>
</div>

---

******
**Algorithm**: Bagging Forest

******
1. Resample (bootstrap) *cases* (observational units, not variables).
2. Build a tree on each new set of (bootstrapped) training observations.
3. Average (regression) or take a majority vote (classification) across the trees.
4. Note that for every bootstrap sample, approximately 2/3 of the observations will be chosen and 1/3 of them will not be chosen.

******

---

### Observations not in a given tree

P(observation `\(i\)` is not in the bootstrap sample) = `\(\bigg(1 - \frac{1}{n} \bigg)^n\)`

`$$\lim_{n \rightarrow \infty} \bigg(1 - \frac{1}{n} \bigg)^n = \frac{1}{e} \approx \frac{1}{3}$$`

---

## OOB Error

**Out of Bag** error: with bagging, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, **for free**!

* Each tree is constructed using a different bootstrap sample.
* Put each case left out of the construction of the `\(b^{th}\)` tree down the `\(b^{th}\)` tree to get a classification.
* At the end of the run,
   - take `\(j\)` to be the class that got most of the votes every time case `\(i\)` was oob.
   - The proportion of times that `\(j\)` is not equal to the true class of case `\(i\)`, averaged over all cases, is the **oob error**.
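---

## Bagging & OOB error: a code sketch

To make the algorithm concrete, here is a minimal bagging sketch in base R using **rpart** (not the `tidymodels` code used later in these slides). It assumes the `penguin_train` data from the upcoming example; the number of trees, the seed, and the variable names are illustrative only, and `rpart`'s default stopping rules are used rather than fully grown trees.

```r
library(rpart)

set.seed(47)
B <- 100                              # number of bootstrap samples / trees
n <- nrow(penguin_train)
pred_sum   <- rep(0, n)               # running sum of OOB predictions per case
pred_count <- rep(0, n)               # number of times each case was OOB

for (b in 1:B) {
  boot_id <- sample(1:n, size = n, replace = TRUE)   # bootstrap the cases
  oob_id  <- setdiff(1:n, boot_id)                   # ~1/3 of cases left out
  tree_b  <- rpart(body_mass_g ~ ., data = penguin_train[boot_id, ])
  pred_b  <- predict(tree_b, newdata = penguin_train[oob_id, ])
  pred_sum[oob_id]   <- pred_sum[oob_id] + pred_b
  pred_count[oob_id] <- pred_count[oob_id] + 1
}

# average OOB prediction per case; na.rm guards against the (rare) case
# that an observation was never OOB or has a missing response
oob_pred <- pred_sum / pred_count
mean((penguin_train$body_mass_g - oob_pred)^2, na.rm = TRUE)   # OOB MSE
```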
---

Let the OOB prediction for the `\(i^{th}\)` observation be `\(\hat{y}_{(-i)}\)`

`\begin{eqnarray*}
\mbox{OOB}_{\mbox{error}} &=& \frac{1}{n} \sum_{i=1}^n \textrm{I} (y_i \ne \hat{y}_{(-i)}) \ \ \ \ \ \ \ \ \mbox{classification}\\
\mbox{OOB}_{\mbox{error}} &=& \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_{(-i)})^2 \ \ \ \ \ \ \ \ \mbox{regression}\\
\end{eqnarray*}`

| obs | tree1 | tree2 | tree3 | tree4 | `\(\cdots\)` | tree100 | average |
|:---:|:-----:|:-----:|:-----:|:-----:|:--------:|:-------:|:---------------:|
| 1 |   | X | X |   |   |   | `\(\sum(pred)/38\)` |
| 2 | X |   |   |   |   |   | `\(\sum(pred)/30\)` |
| 3 |   |   |   | X |   | X | `\(\sum(pred)/33\)` |
| 4 | X |   |   |   |   |   | `\(\sum(pred)/32\)` |
| 5 | X |   |   |   |   |   | `\(\sum(pred)/39\)` |
| 6 |   |   | X |   |   | X | `\(\sum(pred)/29\)` |
| 7 |   |   |   |   |   | X | `\(\sum(pred)/29\)` |
| 8 |   |   | X | X |   | X | `\(\sum(pred)/31\)` |
| 9 |   |   |   | X |   |   | `\(\sum(pred)/36\)` |

An X marks a tree for which that observation was out of bag; the average column sums that observation's OOB predictions and divides by the number of trees (out of 100) for which it was OOB.

---

## Bagging example w defaults

.panelset[

.panel[.panel-name[recipe]

```r
penguin_bag_recipe <-
  recipe(body_mass_g ~ . ,
         data = penguin_train) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_mutate(year = as.factor(year)) %>%
  step_naomit(all_predictors(), body_mass_g)

summary(penguin_bag_recipe)
```

```
## # A tibble: 8 × 4
##   variable          type    role      source  
##   <chr>             <chr>   <chr>     <chr>   
## 1 species           nominal predictor original
## 2 island            nominal predictor original
## 3 bill_length_mm    numeric predictor original
## 4 bill_depth_mm     numeric predictor original
## 5 flipper_length_mm numeric predictor original
## 6 sex               nominal predictor original
## 7 year              numeric predictor original
## 8 body_mass_g       numeric outcome   original
```
]

.panel[.panel-name[model]

```r
num_trees <- 500

penguin_bag <-
  rand_forest(mtry = 7,
              trees = num_trees) %>%
  set_engine("randomForest", oob.error = TRUE) %>%
  set_mode("regression")

penguin_bag
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 7
##   trees = num_trees
## 
## Engine-Specific Arguments:
##   oob.error = TRUE
## 
## Computational engine: randomForest
```
]

.panel[.panel-name[workflow]

```r
penguin_bag_wflow <-
  workflow() %>%
  add_model(penguin_bag) %>%
  add_recipe(penguin_bag_recipe)

penguin_bag_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## • step_naomit()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 7
##   trees = num_trees
## 
## Engine-Specific Arguments:
##   oob.error = TRUE
## 
## Computational engine: randomForest
```
]

.panel[.panel-name[fit]

.pull-left[

```r
penguin_bag_fit <-
  penguin_bag_wflow %>%
  fit(data = penguin_train)

penguin_bag_fit
```
]

.pull-right[

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## • step_naomit()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~num_trees, mtry = min_cols(~7, x), oob.error = ~TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 92247
##                     % Var explained: 85.06
```
]
]
]

---

#### Fit again

Refitting the same workflow draws new bootstrap samples, so the OOB error changes slightly:

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## • step_naomit()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~num_trees, mtry = min_cols(~7, x), oob.error = ~TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 92406
##                     % Var explained: 85.03
```

---

## Plotting the OOB (not tidy)

.panelset[

.panel[.panel-name[plot 1]

.pull-left[

```r
rsq <- penguin_bag_fit %>%
  extract_fit_parsnip() %>%
  pluck("fit") %>%
  pluck("rsq")

data.frame(trees = 1:num_trees, rsq = rsq) %>%
  ggplot() +
  geom_line(aes(x = trees, y = rsq)) +
  ylab("") + xlab("Number of Trees") +
  ggtitle("OOB R-squared")
```
]

.pull-right[
![](2021-11-02-rf_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]
]

.panel[.panel-name[plot 2]

.pull-left[

```r
MSE <- penguin_bag_fit %>%
  extract_fit_parsnip() %>%
  pluck("fit") %>%
  pluck("mse")

data.frame(trees = 1:num_trees, MSE = MSE) %>%
  ggplot() +
  geom_line(aes(x = trees, y = MSE)) +
  ylab("") + xlab("Number of Trees") +
  ggtitle("OOB MSE")
```
]

.pull-right[
![](2021-11-02-rf_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]
]
]

---

# Agenda 11/04/21

1. Random Forests
2. Example

---

## Random Forests

* Many trees
   - bootstrap the data before building each tree
   - for the best split, look only at a random set of `\(m\)` variables (a different random set of `\(m\)` variables at each split)
* Tuning: look at the OOB predictions -- either average or majority vote
* Testing: predictions on test data are the average (regression) or majority vote (classification) across all the trees.

---

******
**Algorithm**: Random Forest

******
1. Bootstrap sample from the training set.
2. Grow an un-pruned tree on the bootstrap sample.
   * *At each split*, select `\(m\)` variables and determine the best split using only those `\(m\)` predictors.
   * Do *not* prune the tree. Save the tree as is!
3. Repeat steps 1-2 for many, many trees.
4. For each tree grown on a bootstrap sample, predict the OOB samples. For each tree grown, `\(\sim 1/3\)` of the training samples won't be in the bootstrap sample -- those are called the out-of-bag (OOB) samples. The OOB samples can be used as *test* data to estimate the error rate of the tree.
5. Combine the OOB predictions to create the "out-of-bag" error rate (either majority vote or average of predictions / class probabilities).
6. All of the trees together represent the *model* that is used for new predictions (either majority vote or average).

******

---

## Variable Importance

> importance = decrease in node impurity resulting from splits over that variable, averaged over all trees.

A second, related measure -- **permutation importance** -- is computed as follows:

* For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression).
* **Permute** the `\(j^{th}\)` variable and recalculate the OOB prediction error.
* The difference between the two OOB errors is averaged over all trees (for the `\(j^{th}\)` variable) and normalized to give the importance of the `\(j^{th}\)` variable.
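---

## Permutation importance: a sketch

As a rough illustration of the permutation idea (not the exact OOB computation the engines use), the sketch below permutes one predictor in a held-out set and compares the RMSE before and after. It assumes `penguin_train` / `penguin_test` from these slides and calls the **ranger** package directly; the permuted variable (`flipper_length_mm`), the seed, and the forest size are arbitrary choices.

```r
library(ranger)

set.seed(4747)
# fit a forest directly with ranger; drop rows with missing values for simplicity
peng   <- na.omit(penguin_train)
rf_fit <- ranger(body_mass_g ~ ., data = peng, num.trees = 500,
                 importance = "permutation")

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

test      <- na.omit(penguin_test)
base_rmse <- rmse(test$body_mass_g, predict(rf_fit, data = test)$predictions)

# permute one predictor and re-predict: the bigger the jump in RMSE,
# the more the forest relied on that variable
test_perm <- test
test_perm$flipper_length_mm <- sample(test_perm$flipper_length_mm)
perm_rmse <- rmse(test$body_mass_g, predict(rf_fit, data = test_perm)$predictions)

perm_rmse - base_rmse            # importance-style score for flipper_length_mm

# ranger's built-in OOB-based version (all variables at once)
sort(importance(rf_fit), decreasing = TRUE)
```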
---

## RF technical details

* unfortunately, **tidymodels** currently does not have an OOB option for model tuning (e.g., to find `mtry`)
* unlike with `num_trees`, the effect of `mtry` is not computed "as you go along" -- each value of `mtry` requires a new fit
* cross-validation will be used to find the optimal value of `mtry`

---

## RF example w CV tuning

.panelset[

.panel[.panel-name[recipe]

```r
penguin_rf_recipe <-
  recipe(body_mass_g ~ . ,
         data = penguin_train) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_mutate(year = as.factor(year))

summary(penguin_rf_recipe)
```

```
## # A tibble: 8 × 4
##   variable          type    role      source  
##   <chr>             <chr>   <chr>     <chr>   
## 1 species           nominal predictor original
## 2 island            nominal predictor original
## 3 bill_length_mm    numeric predictor original
## 4 bill_depth_mm     numeric predictor original
## 5 flipper_length_mm numeric predictor original
## 6 sex               nominal predictor original
## 7 year              numeric predictor original
## 8 body_mass_g       numeric outcome   original
```
]

.panel[.panel-name[model]

```r
penguin_rf <-
  rand_forest(mtry = tune(),
              trees = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")

penguin_rf
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger
```
]

.panel[.panel-name[workflow]

```r
penguin_rf_wflow <-
  workflow() %>%
  add_model(penguin_rf) %>%
  add_recipe(penguin_rf_recipe)

penguin_rf_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger
```
]

.panel[.panel-name[CV]

```r
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 4)
```
]

.panel[.panel-name[param]

```r
penguin_grid <- grid_regular(mtry(range = c(2, 7)),
                             trees(range = c(1, 500)),
                             levels = 5)

penguin_grid
```

```
## # A tibble: 25 × 2
##     mtry trees
##    <int> <int>
##  1     2     1
##  2     3     1
##  3     4     1
##  4     5     1
##  5     7     1
##  6     2   125
##  7     3   125
##  8     4   125
##  9     5   125
## 10     7   125
## # … with 15 more rows
```
]

.panel[.panel-name[tune]

```r
penguin_rf_tune <-
  penguin_rf_wflow %>%
  tune_grid(resamples = penguin_folds,
            grid = penguin_grid)

penguin_rf_tune
```

```
## # Tuning results
## # 4-fold cross-validation 
## # A tibble: 4 × 4
##   splits           id    .metrics          .notes          
##   <list>           <chr> <list>            <list>          
## 1 <split [186/63]> Fold1 <tibble [50 × 6]> <tibble [0 × 1]>
## 2 <split [187/62]> Fold2 <tibble [50 × 6]> <tibble [0 × 1]>
## 3 <split [187/62]> Fold3 <tibble [50 × 6]> <tibble [0 × 1]>
## 4 <split [187/62]> Fold4 <tibble [50 × 6]> <tibble [0 × 1]>
```
]
]

---

## RF model output

```r
penguin_rf_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot() +
  geom_line(aes(x = trees, y = mean, color = as.factor(mtry)))
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-23-1.png)<!-- -->

---

## RF model output - take two

```r
penguin_rf_tune %>%
  autoplot()
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

## RF Final model

```r
penguin_rf_best <- finalize_model(
  penguin_rf,
  select_best(penguin_rf_tune, "rmse"))

penguin_rf_best
```

```
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 2
##   trees = 375
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger
```

```r
penguin_rf_final <-
  workflow() %>%
  add_model(penguin_rf_best) %>%
  add_recipe(penguin_rf_recipe) %>%
  fit(data = penguin_train)
```

---

## RF Final model

```r
penguin_rf_final
```

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_unknown()
## • step_mutate()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), num.trees = ~375L, importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  375 
## Sample size:                      249 
## Number of independent variables:  7 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       84149 
## R squared (OOB):                  0.86346
```

---

# VIP

```r
library(vip)

penguin_rf_final %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

## Test predictions

```r
penguin_rf_final %>%
  predict(new_data = penguin_test) %>%
  cbind(penguin_test) %>%
  ggplot() +
  geom_point(aes(x = body_mass_g, y = .pred)) +
  geom_abline(intercept = 0, slope = 1)
```

![](2021-11-02-rf_files/figure-html/unnamed-chunk-28-1.png)<!-- -->
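---

## Test metrics

To put numbers on the plot above, the predictions can be passed to `yardstick::metrics()`. This is a short sketch, not part of the original example; it reuses `penguin_rf_final` and `penguin_test` from the previous slides and assumes **tidymodels** has been loaded, as elsewhere in this deck.

```r
# test-set performance of the finalized random forest
penguin_rf_final %>%
  predict(new_data = penguin_test) %>%
  bind_cols(penguin_test) %>%
  metrics(truth = body_mass_g, estimate = .pred)
# returns a tibble with rmse, rsq, and mae on the test data
```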
---

## Model choices

| `\(\mbox{ }\)` | `\(\mbox{ }\)` |
|--------------------------------------------- |------------------------------- |
| * explanatory variable choice | * `\(k\)` ($k$-NN) |
| * number of explanatory variables | * distance measure |
| * functions/transformation of explanatory | * V (CV) |
| * transformation of response | * CV `set.seed` |
| * response: continuous vs. categorical | * `\(\alpha\)` prune |
| * how missing data is dealt with | * maxdepth prune |
| * train/test split (`set.seed`) | * prune or not |
| * train/test proportion | * gini / entropy (split) |
| * type of classification model | * \# trees / \# BS samples |
| * use of cost complexity / parameter | * grid search etc. for tuning |
| * majority / average prob (tree error rate) | * value(s) of `mtry` |
| * accuracy vs sensitivity vs specificity | * OOB vs CV for tuning |

---

## Bias-Variance Tradeoff

<div class="figure" style="text-align: center">
<img src="../images/varbias.png" alt="Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR" width="90%" />
<p class="caption">Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild1.png" alt="Image credit: https://www.tmwr.org/" width="2176" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure">
<img src="../images/modelbuild2.png" alt="Image credit: https://www.tmwr.org/" width="2067" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>

---

## Reflecting on Model Building

<div class="figure" style="text-align: center">
<img src="../images/modelbuild3.png" alt="Image credit: https://www.tmwr.org/" width="70%" />
<p class="caption">Image credit: https://www.tmwr.org/</p>
</div>