Decision Trees

October 27, 2025

Jo Hardin

Agenda 10/27/25

  1. Decision Trees

tidymodels syntax

  1. partition the data
  2. build a recipe
  3. select a model
  4. create a workflow
  5. fit the model
  6. validate the model
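
A minimal skeleton of those six steps in code (object names here are illustrative; the penguin example later in these slides fills in the details):

library(tidymodels)

# 1. partition the data (assuming the palmerpenguins data as an example)
set.seed(47)
penguin_split <- initial_split(palmerpenguins::penguins, prop = 0.75)
penguin_train <- training(penguin_split)
penguin_test  <- testing(penguin_split)

# 2. build a recipe
penguin_recipe <- recipe(species ~ ., data = penguin_train)

# 3. select a model
penguin_model <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

# 4. create a workflow
penguin_wflow <- workflow() |>
  add_model(penguin_model) |>
  add_recipe(penguin_recipe)

# 5. fit the model
penguin_fit <- fit(penguin_wflow, data = penguin_train)

# 6. validate the model (cross validation on the training data or, at the very end, the test set)
predict(penguin_fit, new_data = penguin_test)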

Decision trees in action

A visual introduction to machine learning by Yee and Chu.

Yee and Chu created a step-by-step build of a recursive binary tree to model the differences between homes in SF and homes in NYC. http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Classification and Regression Trees (CART)

Basic Classification and Regression Trees (CART) Algorithm:

  1. Start with all observations in one group.
  2. Find the variable/split that best separates the response variable (successive binary partitions based on the different predictors / explanatory variables).
    • Evaluate “homogeneity” within each group.
    • Divide the data into two groups (“leaves”) on that split (“node”).
    • Within each split, find the best variable/split that separates the outcomes.
  3. Continue until the groups are too small or sufficiently “pure”.
  4. Prune tree.

Minimize heterogeneity

For every observation that falls into the region \(R_m\),

prediction = the mean of the response values for observations in \(R_m\).

\(\Rightarrow\) Minimize Residual Sum of Squares (RSS): \[RSS = \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \overline{y}_{R_m})^2\] where \(\overline{y}_{R_m}\) is the mean response for observations within the \(m\)th region.
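
The RSS of a given partition can be computed directly; a minimal sketch assuming a data frame dat with a numeric response y and a column region identifying each observation's terminal region (all three names are hypothetical):

library(dplyr)

dat |>
  group_by(region) |>                          # one group per region R_m
  summarize(rss_m = sum((y - mean(y))^2)) |>   # within-region squared error
  summarize(RSS = sum(rss_m))                  # add up over the regions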

Recursive binary splitting

Select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X | X_j < s\}\) and \(\{X | X_j \geq s\}\) leads to the greatest reduction in RSS.

For any \(j\) and \(s\), define the pair of half-planes to be \[R_1(j,s) = \{X | X_j < s\} \mbox{ and } R_2(j,s) = \{X | X_j \geq s\}\] Find the values of \(j\) and \(s\) that minimize \[\sum_{i:x_i \in R_1(j,s)} (y_i - \overline{y}_{R_1})^2 + \sum_{i:x_i \in R_2(j,s)} (y_i - \overline{y}_{R_2})^2\]

where \(\overline{y}_{R_1}\) is the mean response for observations in \(R_1(j,s)\) and \(\overline{y}_{R_2}\) is the mean response for observations in \(R_2(j,s)\).
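
A brute-force illustration of a single split: scan the observed cutpoints for one numeric predictor and keep the one with the smallest two-region RSS. (The function below is my own sketch; CART repeats this search over every predictor at every node.)

best_split <- function(x, y) {
  rss_for_cut <- function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }
  cuts <- sort(unique(x))[-1]   # drop the smallest value so both regions are nonempty
  rss  <- vapply(cuts, rss_for_cut, numeric(1))
  list(cutpoint = cuts[which.min(rss)], rss = min(rss))
}

# e.g., best split of body_mass_g on flipper_length_mm (complete cases only):
# with(na.omit(penguin_train), best_split(flipper_length_mm, body_mass_g))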

Trees in action

Measures of impurity

\(\hat{p}_{mk}\) = proportion of observations in the \(m\)th region from the \(k\)th class.

  • classification error rate = fraction of observations in the node that are not in the most common class:

\[E_m = 1 - \max_k(\hat{p}_{mk})\]

  • Gini index \[G_m= \sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk})\]

  • cross-entropy \[D_m = - \sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}\] (Gini index & cross-entropy will both take on a value near zero if the \(\hat{p}_{mk}\) values are all near zero or all near one.)
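
Each impurity measure is a simple function of the node's class proportions; a quick sketch (the function names are mine):

# p = vector of class proportions within one node (sums to 1)
class_error   <- function(p) 1 - max(p)
gini_index    <- function(p) sum(p * (1 - p))
cross_entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # treat 0 * log(0) as 0

# a nearly pure node scores low on all three measures
gini_index(c(0.98, 0.01, 0.01))   # ~ 0.04
gini_index(c(1/3, 1/3, 1/3))      # ~ 0.67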

Recursive binary splitting

For any \(j\) and \(s\), define the pair of half-planes to be \[R_1(j,s) = \{X | X_j < s\} \mbox{ and } R_2(j,s) = \{X | X_j \geq s\}\]

Seek the value of \(j\) and \(s\) that minimize the equation: \[\begin{align} & \sum_{i:x_i \in R_1(j,s)} \sum_{k=1}^K \hat{p}_{{R_1}k}(1-\hat{p}_{{R_1}k}) + \sum_{i:x_i \in R_2(j,s)} \sum_{k=1}^K \hat{p}_{{R_2}k}(1-\hat{p}_{{R_2}k})\\ \\ \mbox{equivalently: } & \\ & n_{R_1} \sum_{k=1}^K \hat{p}_{{R_1}k}(1-\hat{p}_{{R_1}k}) + n_{R_2} \sum_{k=1}^K \hat{p}_{{R_2}k}(1-\hat{p}_{{R_2}k})\\ \end{align}\]
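
The criterion above is just each leaf's size times its Gini index; a sketch using the gini_index() helper defined earlier (x = a numeric predictor, cls = the class labels, both hypothetical):

split_gini <- function(x, cls, s) {
  left  <- cls[x <  s]
  right <- cls[x >= s]
  length(left)  * gini_index(prop.table(table(left))) +
    length(right) * gini_index(prop.table(table(right)))
}
# the best split minimizes split_gini() over all predictors x and cutpoints s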

Stopping

We can always make the tree more “pure” by continuing to split.

Too many splits will overfit the model to the training data!

Ways to control:

  • cost_complexity
  • tree_depth
  • min_n

Overfitting: http://www.r2d3.us/visual-intro-to-machine-learning-part-2/
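
These three controls correspond to arguments of parsnip's decision_tree(); a sketch with illustrative (untuned) values:

decision_tree(
  cost_complexity = 0.01,  # penalty paid for each additional terminal node
  tree_depth = 5,          # maximum depth of any path from the root
  min_n = 10               # smallest node that is allowed to be split further
) |>
  set_engine("rpart") |>
  set_mode("classification")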

Agenda 10/29/25

  1. pruning trees
  2. cross validation to find \(\alpha\)

Cost complexity

There is a cost to having a larger (more complex!) tree.

Define the cost complexity criterion, \(\alpha > 0:\) \[\begin{align} \mbox{numerical: } C_\alpha(T) &= \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \overline{y}_{R_m})^2 + \alpha \cdot |T|\\ \mbox{categorical: } C_\alpha(T) &= \sum_{m=1}^{|T|} \sum_{i \in R_m} I(y_i \ne k(m)) + \alpha \cdot |T| \end{align}\] where \(k(m)\) is the class with the majority of observations in node \(m\) and \(|T|\) is the number of terminal nodes in the tree.

  • \(\alpha\) small: If \(\alpha\) is set to be small, we are saying that the risk is more worrisome than the complexity and larger trees are favored because they reduce the risk.
  • \(\alpha\) large: If \(\alpha\) is set to be large, then the complexity of the tree is more worrisome and smaller trees are favored.
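
A toy numerical illustration (the RSS values and tree sizes below are made up): a big tree fits better but pays a larger penalty, so which tree wins depends on \(\alpha\).

rss  <- c(big = 90, small = 120)   # big tree fits the training data better
size <- c(big = 10, small = 3)     # ... but has many more terminal nodes

cost <- function(alpha) rss + alpha * size
cost(alpha = 1)    # big = 100, small = 123  -> big tree preferred
cost(alpha = 10)   # big = 190, small = 150  -> small tree preferred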

In practice

Consider \(\alpha\) increasing. As \(\alpha\) gets bigger, the “best” tree will be smaller.

The training error will decrease monotonically as the tree gets larger.

However, the test error will not be monotonically related to the size of the tree.

A note on \(\alpha\)

In the text (Introduction to Statistical Learning) and almost everywhere else you might look, the cost complexity is defined as in previous slides.

However, you might notice that in R the cost_complexity value is typically less than 1. The function that is minimized in R uses the average of the squared errors (numerical response) or the misclassification rate (categorical response), which allows \(\alpha\) to not depend on the sample size.

\[\begin{align} \mbox{numerical: } C_\alpha(T) &= \frac{1}{n}\sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \overline{y}_{R_m})^2 + \alpha \cdot |T|\\ \mbox{categorical: } C_\alpha(T) &= \frac{1}{n}\sum_{m=1}^{|T|} \sum_{i \in R_m} I(y_i \ne k(m)) + \alpha \cdot |T| \end{align}\]

CART algorithm

Algorithm: Building a Regression Tree

  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
  2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
  3. Use \(V\)-fold cross-validation to choose \(\alpha\). Divide the training observations into \(V\) folds, then for each \(v=1, 2, \ldots, V\):
    1. Repeat Steps 1 and 2 on all but the \(v\)th fold of the training data. (That is, you’ll have a set of best subtrees for the set of training data which does not contain fold \(v\).)
    2. Let’s say you are considering a set of \(\alpha\) values \((\alpha_1, \alpha_2, \alpha_3, \alpha_4)\). For each \(\alpha\) under consideration:
      • Using the \(\alpha\) under consideration, find the subtree (from Step 3a) that minimizes the cost complexity criterion \(C_\alpha(T)\).
      • Use that subtree to calculate the error on the left-out \(v^{th}\) fold.
    3. For each value of \(\alpha\), average the error (either the misclassification rate or the RMSE = square root of the mean squared error) over the \(V\) folds. You will have one average error for each \(\alpha\) value under consideration.
    4. Choose the value of \(\alpha\) that gives the minimum average error.
  4. Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).
  5. The subtree from Step 2 can then be applied to the test data (which has been in your pocket this whole time!) to assess how the tree's predictions will perform in the wild.

CART example w defaults

penguin_cart_recipe <-
  recipe(species ~ . ,
         data = penguin_train) |>
  step_unknown(sex, new_level = "unknown") |>
  step_mutate(year = as.factor(year)) 

penguin_cart_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:   1
predictor: 7
── Operations 
• Unknown factor level assignment for: sex
• Variable mutation for: as.factor(year)
penguin_cart <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

penguin_cart
Decision Tree Model Specification (classification)

Computational engine: rpart 
penguin_cart_wflow <- workflow() |>
  add_model(penguin_cart) |>
  add_recipe(penguin_cart_recipe)

penguin_cart_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_unknown()
• step_mutate()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Computational engine: rpart 
penguin_cart_fit <- penguin_cart_wflow |>
  fit(data = penguin_train)

penguin_cart_fit 
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_unknown()
• step_mutate()

── Model ───────────────────────────────────────────────────────────────────────
n= 258 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 258 139 Adelie (0.461240310 0.189922481 0.348837209)  
  2) flipper_length_mm< 206.5 164  46 Adelie (0.719512195 0.274390244 0.006097561)  
    4) bill_length_mm< 43.15 118   3 Adelie (0.974576271 0.025423729 0.000000000) *
    5) bill_length_mm>=43.15 46   4 Chinstrap (0.065217391 0.913043478 0.021739130) *
  3) flipper_length_mm>=206.5 94   5 Gentoo (0.010638298 0.042553191 0.946808511)  
    6) bill_depth_mm>=17.15 7   3 Chinstrap (0.142857143 0.571428571 0.285714286) *
    7) bill_depth_mm< 17.15 87   0 Gentoo (0.000000000 0.000000000 1.000000000) *
penguin_cart_fit |>
  predict(new_data = penguin_train) |>
  cbind(penguin_train) |>
  select(.pred_class, species) |>
  table()
           species
.pred_class Adelie Chinstrap Gentoo
  Adelie       115         3      0
  Chinstrap      4        46      3
  Gentoo         0         0     87

Not quite ready for test data because \(\alpha\) is unknown.
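
The same training-set comparison can be summarized with yardstick's accuracy() (a sketch; still not the test data):

penguin_cart_fit |>
  predict(new_data = penguin_train) |>
  cbind(penguin_train) |>
  accuracy(truth = species, estimate = .pred_class)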

Plotting the tree (not tidy)

library(rpart.plot)
penguins_cart_plot <- 
  penguin_cart_fit |>
  extract_fit_parsnip()

rpart.plot(
  penguins_cart_plot$fit,
  roundint = FALSE)  

library(rattle)
penguins_cart_plot <- 
  penguin_cart_fit |>
  extract_fit_parsnip()

fancyRpartPlot(
  penguins_cart_plot$fit,
  sub = NULL,
  palettes = "RdPu")

CART example w CV

penguin_cart_tune_recipe <-
  recipe(body_mass_g ~ . ,
         data = penguin_train) |>
  step_unknown(sex, new_level = "unknown") |>
  step_mutate(year = as.factor(year)) |>
  step_naomit(all_predictors(), body_mass_g)
set.seed(470)
penguin_vfold <- vfold_cv(penguin_train,
                          v = 5)
cart_grid <- expand.grid(
  cost_complexity = c(0, 10^(seq(-5,-1,1))),
  tree_depth = seq(1,6, by = 1))

cart_grid
   cost_complexity tree_depth
1            0e+00          1
2            1e-05          1
3            1e-04          1
4            1e-03          1
5            1e-02          1
6            1e-01          1
7            0e+00          2
8            1e-05          2
9            1e-04          2
10           1e-03          2
11           1e-02          2
12           1e-01          2
13           0e+00          3
14           1e-05          3
15           1e-04          3
16           1e-03          3
17           1e-02          3
18           1e-01          3
19           0e+00          4
20           1e-05          4
21           1e-04          4
22           1e-03          4
23           1e-02          4
24           1e-01          4
25           0e+00          5
26           1e-05          5
27           1e-04          5
28           1e-03          5
29           1e-02          5
30           1e-01          5
31           0e+00          6
32           1e-05          6
33           1e-04          6
34           1e-03          6
35           1e-02          6
36           1e-01          6
penguin_cart_tune <- 
  decision_tree(cost_complexity = tune(),
                tree_depth = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

penguin_cart_wflow_tune <- workflow() |>
  add_model(penguin_cart_tune) |>
  add_recipe(penguin_cart_tune_recipe)
penguin_tuned <- penguin_cart_wflow_tune |>
  tune_grid(resamples = penguin_vfold, 
           grid = cart_grid) 

penguin_tuned |> collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean)
# A tibble: 36 × 8
   cost_complexity tree_depth .metric .estimator  mean     n std_err
             <dbl>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
 1         0.001            4 rmse    standard    332.     5    13.8
 2         0                4 rmse    standard    333.     5    13.7
 3         0.00001          4 rmse    standard    333.     5    13.7
 4         0.0001           4 rmse    standard    333.     5    13.7
 5         0.001            6 rmse    standard    334.     5    18.4
 6         0                6 rmse    standard    336.     5    18.3
 7         0.00001          6 rmse    standard    336.     5    18.3
 8         0.0001           6 rmse    standard    336.     5    18.3
 9         0.001            5 rmse    standard    337.     5    15.9
10         0                5 rmse    standard    338.     5    15.6
# ℹ 26 more rows
# ℹ 1 more variable: .config <chr>

Parameter choice

(Hard to see because the tree_depth = 6 curve sits on top of the tree_depth = 5 curve; either would be fine to choose.)

penguin_tuned |>
  collect_metrics() |>
  filter(.metric == "rmse") |> 
  ggplot(aes(x = cost_complexity, y = mean, 
             color = as.factor(tree_depth))) + 
  geom_line()
penguin_tuned |> 
  select_best(metric = "rmse")
# A tibble: 1 × 3
  cost_complexity tree_depth .config              
            <dbl>      <dbl> <chr>                
1           0.001          4 Preprocessor1_Model22
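
As an aside, instead of typing the chosen parameters by hand (as on the next slide), tune's finalize_workflow() will plug the selected values into the tuning workflow; a sketch:

penguin_best <- penguin_tuned |>
  select_best(metric = "rmse")

penguin_cart_wflow_tune |>
  finalize_workflow(penguin_best) |>
  fit(data = penguin_train)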

Best model – using CV params

penguin_opt <- decision_tree(cost_complexity = 0, 
                             tree_depth = 5) |>
  set_engine("rpart") |>
  set_mode("regression")

penguin_opt
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 0
  tree_depth = 5

Computational engine: rpart 
penguin_final_opt <-
  workflow() |>
  add_model(penguin_opt) |>
  add_recipe(penguin_cart_tune_recipe) |>
  fit(data = penguin_train)

Best Model Predictions

workflow() |>
  add_model(penguin_opt) |>
  add_recipe(penguin_cart_tune_recipe) |>
  fit(data = penguin_train) |>
  predict(new_data = penguin_test) |>
  cbind(penguin_test) |>
  ggplot(aes(x = body_mass_g, y = .pred)) + 
  geom_point() + 
  geom_abline(intercept = 0, slope = 1)
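
To put a number on the plot, the test-set RMSE can be computed with yardstick's rmse(); a sketch using the fitted workflow penguin_final_opt from the previous slide:

penguin_final_opt |>
  predict(new_data = penguin_test) |>
  cbind(penguin_test) |>
  rmse(truth = body_mass_g, estimate = .pred)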

Best Model Predictions

library(rpart.plot)
penguins_cart_plot <- 
workflow() |>
  add_model(penguin_opt) |>
  add_recipe(penguin_cart_tune_recipe) |>
  fit(data = penguin_train) |>
  extract_fit_parsnip()

rpart.plot(
  penguins_cart_plot$fit,
  roundint = FALSE) 

Bias-Variance Tradeoff

  • Variance refers to the amount by which the fitted model \(\hat{f}\) would change if we estimated it using a different training set. Generally, the closer the model fits the (training) data, the more variable it will be (it will be different for each data set!). A model with many, many explanatory variables will often fit the data too closely.

  • Bias refers to the error that is introduced by approximating the “truth” by a model which is too simple. For example, we often use linear models to describe complex relationships, but it is unlikely that any real life situation actually has a true linear model. However, if the true relationship is close to linear, then the linear model will have a low bias.

  • an excellent resource for explaining the bias-variance trade-off: http://scott.fortmann-roe.com/docs/BiasVariance.html

Bias-Variance Tradeoff

model | high bias (too simple) | high variance (overfit training data)
linear model | one (zero!) predictor | many, many predictors
\(k\)-NN | \(k = n\) | \(k = 1\)
CART | one split (high \(\alpha\)) | single observation in every terminal node (\(\alpha = 0\))

Bias-Variance Tradeoff

Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR

Reflecting on Model Building

Image credit: https://www.tmwr.org/
