Using recipes

October 21, 2024

Jo Hardin

Agenda 10/21/24

  1. Feature engineering: should the data be modified?
  2. tidymodels syntax for recipes
    1. recipe / feature engineering
    2. model
    3. workflow
    4. fit
    5. validate
  3. Example

Modeling

library(tidymodels)

Motivation

tidymodels syntax

  1. partition the data
  2. build a recipe
  3. select a model
  4. create a workflow
  5. fit the model
  6. (validate the model)

partition the data

Put the testing data in your pocket (keep it secret from R!!)

Image credit: Julia Silge

partition the data

library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
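The split above uses the defaults; a sketch with the optional `prop` and `strata` arguments written out (the values here are illustrations, not part of the original example — `initial_split()` defaults to a 3:1 train/test split):

```r
library(tidymodels)
library(palmerpenguins)

# prop sets the training fraction; strata = species samples within each
# species so the train/test class balance matches (illustrative choices)
set.seed(47)
penguin_split_strat <- initial_split(penguins, prop = 0.75, strata = species)
penguin_train_strat <- training(penguin_split_strat)
penguin_test_strat  <- testing(penguin_split_strat)

# training and testing sets partition the original rows
nrow(penguin_train_strat) + nrow(penguin_test_strat) == nrow(penguins)  # TRUE
```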

build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step

feature engineering

Feature engineering is the process of transforming raw data into features (variables) that are better predictors (for the model at hand) of the response variable.

  • create new variables (e.g., combine levels \(\rightarrow\) from state to region)
  • transform variable (e.g., log, arcsine)
  • continuous variables \(\rightarrow\) discrete (e.g., binning)
  • numerically coded categorical data \(\rightarrow\) factors / character strings (e.g., in preparation for one-hot encoding)
  • time \(\rightarrow\) discretized time
  • missing values \(\rightarrow\) imputed
  • NA \(\rightarrow\) level
  • continuous variables \(\rightarrow\) center & scale (“normalize”)
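Most of these transformations map onto recipes step_ functions. A hypothetical sketch, with one step per bullet above (the variable choices are illustrative only, not a recommended preprocessing for these data):

```r
library(tidymodels)
library(palmerpenguins)

fe_recipe <- recipe(body_mass_g ~ ., data = penguins) |>
  step_impute_mean(all_numeric_predictors()) |>    # missing values -> imputed
  step_log(bill_depth_mm) |>                       # transform a variable
  step_cut(bill_length_mm, breaks = c(40, 50)) |>  # continuous -> discrete bins
  step_unknown(sex) |>                             # NA -> its own factor level
  step_dummy(all_nominal_predictors()) |>          # categorical -> indicators
  step_normalize(all_numeric_predictors())         # center & scale
```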

step_ functions

For more information: https://recipes.tidymodels.org/reference/index.html

apropos("^step_")
 [1] "step_arrange"            "step_bagimpute"         
 [3] "step_bin2factor"         "step_BoxCox"            
 [5] "step_bs"                 "step_center"            
 [7] "step_classdist"          "step_classdist_shrunken"
 [9] "step_corr"               "step_count"             
[11] "step_cut"                "step_date"              
[13] "step_depth"              "step_discretize"        
[15] "step_dummy"              "step_dummy_extract"     
[17] "step_dummy_multi_choice" "step_factor2string"     
[19] "step_filter"             "step_filter_missing"    
[21] "step_geodist"            "step_harmonic"          
[23] "step_holiday"            "step_hyperbolic"        
[25] "step_ica"                "step_impute_bag"        
[27] "step_impute_knn"         "step_impute_linear"     
[29] "step_impute_lower"       "step_impute_mean"       
[31] "step_impute_median"      "step_impute_mode"       
[33] "step_impute_roll"        "step_indicate_na"       
[35] "step_integer"            "step_interact"          
[37] "step_intercept"          "step_inverse"           
[39] "step_invlogit"           "step_isomap"            
[41] "step_knnimpute"          "step_kpca"              
[43] "step_kpca_poly"          "step_kpca_rbf"          
[45] "step_lag"                "step_lincomb"           
[47] "step_log"                "step_logit"             
[49] "step_lowerimpute"        "step_meanimpute"        
[51] "step_medianimpute"       "step_modeimpute"        
[53] "step_mutate"             "step_mutate_at"         
[55] "step_naomit"             "step_nnmf"              
[57] "step_nnmf_sparse"        "step_normalize"         
[59] "step_novel"              "step_ns"                
[61] "step_num2factor"         "step_nzv"               
[63] "step_ordinalscore"       "step_other"             
[65] "step_pca"                "step_percentile"        
[67] "step_pls"                "step_poly"              
[69] "step_poly_bernstein"     "step_profile"           
[71] "step_range"              "step_ratio"             
[73] "step_regex"              "step_relevel"           
[75] "step_relu"               "step_rename"            
[77] "step_rename_at"          "step_rm"                
[79] "step_rollimpute"         "step_sample"            
[81] "step_scale"              "step_select"            
[83] "step_shuffle"            "step_slice"             
[85] "step_spatialsign"        "step_spline_b"          
[87] "step_spline_convex"      "step_spline_monotone"   
[89] "step_spline_natural"     "step_spline_nonnegative"
[91] "step_sqrt"               "step_string2factor"     
[93] "step_time"               "step_unknown"           
[95] "step_unorder"            "step_window"            
[97] "step_YeoJohnson"         "step_zv"                

the data: penguins

Image credit: Alison Hill

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

recipe

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")
summary(penguin_recipe)
# A tibble: 8 × 4
  variable          type      role        source  
  <chr>             <list>    <chr>       <chr>   
1 species           <chr [3]> predictor   original
2 island            <chr [3]> id variable original
3 bill_length_mm    <chr [2]> predictor   original
4 bill_depth_mm     <chr [2]> predictor   original
5 flipper_length_mm <chr [2]> predictor   original
6 sex               <chr [3]> predictor   original
7 year              <chr [2]> predictor   original
8 body_mass_g       <chr [2]> outcome     original
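To see what this preprocessing actually produces, one option (assuming the penguin_recipe and penguin_train objects above) is to prep() the recipe and bake() it:

```r
# prep() estimates any quantities the steps need from the training data;
# bake(new_data = NULL) then returns the processed training set
penguin_recipe |>
  prep() |>
  bake(new_data = NULL) |>
  head()
```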

model

To specify a model:

  1. pick a model
  2. set the mode (regression vs classification, if needed)
  3. set the engine
penguin_lm <- linear_reg() |>
  set_engine("lm")
penguin_lm
Linear Regression Model Specification (regression)

Computational engine: lm 

engines

show_engines("nearest_neighbor")
# A tibble: 2 × 2
  engine mode          
  <chr>  <chr>         
1 kknn   classification
2 kknn   regression    
show_engines("decision_tree")
# A tibble: 5 × 2
  engine mode          
  <chr>  <chr>         
1 rpart  classification
2 rpart  regression    
3 C5.0   classification
4 spark  classification
5 spark  regression    
show_engines("rand_forest")
# A tibble: 6 × 2
  engine       mode          
  <chr>        <chr>         
1 ranger       classification
2 ranger       regression    
3 randomForest classification
4 randomForest regression    
5 spark        classification
6 spark        regression    
show_engines("svm_poly")
# A tibble: 2 × 2
  engine  mode          
  <chr>   <chr>         
1 kernlab classification
2 kernlab regression    
show_engines("svm_rbf")
# A tibble: 4 × 2
  engine    mode          
  <chr>     <chr>         
1 kernlab   classification
2 kernlab   regression    
3 liquidSVM classification
4 liquidSVM regression    
show_engines("linear_reg")
# A tibble: 7 × 2
  engine mode      
  <chr>  <chr>     
1 lm     regression
2 glm    regression
3 glmnet regression
4 stan   regression
5 spark  regression
6 keras  regression
7 brulee regression
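Swapping in a different model is a change to the spec only; for example, a k-nearest-neighbors regression with the kknn engine (neighbors = 5 is an arbitrary illustration):

```r
library(tidymodels)

# same three moves: pick a model, set the mode, set the engine
penguin_knn <- nearest_neighbor(neighbors = 5) |>
  set_mode("regression") |>
  set_engine("kknn")
penguin_knn
```

The same workflow and recipe can then be reused with `add_model(penguin_knn)` in place of the linear model.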

workflow

penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)
penguin_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_mutate()
• step_unknown()
• step_relevel()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

fit

penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)
penguin_fit |> tidy()
# A tibble: 10 × 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        -2417.     665.      -3.64  3.36e- 4
 2 speciesChinstrap    -208.      92.9     -2.24  2.58e- 2
 3 speciesGentoo        985.     152.       6.48  5.02e-10
 4 bill_length_mm        13.5      8.29     1.63  1.04e- 1
 5 bill_depth_mm         80.9     22.1      3.66  3.10e- 4
 6 flipper_length_mm     20.8      3.62     5.74  2.81e- 8
 7 sexmale              351.      52.6      6.67  1.72e-10
 8 sexunknown            47.6    103.       0.460 6.46e- 1
 9 year2008             -24.8     47.5     -0.521 6.03e- 1
10 year2009             -61.9     46.0     -1.35  1.80e- 1
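The agenda's final step, validate, is not shown above. A sketch of predicting on the held-out test set and computing the default regression metrics with yardstick (assuming the penguin_fit and penguin_test objects from earlier):

```r
# predict on the test set (kept in your pocket until now) and
# compare the predictions to the true body masses
penguin_fit |>
  predict(new_data = penguin_test) |>
  bind_cols(penguin_test) |>
  metrics(truth = body_mass_g, estimate = .pred)
```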

entire process

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

summary(penguin_recipe)
# A tibble: 8 × 4
  variable          type      role        source  
  <chr>             <list>    <chr>       <chr>   
1 species           <chr [3]> predictor   original
2 island            <chr [3]> id variable original
3 bill_length_mm    <chr [2]> predictor   original
4 bill_depth_mm     <chr [2]> predictor   original
5 flipper_length_mm <chr [2]> predictor   original
6 sex               <chr [3]> predictor   original
7 year              <chr [2]> predictor   original
8 body_mass_g       <chr [2]> outcome     original
penguin_lm <- linear_reg() |>
  set_engine("lm")

penguin_lm
Linear Regression Model Specification (regression)

Computational engine: lm 
penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)

penguin_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_mutate()
• step_unknown()
• step_relevel()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 
penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)

penguin_fit |> tidy()
# A tibble: 10 × 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        -2417.     665.      -3.64  3.36e- 4
 2 speciesChinstrap    -208.      92.9     -2.24  2.58e- 2
 3 speciesGentoo        985.     152.       6.48  5.02e-10
 4 bill_length_mm        13.5      8.29     1.63  1.04e- 1
 5 bill_depth_mm         80.9     22.1      3.66  3.10e- 4
 6 flipper_length_mm     20.8      3.62     5.74  2.81e- 8
 7 sexmale              351.      52.6      6.67  1.72e-10
 8 sexunknown            47.6    103.       0.460 6.46e- 1
 9 year2008             -24.8     47.5     -0.521 6.03e- 1
10 year2009             -61.9     46.0     -1.35  1.80e- 1

model parameters

  • Some model parameters are tuned from the data (some aren’t).

    • linear model coefficients are estimated directly (optimized, not tuned)
    • the k-NN number of neighbors, k, is tuned
  • If the model is tuned using the data, the same data cannot be used to assess the model.

  • With cross validation, you iteratively put data in your pocket.

  • For example, keep 1/5 of the data in your pocket, build the model on the remaining 4/5, and rotate until each fifth has been held out once.

Cross validation

for tuning parameters

Image credit: Alison Hill

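This pocket-rotation idea is what rsample's vfold_cv() implements. A sketch of 5-fold cross validation (assuming the penguin_wflow and penguin_train objects defined earlier):

```r
# split the training data into 5 folds; fit_resamples() fits the workflow
# on 4/5 of the rows and assesses on the held-out 1/5, rotating the folds
set.seed(47)
penguin_folds <- vfold_cv(penguin_train, v = 5)

penguin_wflow |>
  fit_resamples(resamples = penguin_folds) |>
  collect_metrics()
```

collect_metrics() averages the per-fold performance, giving an assessment that never touches the test set in your pocket.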