Using recipes

October 21, 2024

Jo Hardin

Agenda 10/21/24

Feature engineering: should the data be modified?
tidymodels syntax for recipes
1. recipe / feature engineering
2. model
3. workflow
4. fit
5. validate
Example

Modeling

library(tidymodels)

Motivation

`tidymodels` syntax

partition the data
build a recipe
select a model
create a workflow
fit the model
(validate the model)

partition the data

Put the testing data in your pocket (keep it secret from R!!)

Image credit: Julia Silge

partition the data

library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
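
A quick way to see what the split produced is to count the rows in each piece. The sketch below repeats the split from the slide and then, as an assumption that is not part of the original slides, shows a second call that spells out the proportion and stratifies by `species`.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)   # default proportion: 3/4 training
penguin_train <- training(penguin_split)
penguin_test  <- testing(penguin_split)

# every row lands in exactly one of the two pieces
split_sizes <- c(train = nrow(penguin_train),
                 test  = nrow(penguin_test),
                 total = nrow(penguins))
split_sizes

# assumption (not in the slides): explicit proportion, stratified by species
set.seed(47)
penguin_split_strat <- initial_split(penguins, prop = 0.75, strata = species)
table(training(penguin_split_strat)$species)
```

Stratifying is optional here; it only keeps the species mix similar in the training and testing pieces.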

build a recipe

Start the recipe()
Define the variables involved
Describe preprocessing step-by-step

feature engineering

Feature engineering is the process of transforming raw data into features (variables) that are better predictors of the response variable for the model at hand.

create new variables (e.g., combine levels: state \(\rightarrow\) region)
transform variables (e.g., log, arcsine)
continuous variables \(\rightarrow\) discrete (e.g., binning)
categorical data stored as numbers \(\rightarrow\) factors / character strings (or indicator columns, i.e., one hot encoding)
time \(\rightarrow\) discretized time
missing values \(\rightarrow\) imputed
NA \(\rightarrow\) its own factor level
continuous variables \(\rightarrow\) center & scale (“normalize”)
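
Several of the transformations listed above have a direct `step_` counterpart. The following is a minimal sketch, not the recipe used later in these slides: the choice of steps (median imputation, an "unknown" level, normalization, one hot encoding) and the object names `fe_recipe` and `fe_baked` are illustrative assumptions.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

fe_recipe <- recipe(body_mass_g ~ ., data = penguin_train) |>
  step_impute_median(all_numeric_predictors()) |>         # missing values -> imputed
  step_unknown(sex, new_level = "unknown") |>             # NA -> its own level
  step_normalize(all_numeric_predictors()) |>             # center & scale
  step_dummy(all_nominal_predictors(), one_hot = TRUE)    # one hot encoding

# prep() estimates the medians, means, and sds from the training data;
# bake(new_data = NULL) returns the transformed training data
fe_baked <- fe_recipe |>
  prep() |>
  bake(new_data = NULL)

names(fe_baked)
```

Because the quantities are estimated in `prep()`, the same medians, means, and standard deviations learned from the training data get applied to any new data passed to `bake()`.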

`step_` functions

For more information: https://recipes.tidymodels.org/reference/index.html

apropos("^step_")

 [1] "step_arrange"            "step_bagimpute"         
 [3] "step_bin2factor"         "step_BoxCox"            
 [5] "step_bs"                 "step_center"            
 [7] "step_classdist"          "step_classdist_shrunken"
 [9] "step_corr"               "step_count"             
[11] "step_cut"                "step_date"              
[13] "step_depth"              "step_discretize"        
[15] "step_dummy"              "step_dummy_extract"     
[17] "step_dummy_multi_choice" "step_factor2string"     
[19] "step_filter"             "step_filter_missing"    
[21] "step_geodist"            "step_harmonic"          
[23] "step_holiday"            "step_hyperbolic"        
[25] "step_ica"                "step_impute_bag"        
[27] "step_impute_knn"         "step_impute_linear"     
[29] "step_impute_lower"       "step_impute_mean"       
[31] "step_impute_median"      "step_impute_mode"       
[33] "step_impute_roll"        "step_indicate_na"       
[35] "step_integer"            "step_interact"          
[37] "step_intercept"          "step_inverse"           
[39] "step_invlogit"           "step_isomap"            
[41] "step_knnimpute"          "step_kpca"              
[43] "step_kpca_poly"          "step_kpca_rbf"          
[45] "step_lag"                "step_lincomb"           
[47] "step_log"                "step_logit"             
[49] "step_lowerimpute"        "step_meanimpute"        
[51] "step_medianimpute"       "step_modeimpute"        
[53] "step_mutate"             "step_mutate_at"         
[55] "step_naomit"             "step_nnmf"              
[57] "step_nnmf_sparse"        "step_normalize"         
[59] "step_novel"              "step_ns"                
[61] "step_num2factor"         "step_nzv"               
[63] "step_ordinalscore"       "step_other"             
[65] "step_pca"                "step_percentile"        
[67] "step_pls"                "step_poly"              
[69] "step_poly_bernstein"     "step_profile"           
[71] "step_range"              "step_ratio"             
[73] "step_regex"              "step_relevel"           
[75] "step_relu"               "step_rename"            
[77] "step_rename_at"          "step_rm"                
[79] "step_rollimpute"         "step_sample"            
[81] "step_scale"              "step_select"            
[83] "step_shuffle"            "step_slice"             
[85] "step_spatialsign"        "step_spline_b"          
[87] "step_spline_convex"      "step_spline_monotone"   
[89] "step_spline_natural"     "step_spline_nonnegative"
[91] "step_sqrt"               "step_string2factor"     
[93] "step_time"               "step_unknown"           
[95] "step_unorder"            "step_window"            
[97] "step_YeoJohnson"         "step_zv"

the data: penguins

Image credit: Alison Hill

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

recipe

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_recipe

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:     1
predictor:   6
id variable: 1

── Operations

• Variable mutation for: as.factor(year)

• Unknown factor level assignment for: sex

• Re-order factor level to ref_level for: sex
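
Printing a recipe only lists the planned operations. To see what the steps actually do to the data, one option is to `prep()` the recipe and `bake()` the training set, as in the sketch below. The object names `penguin_roles` and `penguin_baked` are introduced here for illustration.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

# one row per variable, including the role assigned to island
penguin_roles <- summary(penguin_recipe)
penguin_roles

# apply the steps to the training data
penguin_baked <- penguin_recipe |>
  prep() |>
  bake(new_data = NULL)

levels(penguin_baked$sex)    # "female" first, "unknown" added
class(penguin_baked$year)    # now a factor
penguin_baked |> count(sex)
```

Note that `island` is still a column in the baked data; its role just tells the workflow not to use it as a predictor.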

model

To specify a model:

pick a model
set the mode (regression vs classification, if needed)
set the engine

penguin_lm <- linear_reg() |>
  set_engine("lm")

penguin_lm

Linear Regression Model Specification (regression)

Computational engine: lm
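
For `linear_reg()` the mode is already regression, so only the engine needs to be set. For a model that can do either regression or classification, all three steps appear. The sketch below uses a nearest neighbor model; the value `neighbors = 5` and the name `penguin_knn` are arbitrary illustrative choices, not something used elsewhere in these slides.

```r
library(tidymodels)

penguin_knn <- nearest_neighbor(neighbors = 5) |>  # 1. pick a model
  set_mode("regression") |>                        # 2. set the mode
  set_engine("kknn")                               # 3. set the engine

penguin_knn
```

Writing the specification does not fit anything; the `kknn` package is only needed once the model is fit.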

engines

show_engines("nearest_neighbor")

# A tibble: 2 × 2
  engine mode          
  <chr>  <chr>         
1 kknn   classification
2 kknn   regression

show_engines("decision_tree")

# A tibble: 5 × 2
  engine mode          
  <chr>  <chr>         
1 rpart  classification
2 rpart  regression    
3 C5.0   classification
4 spark  classification
5 spark  regression

show_engines("rand_forest")

# A tibble: 6 × 2
  engine       mode          
  <chr>        <chr>         
1 ranger       classification
2 ranger       regression    
3 randomForest classification
4 randomForest regression    
5 spark        classification
6 spark        regression

show_engines("svm_poly")

# A tibble: 2 × 2
  engine  mode          
  <chr>   <chr>         
1 kernlab classification
2 kernlab regression

show_engines("svm_rbf")

# A tibble: 4 × 2
  engine    mode          
  <chr>     <chr>         
1 kernlab   classification
2 kernlab   regression    
3 liquidSVM classification
4 liquidSVM regression

show_engines("linear_reg")

# A tibble: 7 × 2
  engine mode      
  <chr>  <chr>     
1 lm     regression
2 glm    regression
3 glmnet regression
4 stan   regression
5 spark  regression
6 keras  regression
7 brulee regression

workflow

penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)

penguin_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_mutate()
• step_unknown()
• step_relevel()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm
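
One reason to bundle the recipe and the model is that either piece can be swapped without rewriting the other. As a sketch, the code below replaces the linear model with a regression tree using `update_model()`. The tree specification `penguin_tree` is an assumption made for illustration and is not a model used in these slides.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_wflow <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_recipe(penguin_recipe)

# hypothetical alternative model: a regression tree
penguin_tree <- decision_tree() |>
  set_mode("regression") |>
  set_engine("rpart")

# same recipe, different model
penguin_wflow_tree <- penguin_wflow |>
  update_model(penguin_tree)

penguin_wflow_tree
```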

fit

penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)

penguin_fit |> tidy()

# A tibble: 10 × 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        -2417.     665.      -3.64  3.36e- 4
 2 speciesChinstrap    -208.      92.9     -2.24  2.58e- 2
 3 speciesGentoo        985.     152.       6.48  5.02e-10
 4 bill_length_mm        13.5      8.29     1.63  1.04e- 1
 5 bill_depth_mm         80.9     22.1      3.66  3.10e- 4
 6 flipper_length_mm     20.8      3.62     5.74  2.81e- 8
 7 sexmale              351.      52.6      6.67  1.72e-10
 8 sexunknown            47.6    103.       0.460 6.46e- 1
 9 year2008             -24.8     47.5     -0.521 6.03e- 1
10 year2009             -61.9     46.0     -1.35  1.80e- 1
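
The last step in the agenda, validating the model, can be sketched by taking the testing data out of your pocket. The code below refits the workflow from the slides, predicts on `penguin_test`, and computes the root mean squared error. The choice of RMSE as the metric and the names `penguin_pred` and `penguin_rmse` are assumptions made for this example.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test  <- testing(penguin_split)

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_fit <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_recipe(penguin_recipe) |>
  fit(data = penguin_train)

# augment() adds a .pred column to the testing data;
# the recipe steps are applied to the new data automatically
penguin_pred <- penguin_fit |>
  augment(new_data = penguin_test)

# rows with a missing truth or prediction are dropped by default
penguin_rmse <- penguin_pred |>
  rmse(truth = body_mass_g, estimate = .pred)

penguin_rmse
```

The RMSE is in grams, the units of `body_mass_g`. The same calculation on the training data would tend to look better than it should, which is why the testing data stays hidden until this point.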

entire process

recipe
model
workflow
fit

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_lm <- linear_reg() |>
  set_engine("lm")

penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)

penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)

penguin_fit |> tidy()

model parameters

Some model parameters are tuned using the data; others are not.
- linear model coefficients are optimized when the model is fit (not tuned)
- the value of “k” in k-NN is tuned
If a model is tuned using a set of data, that same data cannot be used to assess the model.
With cross validation, you iteratively put part of the data in your pocket.
For example, keep 1/5 of the data in your pocket and build the model on the remaining 4/5; then repeat so that each fifth takes a turn in your pocket.
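
The "1/5 in your pocket" idea can be sketched with `vfold_cv()`, which splits the training data into folds. In the code below, `v = 5`, the second seed, and the name `fold_sizes` are illustrative assumptions.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

# 5 folds: each fold takes one turn as the held-out 1/5
set.seed(4747)
penguin_folds <- vfold_cv(penguin_train, v = 5)

# for each fold: rows used to build the model vs. rows kept in the pocket
fold_sizes <- tibble(
  fold     = penguin_folds$id,
  n_build  = vapply(penguin_folds$splits,
                    function(s) nrow(analysis(s)), integer(1)),
  n_pocket = vapply(penguin_folds$splits,
                    function(s) nrow(assessment(s)), integer(1))
)

fold_sizes
```

Each row of the training data is held out exactly once, so the `n_pocket` column adds up to the number of training rows.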

Cross validation

for tuning parameters

Image credit: Alison Hill
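
As a sketch of cross validation used for tuning, the code below chooses “k” for a k-NN regression of body mass on the three size measurements. Everything specific here is an assumption made for illustration: the predictors, the candidate values of k, the use of RMSE, and dropping the rows with missing measurements before splitting. It also assumes the `kknn` package is installed.

```r
library(tidymodels)
library(palmerpenguins)

# simplification (not from the slides): keep only rows with measurements
penguins_complete <- penguins |>
  filter(!is.na(body_mass_g))

set.seed(47)
penguin_train <- training(initial_split(penguins_complete))

set.seed(4747)
penguin_folds <- vfold_cv(penguin_train, v = 5)

knn_recipe <-
  recipe(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
         data = penguin_train) |>
  step_normalize(all_numeric_predictors())   # k-NN depends on distances

knn_spec <- nearest_neighbor(neighbors = tune()) |>  # k is left to be tuned
  set_mode("regression") |>
  set_engine("kknn")

knn_wflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# candidate values of k (arbitrary choices)
knn_grid <- tibble(neighbors = c(1L, 5L, 10L, 20L))

# fit every candidate on every fold's 4/5, assess on the held-out 1/5
knn_tuned <- knn_wflow |>
  tune_grid(resamples = penguin_folds,
            grid = knn_grid,
            metrics = metric_set(rmse))

knn_metrics <- collect_metrics(knn_tuned)
knn_metrics

best_k <- select_best(knn_tuned, metric = "rmse")
best_k
```

The testing data is never touched during tuning. It stays in your pocket until a final model, fit with the chosen k, is assessed once.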


