Using recipes

October 20, 2025

Jo Hardin

Agenda 10/20/25

Feature engineering: should the data be modified?
tidymodels syntax for recipes
1. recipe / feature engineering
2. model
3. workflow
4. fit
5. validate

Modeling

library(tidymodels)

Motivation

`tidymodels` syntax

partition the data
build a recipe
select a model
create a workflow
fit the model
(validate the model)

partition the data

Put the testing data in your pocket (keep it secret from the model / R!!)

Image credit: Julia Silge

partition the data

library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
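
By default, `initial_split()` keeps 3/4 of the rows for training. A minimal sketch, assuming you want to state the proportion explicitly and stratify by `species`. The stratification choice and the `_strat` object names are illustrations, not part of the split used in the rest of these slides.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
# prop = 3/4 is the default; strata keeps the species mix similar in both sets
penguin_split_strat <- initial_split(penguins, prop = 3/4, strata = species)
penguin_train_strat <- training(penguin_split_strat)
penguin_test_strat  <- testing(penguin_split_strat)

# every row lands in exactly one of the two sets
nrow(penguin_train_strat) + nrow(penguin_test_strat) == nrow(penguins)

# compare the species make-up of the two sets
count(penguin_train_strat, species)
count(penguin_test_strat, species)
```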

build a recipe

Start the recipe()
Define the variables involved
Describe preprocessing step-by-step

feature engineering

Feature engineering is the process of transforming raw data into features (variables) that are better predictors of the response variable for the model at hand.

create new variables (e.g., combine levels \(\rightarrow\) from state to region)
transform variables (e.g., log, arcsine)
continuous variables \(\rightarrow\) discrete or categorical (e.g., binning)
numerically coded categorical data (e.g., zipcode) \(\rightarrow\) factors / character strings, then dummy or one-hot encoding
time \(\rightarrow\) discretized time
missing values \(\rightarrow\) imputed
NA \(\rightarrow\) its own factor level (e.g., “unknown”)
continuous variables \(\rightarrow\) center & scale (“normalize”)
date \(\rightarrow\) weekday vs. weekend
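
A few of the transformations above can be sketched as recipe steps. This is one possible combination for the penguins data, not a recommended preprocessing pipeline; the object names `fe_recipe` and `fe_baked` are made up for the illustration.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

fe_recipe <- recipe(body_mass_g ~ ., data = penguin_train) |>
  step_impute_median(all_numeric_predictors()) |>  # missing values -> imputed
  step_unknown(sex, new_level = "unknown") |>      # NA -> its own level
  step_normalize(all_numeric_predictors()) |>      # center & scale
  step_dummy(all_nominal_predictors())             # factors -> dummy variables

# prep() estimates medians, means, and sds from the training data;
# bake(new_data = NULL) returns the processed training data
fe_baked <- fe_recipe |>
  prep() |>
  bake(new_data = NULL)

glimpse(fe_baked)
```

The order of the steps matters: here the imputation happens before normalizing, and the "unknown" level is added before the dummy variables are created.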

`step_` functions

For more information: https://recipes.tidymodels.org/reference/index.html

apropos("^step_")

 [1] "step_arrange"            "step_bagimpute"         
 [3] "step_bin2factor"         "step_BoxCox"            
 [5] "step_bs"                 "step_center"            
 [7] "step_classdist"          "step_classdist_shrunken"
 [9] "step_corr"               "step_count"             
[11] "step_cut"                "step_date"              
[13] "step_depth"              "step_discretize"        
[15] "step_dummy"              "step_dummy_extract"     
[17] "step_dummy_multi_choice" "step_factor2string"     
[19] "step_filter"             "step_filter_missing"    
[21] "step_geodist"            "step_harmonic"          
[23] "step_holiday"            "step_hyperbolic"        
[25] "step_ica"                "step_impute_bag"        
[27] "step_impute_knn"         "step_impute_linear"     
[29] "step_impute_lower"       "step_impute_mean"       
[31] "step_impute_median"      "step_impute_mode"       
[33] "step_impute_roll"        "step_indicate_na"       
[35] "step_integer"            "step_interact"          
[37] "step_intercept"          "step_inverse"           
[39] "step_invlogit"           "step_isomap"            
[41] "step_knnimpute"          "step_kpca"              
[43] "step_kpca_poly"          "step_kpca_rbf"          
[45] "step_lag"                "step_lincomb"           
[47] "step_log"                "step_logit"             
[49] "step_lowerimpute"        "step_meanimpute"        
[51] "step_medianimpute"       "step_modeimpute"        
[53] "step_mutate"             "step_mutate_at"         
[55] "step_naomit"             "step_nnmf"              
[57] "step_nnmf_sparse"        "step_normalize"         
[59] "step_novel"              "step_ns"                
[61] "step_num2factor"         "step_nzv"               
[63] "step_ordinalscore"       "step_other"             
[65] "step_pca"                "step_percentile"        
[67] "step_pls"                "step_poly"              
[69] "step_poly_bernstein"     "step_profile"           
[71] "step_range"              "step_ratio"             
[73] "step_regex"              "step_relevel"           
[75] "step_relu"               "step_rename"            
[77] "step_rename_at"          "step_rm"                
[79] "step_rollimpute"         "step_sample"            
[81] "step_scale"              "step_select"            
[83] "step_shuffle"            "step_slice"             
[85] "step_spatialsign"        "step_spline_b"          
[87] "step_spline_convex"      "step_spline_monotone"   
[89] "step_spline_natural"     "step_spline_nonnegative"
[91] "step_sqrt"               "step_string2factor"     
[93] "step_time"               "step_unknown"           
[95] "step_unorder"            "step_window"            
[97] "step_YeoJohnson"         "step_zv"

the data: penguins

Image credit: Alison Hill

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

recipe

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_recipe

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:     1
predictor:   6
id variable: 1

── Operations

• Variable mutation for: as.factor(year)

• Unknown factor level assignment for: sex

• Re-order factor level to ref_level for: sex
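
Printing a recipe only lists the planned steps. To see what they do to the data, a minimal sketch using `prep()` and `bake()` (the name `penguin_baked` is introduced here for illustration):

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

# prep() works out what each step needs from the training data;
# bake(new_data = NULL) returns the processed training data
penguin_baked <- penguin_recipe |>
  prep() |>
  bake(new_data = NULL)

levels(penguin_baked$year)  # year is now a factor
levels(penguin_baked$sex)   # includes the new "unknown" level
summary(penguin_recipe)     # one row per variable, with its role
```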

silly recipe

penguin_recipe_silly <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_log(all_numeric())

penguin_recipe_silly

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 7

── Operations

• Log transformation on: all_numeric()
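
One possible reading of why this recipe is "silly" (an interpretation, not stated on the slide): `all_numeric()` selects every numeric column, so the outcome `body_mass_g` and the calendar `year` are logged too. The sketch below checks that and contrasts it with `all_numeric_predictors()`, which leaves the outcome alone. The names `penguin_formula`, `silly_baked`, and `predictors_only_baked` are made up for the illustration.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins))

penguin_formula <- body_mass_g ~ species + island + bill_length_mm +
  bill_depth_mm + flipper_length_mm + sex + year

# logs every numeric column, outcome included
silly_baked <- recipe(penguin_formula, data = penguin_train) |>
  step_log(all_numeric()) |>
  prep() |>
  bake(new_data = NULL)

# logs only the numeric predictors
predictors_only_baked <- recipe(penguin_formula, data = penguin_train) |>
  step_log(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

range(silly_baked$body_mass_g, na.rm = TRUE)            # log scale
range(predictors_only_baked$body_mass_g, na.rm = TRUE)  # grams
range(silly_baked$year)                                 # log of a calendar year
```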

model

To specify a model:

pick a model
set the mode (regression vs classification, if needed)
set the engine

penguin_lm <- linear_reg() |>
  set_engine("lm")

penguin_lm

Linear Regression Model Specification (regression)

Computational engine: lm
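
`linear_reg()` only does regression, so no mode is set above. For a model that can do either, the mode has to be set. A minimal sketch with a decision tree and the rpart engine (the name `penguin_tree` is introduced here and is not used elsewhere in these slides):

```r
library(tidymodels)

penguin_tree <- decision_tree() |>  # pick a model
  set_mode("regression") |>         # set the mode
  set_engine("rpart")               # set the engine

penguin_tree
```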

engines

show_engines("nearest_neighbor")

# A tibble: 2 × 2
  engine mode          
  <chr>  <chr>         
1 kknn   classification
2 kknn   regression

show_engines("decision_tree")

# A tibble: 5 × 2
  engine mode          
  <chr>  <chr>         
1 rpart  classification
2 rpart  regression    
3 C5.0   classification
4 spark  classification
5 spark  regression

show_engines("rand_forest")

# A tibble: 6 × 2
  engine       mode          
  <chr>        <chr>         
1 ranger       classification
2 ranger       regression    
3 randomForest classification
4 randomForest regression    
5 spark        classification
6 spark        regression

show_engines("svm_poly")

# A tibble: 2 × 2
  engine  mode          
  <chr>   <chr>         
1 kernlab classification
2 kernlab regression

show_engines("svm_rbf")

# A tibble: 4 × 2
  engine    mode          
  <chr>     <chr>         
1 kernlab   classification
2 kernlab   regression    
3 liquidSVM classification
4 liquidSVM regression

show_engines("linear_reg")

# A tibble: 8 × 2
  engine   mode               
  <chr>    <chr>              
1 lm       regression         
2 glm      regression         
3 glmnet   regression         
4 stan     regression         
5 spark    regression         
6 keras    regression         
7 brulee   regression         
8 quantreg quantile regression

workflow

penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)

penguin_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_mutate()
• step_unknown()
• step_relevel()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

fit

penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)

penguin_fit |> tidy()

# A tibble: 10 × 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        -2417.     665.      -3.64  3.36e- 4
 2 speciesChinstrap    -208.      92.9     -2.24  2.58e- 2
 3 speciesGentoo        985.     152.       6.48  5.02e-10
 4 bill_length_mm        13.5      8.29     1.63  1.04e- 1
 5 bill_depth_mm         80.9     22.1      3.66  3.10e- 4
 6 flipper_length_mm     20.8      3.62     5.74  2.81e- 8
 7 sexmale              351.      52.6      6.67  1.72e-10
 8 sexunknown            47.6    103.       0.460 6.46e- 1
 9 year2008             -24.8     47.5     -0.521 6.03e- 1
10 year2009             -61.9     46.0     -1.35  1.80e- 1
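
The last step in the list, validating the model, is not shown on the slides. A minimal sketch, assuming the test set comes out of your pocket only once, at the end: `augment()` adds predictions to `penguin_test`, and `metrics()` from yardstick summarizes them. The names `penguin_preds` and `penguin_metrics` are introduced here for illustration.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test  <- testing(penguin_split)

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_fit <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_recipe(penguin_recipe) |>
  fit(data = penguin_train)

# predictions for penguins the model has never seen
penguin_preds <- augment(penguin_fit, new_data = penguin_test)

# default regression metrics; rows with a missing truth or
# prediction are dropped before computing them
penguin_metrics <- penguin_preds |>
  metrics(truth = body_mass_g, estimate = .pred)

penguin_metrics
```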

entire process

recipe
model
workflow
fit

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

penguin_recipe

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:     1
predictor:   6
id variable: 1

── Operations

• Variable mutation for: as.factor(year)

• Unknown factor level assignment for: sex

• Re-order factor level to ref_level for: sex

penguin_lm <- linear_reg() |>
  set_engine("lm")

penguin_lm

Linear Regression Model Specification (regression)

Computational engine: lm

penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)

penguin_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_mutate()
• step_unknown()
• step_relevel()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)

penguin_fit |> tidy()

# A tibble: 10 × 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        -2417.     665.      -3.64  3.36e- 4
 2 speciesChinstrap    -208.      92.9     -2.24  2.58e- 2
 3 speciesGentoo        985.     152.       6.48  5.02e-10
 4 bill_length_mm        13.5      8.29     1.63  1.04e- 1
 5 bill_depth_mm         80.9     22.1      3.66  3.10e- 4
 6 flipper_length_mm     20.8      3.62     5.74  2.81e- 8
 7 sexmale              351.      52.6      6.67  1.72e-10
 8 sexunknown            47.6    103.       0.460 6.46e- 1
 9 year2008             -24.8     47.5     -0.521 6.03e- 1
10 year2009             -61.9     46.0     -1.35  1.80e- 1

model parameters

Some model parameters are tuned using the data (some aren’t).
- linear model coefficients are optimized (not tuned)
- the k-nn value of “k” is tuned
If a model is tuned using the data, the same data cannot be used to assess the model.
With cross validation, you iteratively put data in your pocket.
For example, keep 1/5 of the data in your pocket and build the model on the remaining 4/5 of the data; then repeat, each time with a different 1/5 in your pocket.
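
The "1/5 in your pocket" idea can be sketched with `vfold_cv()` and `fit_resamples()`. The seed, the simple formula, and the names `penguin_folds`, `lm_wflow`, and `lm_cv` are choices made for this illustration only.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins)) |>
  drop_na(body_mass_g)

set.seed(4747)
penguin_folds <- vfold_cv(penguin_train, v = 5)

# each fold: about 4/5 to build the model, about 1/5 in your pocket
first_fold <- penguin_folds$splits[[1]]
nrow(analysis(first_fold))
nrow(assessment(first_fold))

lm_wflow <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_formula(body_mass_g ~ flipper_length_mm + species)

# fit on each analysis set, predict its held-out assessment set
lm_cv <- fit_resamples(lm_wflow, resamples = penguin_folds)

# metrics averaged over the 5 folds
cv_metrics <- collect_metrics(lm_cv)
cv_metrics
```

Nothing is tuned here; the folds are only used to assess the linear model without touching the test set.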

Cross validation

for tuning parameters

Image credit: Alison Hill
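
A minimal sketch of tuning with cross validation. To avoid extra packages, it swaps in a decision tree (rpart engine) and tunes `tree_depth` instead of the k-nn "k" mentioned on the slides; tuning k would follow the same pattern with `nearest_neighbor(neighbors = tune())` and the kknn engine. The grid values, seeds, and object names are assumptions for the illustration.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_train <- training(initial_split(penguins)) |>
  drop_na(body_mass_g)

set.seed(4747)
penguin_folds <- vfold_cv(penguin_train, v = 5)

# tune() marks the parameter whose value is chosen by cross validation
tree_spec <- decision_tree(tree_depth = tune()) |>
  set_mode("regression") |>
  set_engine("rpart")

tree_wflow <- workflow() |>
  add_model(tree_spec) |>
  add_formula(body_mass_g ~ species + bill_length_mm +
                bill_depth_mm + flipper_length_mm)

# candidate values to try
tree_grid <- tibble(tree_depth = c(1L, 2L, 4L, 8L))

# for each candidate, fit on every analysis set and
# score on the matching assessment set
tree_tuned <- tune_grid(
  tree_wflow,
  resamples = penguin_folds,
  grid = tree_grid,
  metrics = metric_set(rmse)
)

tree_cv_metrics <- collect_metrics(tree_tuned)
tree_cv_metrics

best_depth <- select_best(tree_tuned, metric = "rmse")
best_depth
```

The test set is never used while choosing `tree_depth`; it stays in your pocket until the final model is assessed.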


