October 20, 2025
tidymodels syntax for recipes
tidymodels syntax
Put the testing data in your pocket (keep it secret from the model / R!!)
Image credit: Julia Silge
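A minimal sketch of putting the testing data in your pocket, assuming the `penguins` data from the palmerpenguins package and the rsample functions `initial_split()`, `training()`, and `testing()`. The seed and the default 3/4 to 1/4 split are assumptions; the slides do not show how `penguin_train` was created.

```r
library(rsample)         # initial_split(), training(), testing()
library(palmerpenguins)  # penguins data (assumed source of the data)

set.seed(47)  # arbitrary seed, only so the split is reproducible

# Hold out a random quarter of the rows (rsample's default is prop = 3/4).
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test  <- testing(penguin_split)  # stays in your pocket until the end

nrow(penguin_train)
nrow(penguin_test)
```

Everything that follows (recipes, model fitting, cross validation) would then use only `penguin_train`.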
recipe()
Feature engineering is the process of transforming raw data into features (variables) that are better predictors (for the model at hand) of the response variable.
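As a small illustration of feature engineering with a recipe, the sketch below imputes, rescales, and dummy-codes two predictors. The choice of variables and steps is an assumption made for illustration (it is not the recipe used later in these slides), and `fe_recipe` and `fe_data` are hypothetical names.

```r
library(recipes)
library(palmerpenguins)

fe_recipe <-
  recipe(body_mass_g ~ species + bill_length_mm, data = penguins) |>
  step_impute_median(bill_length_mm) |>  # fill in missing bill lengths
  step_normalize(bill_length_mm) |>      # center and scale to mean 0, sd 1
  step_dummy(species)                    # indicator columns for species

# prep() estimates the median, mean, and sd from the data;
# bake(new_data = NULL) returns the processed version of that same data.
fe_data <- fe_recipe |>
  prep() |>
  bake(new_data = NULL)

names(fe_data)
```

The recipe only describes the transformations; nothing is computed until `prep()` is called.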
step_ functions
For more information: https://recipes.tidymodels.org/reference/index.html
[1] "step_arrange" "step_bagimpute"
[3] "step_bin2factor" "step_BoxCox"
[5] "step_bs" "step_center"
[7] "step_classdist" "step_classdist_shrunken"
[9] "step_corr" "step_count"
[11] "step_cut" "step_date"
[13] "step_depth" "step_discretize"
[15] "step_dummy" "step_dummy_extract"
[17] "step_dummy_multi_choice" "step_factor2string"
[19] "step_filter" "step_filter_missing"
[21] "step_geodist" "step_harmonic"
[23] "step_holiday" "step_hyperbolic"
[25] "step_ica" "step_impute_bag"
[27] "step_impute_knn" "step_impute_linear"
[29] "step_impute_lower" "step_impute_mean"
[31] "step_impute_median" "step_impute_mode"
[33] "step_impute_roll" "step_indicate_na"
[35] "step_integer" "step_interact"
[37] "step_intercept" "step_inverse"
[39] "step_invlogit" "step_isomap"
[41] "step_knnimpute" "step_kpca"
[43] "step_kpca_poly" "step_kpca_rbf"
[45] "step_lag" "step_lincomb"
[47] "step_log" "step_logit"
[49] "step_lowerimpute" "step_meanimpute"
[51] "step_medianimpute" "step_modeimpute"
[53] "step_mutate" "step_mutate_at"
[55] "step_naomit" "step_nnmf"
[57] "step_nnmf_sparse" "step_normalize"
[59] "step_novel" "step_ns"
[61] "step_num2factor" "step_nzv"
[63] "step_ordinalscore" "step_other"
[65] "step_pca" "step_percentile"
[67] "step_pls" "step_poly"
[69] "step_poly_bernstein" "step_profile"
[71] "step_range" "step_ratio"
[73] "step_regex" "step_relevel"
[75] "step_relu" "step_rename"
[77] "step_rename_at" "step_rm"
[79] "step_rollimpute" "step_sample"
[81] "step_scale" "step_select"
[83] "step_shuffle" "step_slice"
[85] "step_spatialsign" "step_spline_b"
[87] "step_spline_convex" "step_spline_monotone"
[89] "step_spline_natural" "step_spline_nonnegative"
[91] "step_sqrt" "step_string2factor"
[93] "step_time" "step_unknown"
[95] "step_unorder" "step_window"
[97] "step_YeoJohnson" "step_zv"
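A listing like the one above can be produced along these lines. This is a sketch, not necessarily the command used for the slides, and `step_names` is a hypothetical name; the exact set of functions depends on the installed version of recipes.

```r
library(recipes)

# Exported objects in the attached recipes package whose names start with "step_".
step_names <- ls("package:recipes", pattern = "^step_")

head(step_names)
length(step_names)  # varies with the installed recipes version
```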
Image credit: Alison Hill
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
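penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  # treat year as a categorical variable
  step_mutate(year = as.factor(year)) |>
  # recode missing values of sex as their own level
  step_unknown(sex, new_level = "unknown") |>
  # make "female" the reference level
  step_relevel(sex, ref_level = "female") |>
  # keep island in the data, but do not use it as a predictor
  update_role(island, new_role = "id variable")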
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 6
id variable: 1
── Operations
• Variable mutation for: as.factor(year)
• Unknown factor level assignment for: sex
• Re-order factor level to ref_level for: sex
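To see what the recipe does to the data, it can be prepped and baked. The sketch below assumes `penguin_train` comes from a default `initial_split()` of the palmerpenguins data with an arbitrary seed (the original split is not shown), and `penguin_roles` and `penguin_baked` are hypothetical names.

```r
library(recipes)
library(rsample)
library(palmerpenguins)

set.seed(47)  # arbitrary; the original split is not shown
penguin_train <- training(initial_split(penguins))

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

# One row per variable, with its role (island is the "id variable").
penguin_roles <- summary(penguin_recipe)
penguin_roles

# Estimate the steps on the training data, then return the processed training data.
penguin_baked <- penguin_recipe |>
  prep() |>
  bake(new_data = NULL)

levels(penguin_baked$sex)  # "female" comes first; "unknown" has been added
class(penguin_baked$year)  # year is now a factor
```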
To specify a model:
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_mutate()
• step_unknown()
• step_relevel()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
# A tibble: 10 × 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)        -2417.     665.      -3.64  3.36e- 4
 2 speciesChinstrap    -208.      92.9     -2.24  2.58e- 2
 3 speciesGentoo        985.     152.       6.48  5.02e-10
 4 bill_length_mm        13.5      8.29     1.63  1.04e- 1
 5 bill_depth_mm         80.9     22.1      3.66  3.10e- 4
 6 flipper_length_mm     20.8      3.62     5.74  2.81e- 8
 7 sexmale              351.      52.6      6.67  1.72e-10
 8 sexunknown            47.6    103.       0.460 6.46e- 1
 9 year2008             -24.8     47.5     -0.521 6.03e- 1
10 year2009             -61.9     46.0     -1.35  1.80e- 1
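The workflow printout and coefficient table above are consistent with code along the following lines. This is a sketch under assumptions: the object names `penguin_lm`, `penguin_wflow`, `penguin_fit`, and `penguin_coefs` are hypothetical, and the seed and split are arbitrary, so the estimates will not match the table exactly.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)  # arbitrary; the original split is not shown
penguin_train <- training(initial_split(penguins))

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>
  step_unknown(sex, new_level = "unknown") |>
  step_relevel(sex, ref_level = "female") |>
  update_role(island, new_role = "id variable")

# Model specification: linear regression fit with lm().
penguin_lm <- linear_reg() |>
  set_engine("lm")

# A workflow bundles the preprocessor (recipe) with the model.
penguin_wflow <- workflow() |>
  add_model(penguin_lm) |>
  add_recipe(penguin_recipe)

# Fitting preps the recipe on the training data and then fits the model.
penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)

penguin_coefs <- tidy(penguin_fit)
penguin_coefs
```

Because `island` has the role "id variable", it does not appear among the coefficients.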
Some model parameters are tuned from the data (some aren’t).
If the model is tuned using the data, the same data cannot be used to assess the model.
With cross validation, you iteratively put data in your pocket.
For example, keep 1/5 of the data in your pocket and build the model on the remaining 4/5 of the data; then repeat so that each fifth takes a turn in your pocket.
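The 1/5 and 4/5 idea can be sketched with `vfold_cv()` from rsample and `fit_resamples()` from tune. The model formula, the seed, and the names `penguin_complete`, `penguin_folds`, `cv_wflow`, `cv_results`, and `cv_metrics` are assumptions chosen for illustration.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)  # arbitrary seed, only so the folds are reproducible

# Drop rows with a missing response so every held-out fold can be scored.
penguin_complete <- penguins |>
  dplyr::filter(!is.na(body_mass_g))

# The test data go in the pocket first; cross validation uses only the training data.
penguin_train <- training(initial_split(penguin_complete))

# v = 5: each fold holds out about 1/5 of the training rows in turn.
penguin_folds <- vfold_cv(penguin_train, v = 5)

cv_wflow <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_formula(body_mass_g ~ species + flipper_length_mm)

# Fit on 4/5 of the rows and assess on the held-out 1/5, five times.
cv_results <- fit_resamples(cv_wflow, resamples = penguin_folds)

# Performance metrics averaged over the five held-out folds.
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```

Each row of the training data is held out exactly once, so the averaged metrics are computed on data the corresponding model did not see.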
Image credit: Alison Hill