October 21, 2024
tidymodels syntax for recipes

Put the testing data in your pocket (keep it secret from R!!)
recipe()
Feature engineering is the process of transforming raw data into features (variables) that are better predictors (for the model at hand) of the response variable.
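For example, a recipe step might log-transform a skewed predictor or pull the day of week out of a date; a minimal sketch with hypothetical variable names (train_data, price, purchase_date are assumptions, not from the example below):

library(recipes)
recipe(outcome ~ ., data = train_data) |>
  step_log(price, base = 10) |>                # compress a right-skewed predictor
  step_date(purchase_date, features = "dow")   # day of week as a new feature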
step_ functions

For more information: https://recipes.tidymodels.org/reference/index.html
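The list below can be pulled straight from an R session; a minimal sketch, assuming the recipes package is attached (the exact call used to produce this output is an assumption):

library(recipes)
# every exported object in recipes whose name starts with "step_"
grep("^step_", ls("package:recipes"), value = TRUE)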
[1] "step_arrange" "step_bagimpute"
[3] "step_bin2factor" "step_BoxCox"
[5] "step_bs" "step_center"
[7] "step_classdist" "step_classdist_shrunken"
[9] "step_corr" "step_count"
[11] "step_cut" "step_date"
[13] "step_depth" "step_discretize"
[15] "step_dummy" "step_dummy_extract"
[17] "step_dummy_multi_choice" "step_factor2string"
[19] "step_filter" "step_filter_missing"
[21] "step_geodist" "step_harmonic"
[23] "step_holiday" "step_hyperbolic"
[25] "step_ica" "step_impute_bag"
[27] "step_impute_knn" "step_impute_linear"
[29] "step_impute_lower" "step_impute_mean"
[31] "step_impute_median" "step_impute_mode"
[33] "step_impute_roll" "step_indicate_na"
[35] "step_integer" "step_interact"
[37] "step_intercept" "step_inverse"
[39] "step_invlogit" "step_isomap"
[41] "step_knnimpute" "step_kpca"
[43] "step_kpca_poly" "step_kpca_rbf"
[45] "step_lag" "step_lincomb"
[47] "step_log" "step_logit"
[49] "step_lowerimpute" "step_meanimpute"
[51] "step_medianimpute" "step_modeimpute"
[53] "step_mutate" "step_mutate_at"
[55] "step_naomit" "step_nnmf"
[57] "step_nnmf_sparse" "step_normalize"
[59] "step_novel" "step_ns"
[61] "step_num2factor" "step_nzv"
[63] "step_ordinalscore" "step_other"
[65] "step_pca" "step_percentile"
[67] "step_pls" "step_poly"
[69] "step_poly_bernstein" "step_profile"
[71] "step_range" "step_ratio"
[73] "step_regex" "step_relevel"
[75] "step_relu" "step_rename"
[77] "step_rename_at" "step_rm"
[79] "step_rollimpute" "step_sample"
[81] "step_scale" "step_select"
[83] "step_shuffle" "step_slice"
[85] "step_spatialsign" "step_spline_b"
[87] "step_spline_convex" "step_spline_monotone"
[89] "step_spline_natural" "step_spline_nonnegative"
[91] "step_sqrt" "step_string2factor"
[93] "step_time" "step_unknown"
[95] "step_unorder" "step_window"
[97] "step_YeoJohnson" "step_zv"
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
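The recipe below is built on penguin_train, the portion of the data that is not in your pocket; a minimal sketch of how such a split might be created with rsample (the seed and object names are assumptions):

library(tidymodels)
set.seed(47)
penguin_split <- initial_split(penguins, prop = 0.75)
penguin_train <- training(penguin_split)   # used for the recipe and the model fit
penguin_test  <- testing(penguin_split)    # kept in your pocket, untouched until the end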
penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) |>
  step_mutate(year = as.factor(year)) |>           # treat year as categorical
  step_unknown(sex, new_level = "unknown") |>      # give missing sex its own level
  step_relevel(sex, ref_level = "female") |>       # make "female" the reference level
  update_role(island, new_role = "id variable")    # island labels rows; it is not a predictor

summary(penguin_recipe)
# A tibble: 8 × 4
variable type role source
<chr> <list> <chr> <chr>
1 species <chr [3]> predictor original
2 island <chr [3]> id variable original
3 bill_length_mm <chr [2]> predictor original
4 bill_depth_mm <chr [2]> predictor original
5 flipper_length_mm <chr [2]> predictor original
6 sex <chr [3]> predictor original
7 year <chr [2]> predictor original
8 body_mass_g <chr [2]> outcome original
To specify a model:
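A sketch of the model and workflow specification that would produce the printout below (the object names penguin_lm and penguin_wflow are assumptions):

penguin_lm <- linear_reg() |>
  set_engine("lm")                  # ordinary least squares via lm()

penguin_wflow <- workflow() |>
  add_recipe(penguin_recipe) |>     # preprocessor: the recipe above
  add_model(penguin_lm)             # model: linear regression

penguin_wflow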
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_mutate()
• step_unknown()
• step_relevel()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
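The coefficient table below is the tidied fit of that workflow on the training data; a minimal sketch, assuming the objects defined above:

penguin_fit <- penguin_wflow |>
  fit(data = penguin_train)   # preps the recipe on penguin_train, then fits lm()

penguin_fit |>
  tidy()                      # coefficient estimates, standard errors, and p-values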
# A tibble: 10 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2417. 665. -3.64 3.36e- 4
2 speciesChinstrap -208. 92.9 -2.24 2.58e- 2
3 speciesGentoo 985. 152. 6.48 5.02e-10
4 bill_length_mm 13.5 8.29 1.63 1.04e- 1
5 bill_depth_mm 80.9 22.1 3.66 3.10e- 4
6 flipper_length_mm 20.8 3.62 5.74 2.81e- 8
7 sexmale 351. 52.6 6.67 1.72e-10
8 sexunknown 47.6 103. 0.460 6.46e- 1
9 year2008 -24.8 47.5 -0.521 6.03e- 1
10 year2009 -61.9 46.0 -1.35 1.80e- 1
Some model parameters are tuned from the data (some aren’t).
If the model is tuned using the data, the same data cannot be used to assess the model.
With cross validation, you iteratively put data in your pocket.
For example, keep 1/5 of the data in your pocket, build the model on the remaining 4/5, then assess it on the held-out 1/5; repeat so that each fifth takes a turn in your pocket.
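A minimal sketch of 5-fold cross-validation with rsample and tune, assuming the penguin_train and penguin_wflow objects above (the seed and fold object name are assumptions):

set.seed(47)
penguin_folds <- vfold_cv(penguin_train, v = 5)   # each fold keeps 1/5 in your pocket

penguin_wflow |>
  fit_resamples(resamples = penguin_folds) |>     # fit on 4/5, assess on the held-out 1/5
  collect_metrics()                               # metrics averaged across the five folds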