Using Recipes

class: right, top, my-title, title-slide

# Using Recipes
### Jo Hardin
### October 21, 2021

---

# Agenda 10/21/21

1. What needs to be done to the data?
2. `tidymodels` syntax for recipes  
   a. recipe / feature engineering  
   b. model  
   c. workflow  
   d. fit  
   e. validate  
3. Example

---

## Modeling

```r
library(tidymodels)
```

---

## Motivation

<img src="../images/garbage.png" width="2371" />
---

## `tidymodels` syntax

1. partition the data
2. build a recipe
3. select a model
4. create a workflow
5. fit the model  
6. (validate the model)

---

## partition the data

Put the testing data in your pocket (keep it secret from R!!)

<div class="figure">
<img src="../images/testtrain.png" alt="Image credit: Julia Silge" width="2843" />
<p class="caption">Image credit: Julia Silge</p>
</div>
---

## partition the data

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
```
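
The split above relies on the `initial_split()` defaults. A minimal sketch, assuming you want to write the training proportion out explicitly and stratify on `species` (the `_strat` object names and the choice of strata are illustrative, not from the original slides):

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)

# prop = 3/4 is the rsample default, written out here for clarity;
# strata = species is an illustrative choice
penguin_split_strat <- initial_split(penguins, prop = 3/4, strata = species)
penguin_train_strat <- training(penguin_split_strat)
penguin_test_strat  <- testing(penguin_split_strat)

# every row lands in exactly one of the two sets
nrow(penguin_train_strat) + nrow(penguin_test_strat) == nrow(penguins)

# species proportions should be similar in the two sets
count(penguin_train_strat, species)
count(penguin_test_strat, species)
```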

---
# build a recipe

1. Start the `recipe()`
2. Define the **variables** involved
3. Describe preprocessing **step-by-step**

---

# feature engineering

> feature engineering is the process of transforming raw data into features (variables) that are better predictors (for the model at hand) of the response

* create new variables (e.g., combine levels -> from state to region)
* transform variables (e.g., log, polar coordinates)
* continuous variables -> discrete (e.g., binning)
* numerical categorical data -> factors / character strings (one hot encoding)
* time -> discretized time
* missing values -> imputed
* NA -> level
* continuous variables -> center & scale ("normalize")
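
Several of the items above map onto `step_` functions. A rough sketch, assuming the full `penguins` data (used here only for illustration; in practice the recipe is built on the training data) and a hypothetical recipe name `fe_recipe`:

```r
library(tidymodels)
library(palmerpenguins)

fe_recipe <- recipe(body_mass_g ~ ., data = penguins) %>%
  # missing values -> imputed
  step_impute_median(all_numeric_predictors()) %>%
  # NA -> level
  step_unknown(sex, new_level = "unknown") %>%
  # transform a variable
  step_log(bill_length_mm) %>%
  # center & scale ("normalize")
  step_normalize(all_numeric_predictors()) %>%
  # factors -> indicator columns
  # (one_hot = TRUE would give a full one hot encoding)
  step_dummy(all_nominal_predictors())

# prep() estimates the steps; bake(new_data = NULL) returns the processed data
fe_baked <- fe_recipe %>%
  prep() %>%
  bake(new_data = NULL)

names(fe_baked)
```

The order of the steps matters: imputation comes before the log and the normalization so that those steps see complete columns.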

---

# `step_` functions

For more information: https://recipes.tidymodels.org/reference/index.html

```r
apropos("^step_")
```

```
##  [1] "step_arrange"       "step_bagimpute"     "step_bin2factor"   
##  [4] "step_BoxCox"        "step_bs"            "step_center"       
##  [7] "step_classdist"     "step_corr"          "step_count"        
## [10] "step_cut"           "step_date"          "step_depth"        
## [13] "step_discretize"    "step_downsample"    "step_dummy"        
## [16] "step_factor2string" "step_filter"        "step_geodist"      
## [19] "step_holiday"       "step_hyperbolic"    "step_ica"          
## [22] "step_impute_bag"    "step_impute_knn"    "step_impute_linear"
## [25] "step_impute_lower"  "step_impute_mean"   "step_impute_median"
## [28] "step_impute_mode"   "step_impute_roll"   "step_indicate_na"  
## [31] "step_integer"       "step_interact"      "step_intercept"    
## [34] "step_inverse"       "step_invlogit"      "step_isomap"       
## [37] "step_knnimpute"     "step_kpca"          "step_kpca_poly"    
## [40] "step_kpca_rbf"      "step_lag"           "step_lincomb"      
## [43] "step_log"           "step_logit"         "step_lowerimpute"  
## [46] "step_meanimpute"    "step_medianimpute"  "step_modeimpute"   
## [49] "step_mutate"        "step_mutate_at"     "step_naomit"       
## [52] "step_nnmf"          "step_normalize"     "step_novel"        
## [55] "step_ns"            "step_num2factor"    "step_nzv"          
## [58] "step_ordinalscore"  "step_other"         "step_pca"          
## [61] "step_pls"           "step_poly"          "step_profile"      
## [64] "step_range"         "step_ratio"         "step_regex"        
## [67] "step_relevel"       "step_relu"          "step_rename"       
## [70] "step_rename_at"     "step_rm"            "step_rollimpute"   
## [73] "step_sample"        "step_scale"         "step_select"       
## [76] "step_shuffle"       "step_slice"         "step_spatialsign"  
## [79] "step_sqrt"          "step_string2factor" "step_unknown"      
## [82] "step_unorder"       "step_upsample"      "step_window"       
## [85] "step_YeoJohnson"    "step_zv"
```

---

## the data: penguins

<div class="figure" style="text-align: right">
<img src="../images/penguins.png" alt="Image credit: Alison Hill" width="30%" />
<p class="caption">Image credit: Alison Hill</p>
</div>

```r
penguins
```

```
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```

---

## recipe

```r
penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm + 
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) %>%
  step_mutate(year = as.factor(year)) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_relevel(sex, ref_level = "female") %>%
  update_role(island, new_role = "id variable")
```

```r
summary(penguin_recipe)
```

```
## # A tibble: 8 × 4
##   variable          type    role        source  
##   <chr>             <chr>   <chr>       <chr>   
## 1 species           nominal predictor   original
## 2 island            nominal id variable original
## 3 bill_length_mm    numeric predictor   original
## 4 bill_depth_mm     numeric predictor   original
## 5 flipper_length_mm numeric predictor   original
## 6 sex               nominal predictor   original
## 7 year              numeric predictor   original
## 8 body_mass_g       numeric outcome     original
```
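
Note that `summary()` still lists `year` as numeric: the recipe is only a plan until it is prepped. A minimal sketch of looking at the processed training data with `prep()` and `bake()` (the name `penguin_baked` is an assumption for illustration):

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) %>%
  step_mutate(year = as.factor(year)) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_relevel(sex, ref_level = "female") %>%
  update_role(island, new_role = "id variable")

# prep() estimates the steps on the training data;
# bake(new_data = NULL) returns the processed training data
penguin_baked <- penguin_recipe %>%
  prep() %>%
  bake(new_data = NULL)

class(penguin_baked$year)   # year is a factor after step_mutate()
levels(penguin_baked$sex)   # sex has gained the "unknown" level
```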

---

## model

To specify a model:

1. pick a **model**
2. set the **mode** (regression vs classification, if needed)
3. set the **engine**

```r
penguin_lm <- linear_reg() %>%
  set_engine("lm")
```

```r
penguin_lm
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
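
`linear_reg()` only does regression, so no mode was set above. A sketch of a specification where the mode is needed, assuming a k-nn model with `neighbors = 5` (an arbitrary illustrative value; the name `penguin_knn` is also hypothetical). Fitting it would require the `kknn` package to be installed:

```r
library(tidymodels)

# nearest_neighbor() can do regression or classification,
# so the mode has to be stated explicitly
penguin_knn <- nearest_neighbor(neighbors = 5) %>%
  set_mode("regression") %>%
  set_engine("kknn")

penguin_knn
```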

---
### engines

```r
show_engines("nearest_neighbor")
```

```
## # A tibble: 2 × 2
##   engine mode          
##   <chr>  <chr>         
## 1 kknn   classification
## 2 kknn   regression
```

```r
show_engines("decision_tree")
```

```
## # A tibble: 5 × 2
##   engine mode          
##   <chr>  <chr>         
## 1 rpart  classification
## 2 rpart  regression    
## 3 C5.0   classification
## 4 spark  classification
## 5 spark  regression
```

---
### engines

```r
show_engines("rand_forest")
```

```
## # A tibble: 6 × 2
##   engine       mode          
##   <chr>        <chr>         
## 1 ranger       classification
## 2 ranger       regression    
## 3 randomForest classification
## 4 randomForest regression    
## 5 spark        classification
## 6 spark        regression
```

---
### engines

```r
show_engines("svm_poly")
```

```
## # A tibble: 2 × 2
##   engine  mode          
##   <chr>   <chr>         
## 1 kernlab classification
## 2 kernlab regression
```

```r
show_engines("svm_rbf")
```

```
## # A tibble: 4 × 2
##   engine    mode          
##   <chr>     <chr>         
## 1 kernlab   classification
## 2 kernlab   regression    
## 3 liquidSVM classification
## 4 liquidSVM regression
```

---
### engines

```r
show_engines("linear_reg")
```

```
## # A tibble: 5 × 2
##   engine mode      
##   <chr>  <chr>     
## 1 lm     regression
## 2 glmnet regression
## 3 stan   regression
## 4 spark  regression
## 5 keras  regression
```

---

## workflow

```r
penguin_wflow <- workflow() %>%
  add_model(penguin_lm) %>%
  add_recipe(penguin_recipe)
```

```r
penguin_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_mutate()
## • step_unknown()
## • step_relevel()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---

## fit

```r
penguin_fit <- penguin_wflow %>%
  fit(data = penguin_train)
```

```r
penguin_fit %>% tidy()
```

```
## # A tibble: 10 × 5
##    term               estimate std.error statistic    p.value
##    <chr>                 <dbl>     <dbl>     <dbl>      <dbl>
##  1 (Intercept)       -2417.5    664.73    -3.6368  3.3624e- 4
##  2 speciesChinstrap   -208.39    92.899   -2.2432  2.5776e- 2
##  3 speciesGentoo       984.90   152.04     6.4781  5.0203e-10
##  4 bill_length_mm       13.531    8.2871   1.6328  1.0378e- 1
##  5 bill_depth_mm        80.899   22.112    3.6586  3.1028e- 4
##  6 flipper_length_mm    20.771    3.6200   5.7378  2.8080e- 8
##  7 sexmale             350.57    52.597    6.6651  1.7239e-10
##  8 sexunknown           47.576  103.32     0.46049 6.4557e- 1
##  9 year2008            -24.774   47.511   -0.52145 6.0252e- 1
## 10 year2009            -61.895   46.008   -1.3453  1.7976e- 1
```
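
A fitted workflow can also be used to predict. A minimal sketch, assuming the first five training rows stand in for new observations (an illustrative choice; the test data stays in your pocket):

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) %>%
  step_mutate(year = as.factor(year)) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_relevel(sex, ref_level = "female") %>%
  update_role(island, new_role = "id variable")

penguin_wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(penguin_recipe)

penguin_fit <- penguin_wflow %>%
  fit(data = penguin_train)

# the workflow applies the recipe to new_data before predicting
penguin_preds <- predict(penguin_fit, new_data = head(penguin_train, 5))
penguin_preds
```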

---
## entire process

.panelset[

.panel[.panel-name[recipe]

```r
summary(penguin_recipe)
```
]

.panel[.panel-name[model]

```r
penguin_lm <- linear_reg() %>%
  set_engine("lm")

penguin_lm
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
]

.panel[.panel-name[workflow]

```r
penguin_wflow <- workflow() %>%
  add_model(penguin_lm) %>%
  add_recipe(penguin_recipe)

penguin_wflow
```
]

.panel[.panel-name[fit]

```r
penguin_fit <- penguin_wflow %>%
  fit(data = penguin_train)

penguin_fit %>% tidy()
```

]
]

---

## model parameters

* Some model parameters are tuned using the data; others are not.
  - linear model coefficients are optimized (not tuned)
  - the k-nn value of "k" is tuned

* If the model is tuned using the data, the same data **cannot** be used to assess the model.

* With Cross Validation, you iteratively put data in your pocket.

* For example, keep 1/5 of the data in your pocket and build the model on the remaining 4/5 of the data.
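
The 1/5 versus 4/5 idea can be sketched with `vfold_cv()` and `fit_resamples()`. This is a rough illustration, assuming 5 folds of the training data and the linear model workflow from earlier (the names `penguin_folds` and `penguin_cv` and the second seed are assumptions):

```r
library(tidymodels)
library(palmerpenguins)

set.seed(47)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)

penguin_recipe <-
  recipe(body_mass_g ~ species + island + bill_length_mm +
           bill_depth_mm + flipper_length_mm + sex + year,
         data = penguin_train) %>%
  step_mutate(year = as.factor(year)) %>%
  step_unknown(sex, new_level = "unknown") %>%
  step_relevel(sex, ref_level = "female") %>%
  update_role(island, new_role = "id variable")

penguin_wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(penguin_recipe)

# 5 folds: each fold takes a turn as the 1/5 kept in your pocket
set.seed(4747)
penguin_folds <- vfold_cv(penguin_train, v = 5)

# fit on 4/5 of the training data, assess on the held-out 1/5, five times
penguin_cv <- penguin_wflow %>%
  fit_resamples(resamples = penguin_folds)

# performance averaged over the 5 held-out folds
collect_metrics(penguin_cv)
```

Only the training data is resampled here; the test data is never touched.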