Extensions

class: right, top, my-title, title-slide

# Extensions
### Jo Hardin
### December 07, 2021

---

# Agenda 12/07/21

1. Return clickers
2. What's next?
3. Course evaluations

---

## Parallel computing

Set up the clusters to bootstrap the mean of the penguin `mass_g` (`Chinstrap` only).

```r
data(penguins)

penguin_bs <- penguins %>%
  filter(species == "Chinstrap") 
```

```r
P <- parallel::detectCores(logical=FALSE)
P
```

```
## [1] 8
```

```r
cl <- parallel::makeCluster(P, setup_strategy = "sequential")
clusterEvalQ(cl, library("tidyverse"))
```

```
## [[1]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[5]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[6]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[7]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[8]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"
```

---

## Bootstrap in parallel

```r
doParallel::registerDoParallel(cl)
bsmean_mass <- foreach(i = 1:100, .combine = 'c') %dopar% {
  penguin_bs %>%
    mutate(mass_bs = sample(body_mass_g, replace = TRUE)) %>%
    summarize(mean(mass_bs)) %>% pull()
}
boot_results <- tibble(bsmean_mass)
stopCluster(cl)
```

```r
boot_results
```

```
## # A tibble: 100 × 1
##    bsmean_mass
##          <dbl>
##  1      3730.9
##  2      3739.0
##  3      3675.7
##  4      3712.9
##  5      3800.4
##  6      3763.2
##  7      3665.8
##  8      3744.9
##  9      3657.0
## 10      3735.7
## # … with 90 more rows
```

---

##  Graphing

```r
ggplot(boot_results, aes(x = bsmean_mass)) + 
  geom_histogram(bins = 25) + 
  ggtitle("Histogram of 100 Bootstrapped Means using foreach")
```

![](2021-12-07-extensions_files/figure-html/unnamed-chunk-7-1.png)

---

## Python in R

The main R package which allows Python use in R is called `reticulate`.

```r
library(reticulate)
use_virtualenv("r-reticulate")
reticulate::import("statsmodels")
```

```
## Module(statsmodels)
```

---

## Python code in an RMarkdown chunk

.panelset[
.panel[.panel-name[Rmd]
````
```{python}
import pandas
flights = pandas.read_csv("flights.csv")
flights = flights[flights["dest"] == "ORD"]
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights = flights.dropna()
```
````
]
.panel[.panel-name[html]
A view of the Python chunk which is actually run:

```python
import pandas
flights = pandas.read_csv("flights.csv")
flights = flights[flights["dest"] == "ORD"]
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights = flights.dropna()
```
]]

---

## Computations using `pandas`

.panelset[
.panel[.panel-name[Rmd]
````
```{python}
flights = pandas.read_csv("flights.csv")
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights.groupby("carrier").mean()
```
````
]

.panel[.panel-name[html]

```python
flights = pandas.read_csv("flights.csv")
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights.groupby("carrier").mean()
```

```
##          dep_delay  arr_delay
## carrier                      
## AA        8.586016   0.364291
## AS        5.804775  -9.930889
## DL        9.264505   1.644341
## UA       12.106073   3.558011
## US        3.782418   2.129595
```
]
]

---

## From Python chunk to R chunk

* `py$x` accesses an `x` variable created within Python from R
* `r.x` accesses an `x` variable created within R from Python

---

### SQL in R

Setting up the SQL connection to an external databse

```r
library(RMySQL)
library(DBI)
con <- dbConnect(
  MySQL(), host = "XXX", user = "XXX", 
  password = "XXX", dbname = "XXX")
```

---

## Running SQL code in R

```r
dbGetQuery(con, 
  "SELECT subject_race, AVG(subject_age) AS 'ave age' 
  FROM TNnashville 
  GROUP BY subject_race
  ORDER BY `ave age`")
```

```
##             subject_race ave age
## 1               hispanic  33.161
## 2                  other  34.497
## 3                     NA  35.162
## 4                  black  36.126
## 5                unknown  36.259
## 6 asian/pacific islander  36.874
## 7                  white  38.134
```

---

## SQL code in a SQL chunk

.panelset[
.panel[.panel-name[Rmd]
````
```{sql, connection = con}
SELECT subject_race, AVG(subject_age) AS 'ave age' 
  FROM TNnashville 
  GROUP BY subject_race
  ORDER BY `ave age`
```
````
]

.panel[.panel-name[html]

```sql
SELECT subject_race, AVG(subject_age) AS 'ave age' 
  FROM TNnashville 
  GROUP BY subject_race
  ORDER BY `ave age`
```

Table: 7 records

|subject_race           | ave age|
|:----------------------|-------:|
|hispanic               |  33.161|
|other                  |  34.497|
|NA                     |  35.162|
|black                  |  36.126|
|unknown                |  36.259|
|asian/pacific islander |  36.874|
|white                  |  38.134|

</div>
]
.panel[.panel-name[SQL to R]
````
```{sql, connection = con, output.var = "TN_age"}
SELECT subject_race, AVG(subject_age) AS 'ave age' 
  FROM TNnashville 
  GROUP BY subject_race
  ORDER BY `ave age`
```
````

```sql
SELECT subject_race, AVG(subject_age) AS 'ave age' 
  FROM TNnashville 
  GROUP BY subject_race
  ORDER BY `ave age`
```
]

.panel[.panel-name[R obj]

(From an R chunk)

```r
TN_age
```

]
]

---

## Regular expressions (strings)

> A concise mechanism for describing patterns in character strings

* regular expressions
* `stringr`

---

## Weather Data

[https://www.timeanddate.com/weather/](https://www.timeanddate.com/weather/)

---

## Weather Data

```r
library(rvest)
weather <- read_html("https://www.timeanddate.com/weather/")
w <- html_table(html_nodes(weather, "table"))[[1]]
```

(tiny bit of wrangling of columns here)

```r
w
```

```
## # A tibble: 139 × 3
##    city         time        temp  
##    <chr>        <chr>       <chr> 
##  1 Accra        Tue 2:44 pm 90 °F 
##  2 Adelaide *   Wed 1:14 am 56 °F 
##  3 Algiers      Tue 3:44 pm 67 °F 
##  4 Almaty       Tue 8:44 pm 39 °F 
##  5 Amman        Tue 4:44 pm 61 °F 
##  6 Amsterdam    Tue 3:44 pm 41 °F 
##  7 Anadyr       Wed 2:44 am -10 °F
##  8 Anchorage    Tue 5:44 am 31 °F 
##  9 Ankara       Tue 5:44 pm 59 °F 
## 10 Antananarivo Tue 5:44 pm 79 °F 
## # … with 129 more rows
```

---

## Matching patterns

```r
w %>% filter(str_detect(city, "La"))
```

```
## # A tibble: 5 × 3
##   city           time         temp 
##   <chr>          <chr>        <chr>
## 1 La Paz         Tue 10:44 am 45 °F
## 2 Lagos          Tue 3:44 pm  90 °F
## 3 Lahore         Tue 7:44 pm  64 °F
## 4 Las Vegas      Tue 6:44 am  55 °F
## 5 Salt Lake City Tue 7:44 am  35 °F
```

---

## Matching patterns

To indicate that the pattern must appear at the start of a character string use `^`.

```r
w %>% filter(str_detect(city, "^La"))
```

```
## # A tibble: 4 × 3
##   city      time         temp 
##   <chr>     <chr>        <chr>
## 1 La Paz    Tue 10:44 am 45 °F
## 2 Lagos     Tue 3:44 pm  90 °F
## 3 Lahore    Tue 7:44 pm  64 °F
## 4 Las Vegas Tue 6:44 am  55 °F
```

... or the end of a character string use `$`.

```r
w %>% filter(str_detect(city, "La$"))
```

```
## # A tibble: 0 × 3
## # … with 3 variables: city <chr>, time <chr>, temp <chr>
```

---

## Matching patterns

If case doesn't matter use `(?i)`.

```r
w %>% filter(str_detect(city, "(?i)La"))
```

```
## # A tibble: 16 × 3
##    city           time         temp 
##    <chr>          <chr>        <chr>
##  1 Adelaide *     Wed 1:14 am  56 °F
##  2 Atlanta        Tue 9:44 am  37 °F
##  3 Auckland *     Wed 3:44 am  66 °F
##  4 Casablanca *   Tue 3:44 pm  63 °F
##  5 Dallas         Tue 8:44 am  38 °F
##  6 Dar es Salaam  Tue 5:44 pm  82 °F
##  7 Guatemala City Tue 8:44 am  64 °F
##  8 Islamabad      Tue 7:44 pm  64 °F
##  9 Kuala Lumpur   Tue 10:44 pm 79 °F
## 10 La Paz         Tue 10:44 am 45 °F
## 11 Lagos          Tue 3:44 pm  90 °F
## 12 Lahore         Tue 7:44 pm  64 °F
## 13 Las Vegas      Tue 6:44 am  55 °F
## 14 Manila         Tue 10:44 pm 78 °F
## 15 Philadelphia   Tue 9:44 am  38 °F
## 16 Salt Lake City Tue 7:44 am  35 °F
```

---

## Matching patterns

Match for Wednesday (only a few places are Wednesday right now at Tuesday, 6:45am PT).

```r
w %>% filter(str_detect(time, "Wed"))
```

```
## # A tibble: 10 × 3
##    city        time         temp  
##    <chr>       <chr>        <chr> 
##  1 Adelaide *  Wed 1:14 am  56 °F 
##  2 Anadyr      Wed 2:44 am  -10 °F
##  3 Auckland *  Wed 3:44 am  66 °F 
##  4 Brisbane    Wed 12:44 am 73 °F 
##  5 Canberra *  Wed 1:44 am  61 °F 
##  6 Darwin      Wed 12:14 am 81 °F 
##  7 Kiritimati  Wed 4:44 am  79 °F 
##  8 Melbourne * Wed 1:44 am  52 °F 
##  9 Suva        Wed 2:44 am  78 °F 
## 10 Sydney *    Wed 1:44 am  68 °F
```

---

## Matching patterns

Or not Wednesday

```r
w %>% filter(str_detect(time, "Wed", negate = TRUE))
```

```
## # A tibble: 129 × 3
##    city         time         temp 
##    <chr>        <chr>        <chr>
##  1 Accra        Tue 2:44 pm  90 °F
##  2 Algiers      Tue 3:44 pm  67 °F
##  3 Almaty       Tue 8:44 pm  39 °F
##  4 Amman        Tue 4:44 pm  61 °F
##  5 Amsterdam    Tue 3:44 pm  41 °F
##  6 Anchorage    Tue 5:44 am  31 °F
##  7 Ankara       Tue 5:44 pm  59 °F
##  8 Antananarivo Tue 5:44 pm  79 °F
##  9 Asuncion *   Tue 11:44 am 86 °F
## 10 Athens       Tue 4:44 pm  60 °F
## # … with 119 more rows
```

---

## Matching patterns

```r
w %>%
  mutate(temp_num = str_sub(temp, 1, nchar(temp) -1)) %>% head(3)
```

```
## # A tibble: 3 × 4
##   city       time        temp  temp_num
##   <chr>      <chr>       <chr> <chr>   
## 1 Accra      Tue 2:44 pm 90 °F 90 °    
## 2 Adelaide * Wed 1:14 am 56 °F 56 °    
## 3 Algiers    Tue 3:44 pm 67 °F 67 °
```

```r
w %>%
  mutate(temp_num = str_sub(temp, 1, nchar(temp) -2)) %>% head(3)
```

```
## # A tibble: 3 × 4
##   city       time        temp  temp_num
##   <chr>      <chr>       <chr> <chr>   
## 1 Accra      Tue 2:44 pm 90 °F 90      
## 2 Adelaide * Wed 1:14 am 56 °F 56      
## 3 Algiers    Tue 3:44 pm 67 °F 67 
```

---

## Matching patterns

```r
w %>%
  mutate(temp_num = str_sub(temp, 1, nchar(temp) -3)) %>% head(3)
```

```
## # A tibble: 3 × 4
##   city       time        temp  temp_num
##   <chr>      <chr>       <chr> <chr>   
## 1 Accra      Tue 2:44 pm 90 °F 90      
## 2 Adelaide * Wed 1:14 am 56 °F 56      
## 3 Algiers    Tue 3:44 pm 67 °F 67
```

```r
w %>%
  mutate(temp_num = as.numeric(str_sub(temp, 1, nchar(temp) -3))) 
```

```
## # A tibble: 139 × 4
##    city         time        temp   temp_num
##    <chr>        <chr>       <chr>     <dbl>
##  1 Accra        Tue 2:44 pm 90 °F        90
##  2 Adelaide *   Wed 1:14 am 56 °F        56
##  3 Algiers      Tue 3:44 pm 67 °F        67
##  4 Almaty       Tue 8:44 pm 39 °F        39
##  5 Amman        Tue 4:44 pm 61 °F        61
##  6 Amsterdam    Tue 3:44 pm 41 °F        41
##  7 Anadyr       Wed 2:44 am -10 °F      -10
##  8 Anchorage    Tue 5:44 am 31 °F        31
##  9 Ankara       Tue 5:44 pm 59 °F        59
## 10 Antananarivo Tue 5:44 pm 79 °F        79
## # … with 129 more rows
```

---

## Matching patterns

Or by looking for the number in the string.

```r
w %>%
  mutate(temp_num = str_extract(temp, "^-?\\d+")) 
```

```
## # A tibble: 139 × 4
##    city         time        temp   temp_num
##    <chr>        <chr>       <chr>  <chr>   
##  1 Accra        Tue 2:44 pm 90 °F  90      
##  2 Adelaide *   Wed 1:14 am 56 °F  56      
##  3 Algiers      Tue 3:44 pm 67 °F  67      
##  4 Almaty       Tue 8:44 pm 39 °F  39      
##  5 Amman        Tue 4:44 pm 61 °F  61      
##  6 Amsterdam    Tue 3:44 pm 41 °F  41      
##  7 Anadyr       Wed 2:44 am -10 °F -10     
##  8 Anchorage    Tue 5:44 am 31 °F  31      
##  9 Ankara       Tue 5:44 pm 59 °F  59      
## 10 Antananarivo Tue 5:44 pm 79 °F  79      
## # … with 129 more rows
```

---

## Matching patterns

How to fix the name which has a `*`?

```r
w %>%
  mutate(new_city = str_replace(city, " \\*", ""))
```

```
## # A tibble: 139 × 4
##    city         time        temp   new_city    
##    <chr>        <chr>       <chr>  <chr>       
##  1 Accra        Tue 2:44 pm 90 °F  Accra       
##  2 Adelaide *   Wed 1:14 am 56 °F  Adelaide    
##  3 Algiers      Tue 3:44 pm 67 °F  Algiers     
##  4 Almaty       Tue 8:44 pm 39 °F  Almaty      
##  5 Amman        Tue 4:44 pm 61 °F  Amman       
##  6 Amsterdam    Tue 3:44 pm 41 °F  Amsterdam   
##  7 Anadyr       Wed 2:44 am -10 °F Anadyr      
##  8 Anchorage    Tue 5:44 am 31 °F  Anchorage   
##  9 Ankara       Tue 5:44 pm 59 °F  Ankara      
## 10 Antananarivo Tue 5:44 pm 79 °F  Antananarivo
## # … with 129 more rows
```

---

# Closing thoughts

> **Important Adage #1**:  The perfect is the enemy of the good enough. (Voltaire?)

> **Important Adage #2**:  All models are wrong, but some are useful. (G.E.P. Box 1987)

---

## Computational Statistics

Statistics + Math + Computer Science

* **Hypothesis Testing:**  Permutation tests
* **Confidence Intervals:**  Bootstrapping
* **Parameter Estimation:** The EM algorithm
* **Bayesian Analysis:** Gibbs sampler, Metropolis-Hasting algorithm
* **Polynomial regression:** Smoothing methods (e.g., loess)
* **Prediction**: Supervised learning methods

---

[Introduction to Everyday Information Architecture](https://everydayinformationarchitecture.com/) by Lisa Maria Martin

---
<img src="../images/info_architect_2.jpeg" title="Warning story about a white supremacist with data power." alt="Warning story about a white supremacist with data power." width="90%" style="display: block; margin: auto;" />

---

class: center, middle

# Fin