class: right, top, my-title, title-slide # Extensions ### Jo Hardin ### December 07, 2021 --- # Agenda 12/07/21 1. Return clickers 2. What's next? 3. Course evaluations --- ## Parallel computing Set up the clusters to bootstrap the mean of the penguin `mass_g` (`Chinstrap` only). ```r data(penguins) penguin_bs <- penguins %>% filter(species == "Chinstrap") ``` ```r P <- parallel::detectCores(logical=FALSE) P ``` ``` ## [1] 8 ``` ```r cl <- parallel::makeCluster(P, setup_strategy = "sequential") clusterEvalQ(cl, library("tidyverse")) ``` ``` ## [[1]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[2]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[3]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[4]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[5]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[6]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[7]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ## ## [[8]] ## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr" ## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices" ## [13] "utils" "datasets" "methods" "base" ``` --- ## Bootstrap in parallel ```r doParallel::registerDoParallel(cl) bsmean_mass <- foreach(i = 1:100, .combine = 'c') %dopar% { penguin_bs %>% mutate(mass_bs = sample(body_mass_g, replace = TRUE)) %>% summarize(mean(mass_bs)) %>% pull() } boot_results <- tibble(bsmean_mass) stopCluster(cl) ``` ```r boot_results ``` ``` ## # A tibble: 100 × 1 ## bsmean_mass ## <dbl> ## 1 3730.9 ## 2 3739.0 ## 3 3675.7 ## 4 3712.9 ## 5 3800.4 ## 6 3763.2 ## 7 3665.8 ## 8 3744.9 ## 9 3657.0 ## 10 3735.7 ## # … with 90 more rows ``` --- ## Graphing ```r ggplot(boot_results, aes(x = bsmean_mass)) + geom_histogram(bins = 25) + ggtitle("Histogram of 100 Bootstrapped Means using foreach") ``` ![](2021-12-07-extensions_files/figure-html/unnamed-chunk-7-1.png)<!-- --> --- ## Python in R The main R package which allows Python use in R is called `reticulate`. ```r library(reticulate) use_virtualenv("r-reticulate") reticulate::import("statsmodels") ``` ``` ## Module(statsmodels) ``` --- ## Python code in an RMarkdown chunk .panelset[ .panel[.panel-name[Rmd] ```` ```{python} import pandas flights = pandas.read_csv("flights.csv") flights = flights[flights["dest"] == "ORD"] flights = flights[['carrier', 'dep_delay', 'arr_delay']] flights = flights.dropna() ``` ```` ] .panel[.panel-name[html] A view of the Python chunk which is actually run: ```python import pandas flights = pandas.read_csv("flights.csv") flights = flights[flights["dest"] == "ORD"] flights = flights[['carrier', 'dep_delay', 'arr_delay']] flights = flights.dropna() ``` ]] --- ## Computations using `pandas` .panelset[ .panel[.panel-name[Rmd] ```` ```{python} flights = pandas.read_csv("flights.csv") flights = flights[['carrier', 'dep_delay', 'arr_delay']] flights.groupby("carrier").mean() ``` ```` ] .panel[.panel-name[html] ```python flights = pandas.read_csv("flights.csv") flights = flights[['carrier', 'dep_delay', 'arr_delay']] flights.groupby("carrier").mean() ``` ``` ## dep_delay arr_delay ## carrier ## AA 8.586016 0.364291 ## AS 5.804775 -9.930889 ## DL 9.264505 1.644341 ## UA 12.106073 3.558011 ## US 3.782418 2.129595 ``` ] ] --- ## From Python chunk to R chunk * `py$x` accesses an `x` variable created within Python from R * `r.x` accesses an `x` variable created within R from Python --- ### SQL in R Setting up the SQL connection to an external databse ```r library(RMySQL) library(DBI) con <- dbConnect( MySQL(), host = "XXX", user = "XXX", password = "XXX", dbname = "XXX") ``` --- ## Running SQL code in R ```r dbGetQuery(con, "SELECT subject_race, AVG(subject_age) AS 'ave age' FROM TNnashville GROUP BY subject_race ORDER BY `ave age`") ``` ``` ## subject_race ave age ## 1 hispanic 33.161 ## 2 other 34.497 ## 3 NA 35.162 ## 4 black 36.126 ## 5 unknown 36.259 ## 6 asian/pacific islander 36.874 ## 7 white 38.134 ``` --- ## SQL code in a SQL chunk .panelset[ .panel[.panel-name[Rmd] ```` ```{sql, connection = con} SELECT subject_race, AVG(subject_age) AS 'ave age' FROM TNnashville GROUP BY subject_race ORDER BY `ave age` ``` ```` ] .panel[.panel-name[html] ```sql SELECT subject_race, AVG(subject_age) AS 'ave age' FROM TNnashville GROUP BY subject_race ORDER BY `ave age` ``` <div class="knitsql-table"> Table: 7 records |subject_race | ave age| |:----------------------|-------:| |hispanic | 33.161| |other | 34.497| |NA | 35.162| |black | 36.126| |unknown | 36.259| |asian/pacific islander | 36.874| |white | 38.134| </div> ] .panel[.panel-name[SQL to R] ```` ```{sql, connection = con, output.var = "TN_age"} SELECT subject_race, AVG(subject_age) AS 'ave age' FROM TNnashville GROUP BY subject_race ORDER BY `ave age` ``` ```` ```sql SELECT subject_race, AVG(subject_age) AS 'ave age' FROM TNnashville GROUP BY subject_race ORDER BY `ave age` ``` ] .panel[.panel-name[R obj] (From an R chunk) ```r TN_age ``` ``` ## subject_race ave age ## 1 hispanic 33.161 ## 2 other 34.497 ## 3 NA 35.162 ## 4 black 36.126 ## 5 unknown 36.259 ## 6 asian/pacific islander 36.874 ## 7 white 38.134 ``` ] ] --- ## Regular expressions (strings) > A concise mechanism for describing patterns in character strings * regular expressions * `stringr` --- ## Weather Data <img src="../images/weather.png" title="Screenshot of weather data across the world on Dec 7, 2021." alt="Screenshot of weather data across the world on Dec 7, 2021." width="90%" style="display: block; margin: auto;" /> [https://www.timeanddate.com/weather/](https://www.timeanddate.com/weather/) --- ## Weather Data ```r library(rvest) weather <- read_html("https://www.timeanddate.com/weather/") w <- html_table(html_nodes(weather, "table"))[[1]] ``` (tiny bit of wrangling of columns here) ```r w ``` ``` ## # A tibble: 139 × 3 ## city time temp ## <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F ## 2 Adelaide * Wed 1:14 am 56 °F ## 3 Algiers Tue 3:44 pm 67 °F ## 4 Almaty Tue 8:44 pm 39 °F ## 5 Amman Tue 4:44 pm 61 °F ## 6 Amsterdam Tue 3:44 pm 41 °F ## 7 Anadyr Wed 2:44 am -10 °F ## 8 Anchorage Tue 5:44 am 31 °F ## 9 Ankara Tue 5:44 pm 59 °F ## 10 Antananarivo Tue 5:44 pm 79 °F ## # … with 129 more rows ``` --- ## Matching patterns ```r w %>% filter(str_detect(city, "La")) ``` ``` ## # A tibble: 5 × 3 ## city time temp ## <chr> <chr> <chr> ## 1 La Paz Tue 10:44 am 45 °F ## 2 Lagos Tue 3:44 pm 90 °F ## 3 Lahore Tue 7:44 pm 64 °F ## 4 Las Vegas Tue 6:44 am 55 °F ## 5 Salt Lake City Tue 7:44 am 35 °F ``` --- ## Matching patterns To indicate that the pattern must appear at the start of a character string use `^`. ```r w %>% filter(str_detect(city, "^La")) ``` ``` ## # A tibble: 4 × 3 ## city time temp ## <chr> <chr> <chr> ## 1 La Paz Tue 10:44 am 45 °F ## 2 Lagos Tue 3:44 pm 90 °F ## 3 Lahore Tue 7:44 pm 64 °F ## 4 Las Vegas Tue 6:44 am 55 °F ``` ... or the end of a character string use `$`. ```r w %>% filter(str_detect(city, "La$")) ``` ``` ## # A tibble: 0 × 3 ## # … with 3 variables: city <chr>, time <chr>, temp <chr> ``` --- ## Matching patterns If case doesn't matter use `(?i)`. ```r w %>% filter(str_detect(city, "(?i)La")) ``` ``` ## # A tibble: 16 × 3 ## city time temp ## <chr> <chr> <chr> ## 1 Adelaide * Wed 1:14 am 56 °F ## 2 Atlanta Tue 9:44 am 37 °F ## 3 Auckland * Wed 3:44 am 66 °F ## 4 Casablanca * Tue 3:44 pm 63 °F ## 5 Dallas Tue 8:44 am 38 °F ## 6 Dar es Salaam Tue 5:44 pm 82 °F ## 7 Guatemala City Tue 8:44 am 64 °F ## 8 Islamabad Tue 7:44 pm 64 °F ## 9 Kuala Lumpur Tue 10:44 pm 79 °F ## 10 La Paz Tue 10:44 am 45 °F ## 11 Lagos Tue 3:44 pm 90 °F ## 12 Lahore Tue 7:44 pm 64 °F ## 13 Las Vegas Tue 6:44 am 55 °F ## 14 Manila Tue 10:44 pm 78 °F ## 15 Philadelphia Tue 9:44 am 38 °F ## 16 Salt Lake City Tue 7:44 am 35 °F ``` --- ## Matching patterns Match for Wednesday (only a few places are Wednesday right now at Tuesday, 6:45am PT). ```r w %>% filter(str_detect(time, "Wed")) ``` ``` ## # A tibble: 10 × 3 ## city time temp ## <chr> <chr> <chr> ## 1 Adelaide * Wed 1:14 am 56 °F ## 2 Anadyr Wed 2:44 am -10 °F ## 3 Auckland * Wed 3:44 am 66 °F ## 4 Brisbane Wed 12:44 am 73 °F ## 5 Canberra * Wed 1:44 am 61 °F ## 6 Darwin Wed 12:14 am 81 °F ## 7 Kiritimati Wed 4:44 am 79 °F ## 8 Melbourne * Wed 1:44 am 52 °F ## 9 Suva Wed 2:44 am 78 °F ## 10 Sydney * Wed 1:44 am 68 °F ``` --- ## Matching patterns Or not Wednesday ```r w %>% filter(str_detect(time, "Wed", negate = TRUE)) ``` ``` ## # A tibble: 129 × 3 ## city time temp ## <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F ## 2 Algiers Tue 3:44 pm 67 °F ## 3 Almaty Tue 8:44 pm 39 °F ## 4 Amman Tue 4:44 pm 61 °F ## 5 Amsterdam Tue 3:44 pm 41 °F ## 6 Anchorage Tue 5:44 am 31 °F ## 7 Ankara Tue 5:44 pm 59 °F ## 8 Antananarivo Tue 5:44 pm 79 °F ## 9 Asuncion * Tue 11:44 am 86 °F ## 10 Athens Tue 4:44 pm 60 °F ## # … with 119 more rows ``` --- ## Matching patterns ```r w %>% mutate(temp_num = str_sub(temp, 1, nchar(temp) -1)) %>% head(3) ``` ``` ## # A tibble: 3 × 4 ## city time temp temp_num ## <chr> <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F 90 ° ## 2 Adelaide * Wed 1:14 am 56 °F 56 ° ## 3 Algiers Tue 3:44 pm 67 °F 67 ° ``` ```r w %>% mutate(temp_num = str_sub(temp, 1, nchar(temp) -2)) %>% head(3) ``` ``` ## # A tibble: 3 × 4 ## city time temp temp_num ## <chr> <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F 90 ## 2 Adelaide * Wed 1:14 am 56 °F 56 ## 3 Algiers Tue 3:44 pm 67 °F 67 ``` --- ## Matching patterns ```r w %>% mutate(temp_num = str_sub(temp, 1, nchar(temp) -3)) %>% head(3) ``` ``` ## # A tibble: 3 × 4 ## city time temp temp_num ## <chr> <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F 90 ## 2 Adelaide * Wed 1:14 am 56 °F 56 ## 3 Algiers Tue 3:44 pm 67 °F 67 ``` ```r w %>% mutate(temp_num = as.numeric(str_sub(temp, 1, nchar(temp) -3))) ``` ``` ## # A tibble: 139 × 4 ## city time temp temp_num ## <chr> <chr> <chr> <dbl> ## 1 Accra Tue 2:44 pm 90 °F 90 ## 2 Adelaide * Wed 1:14 am 56 °F 56 ## 3 Algiers Tue 3:44 pm 67 °F 67 ## 4 Almaty Tue 8:44 pm 39 °F 39 ## 5 Amman Tue 4:44 pm 61 °F 61 ## 6 Amsterdam Tue 3:44 pm 41 °F 41 ## 7 Anadyr Wed 2:44 am -10 °F -10 ## 8 Anchorage Tue 5:44 am 31 °F 31 ## 9 Ankara Tue 5:44 pm 59 °F 59 ## 10 Antananarivo Tue 5:44 pm 79 °F 79 ## # … with 129 more rows ``` --- ## Matching patterns Or by looking for the number in the string. ```r w %>% mutate(temp_num = str_extract(temp, "^-?\\d+")) ``` ``` ## # A tibble: 139 × 4 ## city time temp temp_num ## <chr> <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F 90 ## 2 Adelaide * Wed 1:14 am 56 °F 56 ## 3 Algiers Tue 3:44 pm 67 °F 67 ## 4 Almaty Tue 8:44 pm 39 °F 39 ## 5 Amman Tue 4:44 pm 61 °F 61 ## 6 Amsterdam Tue 3:44 pm 41 °F 41 ## 7 Anadyr Wed 2:44 am -10 °F -10 ## 8 Anchorage Tue 5:44 am 31 °F 31 ## 9 Ankara Tue 5:44 pm 59 °F 59 ## 10 Antananarivo Tue 5:44 pm 79 °F 79 ## # … with 129 more rows ``` --- ## Matching patterns How to fix the name which has a `*`? ```r w %>% mutate(new_city = str_replace(city, " \\*", "")) ``` ``` ## # A tibble: 139 × 4 ## city time temp new_city ## <chr> <chr> <chr> <chr> ## 1 Accra Tue 2:44 pm 90 °F Accra ## 2 Adelaide * Wed 1:14 am 56 °F Adelaide ## 3 Algiers Tue 3:44 pm 67 °F Algiers ## 4 Almaty Tue 8:44 pm 39 °F Almaty ## 5 Amman Tue 4:44 pm 61 °F Amman ## 6 Amsterdam Tue 3:44 pm 41 °F Amsterdam ## 7 Anadyr Wed 2:44 am -10 °F Anadyr ## 8 Anchorage Tue 5:44 am 31 °F Anchorage ## 9 Ankara Tue 5:44 pm 59 °F Ankara ## 10 Antananarivo Tue 5:44 pm 79 °F Antananarivo ## # … with 129 more rows ``` --- # Closing thoughts > **Important Adage #1**: The perfect is the enemy of the good enough. (Voltaire?) > **Important Adage #2**: All models are wrong, but some are useful. (G.E.P. Box 1987) --- ## Computational Statistics Statistics + Math + Computer Science * **Hypothesis Testing:** Permutation tests * **Confidence Intervals:** Bootstrapping * **Parameter Estimation:** The EM algorithm * **Bayesian Analysis:** Gibbs sampler, Metropolis-Hasting algorithm * **Polynomial regression:** Smoothing methods (e.g., loess) * **Prediction**: Supervised learning methods --- <img src="../images/info_architect_1.jpeg" title="Warning story about a white supremacist with data power." alt="Warning story about a white supremacist with data power." width="90%" style="display: block; margin: auto;" /> [Introduction to Everyday Information Architecture](https://everydayinformationarchitecture.com/) by Lisa Maria Martin --- <img src="../images/info_architect_2.jpeg" title="Warning story about a white supremacist with data power." alt="Warning story about a white supremacist with data power." width="90%" style="display: block; margin: auto;" /> --- class: center, middle # Fin