class: right, top, my-title, title-slide # Working with Data in R ### Jo Hardin ### September 14 & 16, 2021 --- <style type="text/css"> /* custom.css */ .remark-slide-content { font-size: 16px; } </style> Much of this material can be found at: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html --- # Agenda 9/14/21 1. Tidy data 2. Data verbs --- # The data ## What does a data set *look like*? - **Observations** down the rows - **Variables** across the columns - Flat file versus relational database. --- ## Active Duty Military The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was "an individual in the armed forces." What variables would you use to capture the information in the following table? https://docs.google.com/spreadsheets/d/1Ow6Cm4z-Z1Yybk3i352msulYCEDOUaOghmo9ALajyHo/edit#gid=1811988794 <img src="../images/activedutyTidy.png" width="100%" style="display: block; margin: auto;" /> --- ## Tidy packages: the tidyverse <div class="figure"> <img src="../images/tidyverse.png" alt="Image of hex stickers for the eight core tidyverse packages including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats." width="4107" /> <p class="caption">image credit: https://www.tidyverse.org/.</p> </div> --- ## Reading in data from a file Hosted online: ```r movies <- read_csv("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/movies2.csv") ``` Hosted locally: ```r movies <- read_csv("movies2.csv") ``` Things to note: - The assign arrow is used to create **objects** in R, which are stored in your environment. - Object names don't have to correspond to file names. - Be sure R knows where to look for the file! --- ## Viewing data - the viewer / Environment - `View()` can be used in RStudio to bring up an excel-style spreadsheet. Only for viewing, not editing! - The dimensions of the data can be found in the environment pane. 
- The names of the variables can be seen at the top of the viewer. - `View()` has a capital letter `V` - `View()` should not be used in a Markdown or Sweave document --- ## Viewing data - inside .Rmd / the console - `head()` can be used to print the first several lines of the dataset to the console. - `dim()` can be used to find the dimensions (rows then columns). - `names()` can be used to find the names of the variables. --- ## Practice Running into problems? Ask your neighbor or try Google! 1. What are the dimensions of the data set? 2. What appears to be the unit of observation? 3. What are the variables? ```r dim(movies) ``` ``` ## [1] 134 5 ``` ```r head(movies,3) ``` ``` ## # A tibble: 3 × 5 ## ...1 score2 rating2 genre2 `box office2` ## <chr> <dbl> <chr> <chr> <dbl> ## 1 2 Fast 2 Furious 48.9 PG-13 action 127. ## 2 28 Days Later 78.2 R horror 45.1 ## 3 A Guy Thing 39.5 PG-13 rom comedy 15.5 ``` ```r names(movies) ``` ``` ## [1] "...1" "score2" "rating2" "genre2" "box office2" ``` --- ## Reading in data from a package For now, we'll work with all flights out of the three NYC airports in 2013. 1. Download and install the package from CRAN (done in the Console, only once). ```r install.packages("nycflights13") ``` 2. Load the package (in the .Rmd file; the package must be loaded there for the .Rmd file to compile). ```r library(nycflights13) ``` 3. Make the data set visible. ```r data(flights) ``` 4. Get help. ```r ?flights ``` --- # Slice and dice with dplyr (a package within the tidyverse) ## **dplyr** > Whenever you're learning a new tool, for a long time you're going to suck ... > but the good news is that is typical, that's something that happens to everyone, > and it's only temporary. -Hadley Wickham --- ## Why dplyr? Data sets are often of high *volume* (lots of rows) and high *variety* (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. 
We also often need to perform operations to organize and clean our data. This is all possible in base R, but with `dplyr`, it is simple, readable, and fast. --- ## Some Basic Verbs - `filter()` - `arrange()` - `select()` - `distinct()` - `mutate()` - `summarize()` - `sample_n()` --- ## `filter()` Allows you to select a subset of the **rows** of a data frame. The first argument is the name of the data frame; the following arguments are the filters that you'd like to apply. For all flights on January 1st: ```r filter(flights, month == 1, day == 1) ``` ``` ## # A tibble: 842 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Constructing filters Filters are constructed of **logical operators**: `<`, `>`, `<=`, `>=`, `==`, `!=` (and some others). Adding them one by one to `filter()` is akin to saying "this AND that". To say "this OR that OR both", use `|`. 
```r filter(flights, month == 1 | month == 2) ``` ``` ## # A tibble: 51,955 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 51,945 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Practice Construct filters to isolate: 1. Flights that left on St. Patrick's Day. 2. Flights that were destined for Chicago's primary airport. 3. Flights that were destined for Chicago's primary airport and were operated by United Airlines. 4. Flights that traveled more than 2000 miles or that were in the air more than 5 hours. --- 1. Flights that left on St. Patrick's Day. 2. Flights that were destined for Chicago's primary airport. 3. Flights that were destined for Chicago's primary airport and were operated by United Airlines. 4. Flights that traveled more than 2000 miles or that were in the air more than 5 hours. ```r filter(flights, month == 3, day == 17) filter(flights, dest == "ORD") filter(flights, dest == "ORD", carrier == "UA") filter(flights, distance > 2000 | air_time > 5*60) ``` --- ## `arrange()` `arrange()` reorders the rows: It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns: ```r arrange(flights, year, month, day) ``` Use `desc()` to sort in descending order. 
```r arrange(flights, desc(arr_delay)) ``` --- ## `select()` Often you work with large datasets with many columns where only a few are actually of interest to you. `select()` allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions: ```r select(flights, year, month, day) ``` You can exclude columns using `-` and specify a range using `:`. ```r select(flights, -(year:day)) ``` --- ## `distinct()` A common use of `select()` is to find out which values a set of variables takes. This is particularly useful in conjunction with the `distinct()` verb which only returns the unique values in a table. What do the following data correspond to? ```r distinct(select(flights, origin, dest)) ``` ``` ## # A tibble: 224 × 2 ## origin dest ## <chr> <chr> ## 1 EWR IAH ## 2 LGA IAH ## 3 JFK MIA ## 4 JFK BQN ## 5 LGA ATL ## 6 EWR ORD ## 7 EWR FLL ## 8 LGA IAD ## 9 JFK MCO ## 10 LGA ORD ## # … with 214 more rows ``` --- ## `mutate()` As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`: ```r select(mutate(flights, gain = dep_delay - arr_delay), flight, dep_delay, arr_delay, gain) ``` ``` ## # A tibble: 336,776 × 4 ## flight dep_delay arr_delay gain ## <int> <dbl> <dbl> <dbl> ## 1 1545 2 11 -9 ## 2 1714 4 20 -16 ## 3 1141 2 33 -31 ## 4 725 -1 -18 17 ## 5 461 -6 -25 19 ## 6 1696 -4 12 -16 ## 7 507 -5 19 -24 ## 8 5708 -3 -14 11 ## 9 79 -3 -8 5 ## 10 301 -2 8 -10 ## # … with 336,766 more rows ``` --- ## `summarize()` and `sample_n()` `summarize()` collapses a data frame to a single row. It's not very useful yet. `sample_n()` provides you with a random sample of the rows. 
```r summarize(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## delay ## <dbl> ## 1 12.6 ``` ```r sample_n(flights, 10) ``` ``` ## # A tibble: 10 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 2 5 1749 1753 -4 2057 2105 ## 2 2013 8 27 1004 955 9 1246 1310 ## 3 2013 7 9 1149 1200 -11 1308 1333 ## 4 2013 5 3 NA 729 NA NA 1020 ## 5 2013 2 8 NA 2035 NA NA 2142 ## 6 2013 8 7 642 645 -3 913 934 ## 7 2013 10 27 759 800 -1 1138 1155 ## 8 2013 2 25 558 600 -2 655 708 ## 9 2013 1 4 754 755 -1 956 1030 ## 10 2013 3 20 1155 1135 20 1516 1449 ## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Practice **Mutate** the data to create a new column that contains the average speed traveled by the plane for each flight. **Select** the new variable and save it, along with tailnum, as a new data frame object. --- ## Practice **Mutate** the data to create a new column that contains the average speed traveled by the plane for each flight. **Select** the new variable and save it, along with tailnum, as a new data frame object. ```r flights2 <- mutate(flights, speed = distance/(air_time/60)) speed_data <- select(flights2, tailnum, speed) ``` --- ## `group_by()` `summarize()` and `sample_n()` are even more powerful when combined with the idea of "group by", repeating the operation separately on groups of observations within the dataset. The `group_by()` function describes how to break a dataset down into groups of rows. --- ## `group_by()` Find the fastest airplanes in the bunch, measured as the average speed per airplane. 
```r by_tailnum <- group_by(speed_data, tailnum) avg_speed <- summarize(by_tailnum, count = n(), avg_speed = mean(speed, na.rm = TRUE)) arrange(avg_speed, desc(avg_speed)) ``` ``` ## # A tibble: 4,044 × 3 ## tailnum count avg_speed ## <chr> <int> <dbl> ## 1 N228UA 1 501. ## 2 N315AS 1 499. ## 3 N654UA 1 499. ## 4 N819AW 1 490. ## 5 N382HA 26 486. ## 6 N388HA 36 484. ## 7 N391HA 21 484. ## 8 N777UA 1 483. ## 9 N385HA 28 483. ## 10 N392HA 13 482. ## # … with 4,034 more rows ``` --- ## Chaining Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the `%>%` operator. You start the pipeline with a data frame and then pass it to each function in turn. The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example: `x %>% f(y)` is the same as `f(x, y)` ` y %>% f(x, ., z)` is the same as `f(x,y,z)` --- ## Mornings ``` step1 <- dress(me, what = sports) step2 <- exercise(step1, how = running) step3 <- eat(step2, choice = cereal) step4 <- dress(step3, what = school) step5 <- commute(step4, transportation = bike) ``` --- ## Mornings ``` commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike) ``` --- ## Morning (better??) ``` commute( dress( eat( exercise( dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike) ``` --- ## Mornings ``` me %>% dress(what = sports) %>% exercise(how = running) %>% eat(choice = cereal) %>% dress(what = school) %>% commute(transportation = bike) ``` --- ## Mornings ``` me %>% dress(what = sports) %>% exercise(how = running) %>% eat(choice = cereal) %>% dress(what = school) %>% commute(transportation = bike) ``` The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. 
For example: `x %>% f(y)` is the same as `f(x, y)` `y %>% f(x, ., z)` is the same as `f(x,y,z)` --- #### Little Bunny Foo Foo From Hadley Wickham, how to think about tidy data. > Little bunny Foo Foo > Went hopping through the forest > Scooping up the field mice > And bopping them on the head --- #### Little Bunny Foo Foo The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way. ``` foo_foo <- little_bunny() foo_foo_1 <- hop(foo_foo, through = forest) foo_foo_2 <- scoop(foo_foo_1, up = field_mice) foo_foo_3 <- bop(foo_foo_2, on = head) ``` --- #### Little Bunny Foo Foo Another approach is to nest the function calls so that there is only one output. ``` bop( scoop( hop(foo_foo, through = forest), up = field_mice), on = head) ``` --- #### Little Bunny Foo Foo Or even worse, as one line: ``` bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head) ``` --- #### Little Bunny Foo Foo Instead, the code can be written using the pipe in the **order** in which the functions are evaluated: ``` foo_foo %>% hop(through = forest) %>% scoop(up = field_mice) %>% bop(on = head) ``` --- #### Little Bunny Foo Foo Instead, the code can be written using the pipe in the **order** in which the functions are evaluated: ``` foo_foo %>% hop(through = forest) %>% scoop(up = field_mice) %>% bop(on = head) ``` The pipe syntax (`%>%`) takes a data frame and sends it to an argument of a function. The mapping goes to the first available argument in the function. For example: `x %>% f(y)` is the same as `f(x, y)` `y %>% f(x, ., z)` is the same as `f(x,y,z)` --- ```r flights2 %>% select(tailnum, speed) %>% group_by(tailnum) %>% summarize(number = n(), avg_speed = mean(speed, na.rm = TRUE)) %>% arrange(desc(avg_speed)) ``` ``` ## # A tibble: 4,044 × 3 ## tailnum number avg_speed ## <chr> <int> <dbl> ## 1 N228UA 1 501. ## 2 N315AS 1 499. ## 3 N654UA 1 499. ## 4 N819AW 1 490. ## 5 N382HA 26 486. ## 6 N388HA 36 484. 
## 7 N391HA 21 484. ## 8 N777UA 1 483. ## 9 N385HA 28 483. ## 10 N392HA 13 482. ## # … with 4,034 more rows ``` --- ## Practice Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays? --- ## Practice Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays? ```r flights %>% group_by(carrier) %>% summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% arrange(desc(avg_delay)) ``` ``` ## # A tibble: 16 × 2 ## carrier avg_delay ## <chr> <dbl> ## 1 F9 20.2 ## 2 EV 20.0 ## 3 YV 19.0 ## 4 FL 18.7 ## 5 WN 17.7 ## 6 9E 16.7 ## 7 B6 13.0 ## 8 VX 12.9 ## 9 OO 12.6 ## 10 UA 12.1 ## 11 MQ 10.6 ## 12 DL 9.26 ## 13 AA 8.59 ## 14 AS 5.80 ## 15 HA 4.90 ## 16 US 3.78 ``` --- ## Practice again Say you're curious about the relationship between the number of flights each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset. --- ## Practice again ```r delay_data <- flights %>% group_by(tailnum) %>% summarize(number = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>% filter(number > 20, dist < 2000) ``` Say you're curious about the relationship between the number of flights each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset. 
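---

## Chaining on a toy data set

To check a chain like the one above without loading `nycflights13`, it can help to run the same verbs on a few hand-made rows. The sketch below is only an illustration: `toy_flights` and its values are made up, and the cutoff `number > 1` stands in for `number > 20` because the toy table is tiny. It also compares the chained version with the equivalent nested calls.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights` (all values are made up).
toy_flights <- tibble(
  tailnum   = c("A1", "A1", "A1", "B2", "B2", "C3"),
  distance  = c(500, 700, 600, 2500, 2700, 300),
  arr_delay = c(10, NA, -4, 30, 20, 5)
)

# Chained version: group by plane, summarize, then keep the planes
# that meet both conditions.
chained <- toy_flights %>%
  group_by(tailnum) %>%
  summarize(number = n(),
            dist   = mean(distance, na.rm = TRUE),
            delay  = mean(arr_delay, na.rm = TRUE)) %>%
  filter(number > 1, dist < 2000)

# Nested version of the same computation, read from the inside out.
nested <- filter(
  summarize(group_by(toy_flights, tailnum),
            number = n(),
            dist   = mean(distance, na.rm = TRUE),
            delay  = mean(arr_delay, na.rm = TRUE)),
  number > 1, dist < 2000
)

chained
all.equal(chained, nested)
```

Only plane `A1` should survive both conditions: `B2` averages more than 2000 miles and `C3` has a single flight.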
--- ## Visualizing the data .pull-left[ ![](2021-09-14-wrangling_files/figure-html/unnamed-chunk-25-1.png)<!-- --> ] .pull-right[ ```r delay_data %>% ggplot(aes(dist, delay)) + geom_point(aes(size = number), alpha = 1/2) + geom_smooth() + scale_size_area() ``` When `scale_size_area()` is used, the default behavior is to scale the area of points to be proportional to the value. ] --- # Agenda 9/16/21 1. Relational data (`_join`) 2. `pivot`ing 3. `map`ping 4. **lubridate** --- ## Relational data (multiple data frames) <img src="../images/dplyr-joins.png" width="70%" style="display: block; margin: auto;" /> See the [RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/) --- ## Joining two (or more) dataframes: * <mark>`left_join`</mark> returns all rows from the left table, and any rows with matching keys from the right table. * <mark>`inner_join`</mark> returns only the rows in which the left table has matching keys in the right table (i.e., matching rows in both sets). * <mark>`full_join`</mark> returns all rows from both tables, joining records from the left table that have matching keys in the right table. Good practice: always specify the `by` argument when joining data frames. 
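---

## Specifying `by` when key names differ

One reason to spell out `by`: the key columns do not always share a name. The following is a minimal sketch with made-up tables (`students`, `grades`, and their columns are hypothetical). It uses the named-vector form `by = c("id" = "student_id")` to say which column in the left table matches which column in the right table.

```r
library(dplyr)

# Hypothetical tables: the key is `id` on the left and `student_id` on the right.
students <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades   <- tibble(student_id = c(1, 2, 4), grade = c("A", "B", "C"))

# The same key mapping is used for all three joins.
key <- c("id" = "student_id")

left  <- left_join(students, grades, by = key)   # every student, grade may be NA
inner <- inner_join(students, grades, by = key)  # only ids present in both tables
full  <- full_join(students, grades, by = key)   # every id from either table

# Compare how many rows each join keeps.
c(left = nrow(left), inner = nrow(inner), full = nrow(full))
```

The joined key keeps the left table's name (`id`), so the result has the columns `id`, `name`, and `grade`.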
--- ## Women in Science 10 women in science who changed the world (source: Discover Magazine) <table> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> profession </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Ada Lovelace </td> <td style="text-align:left;"> Mathematician </td> </tr> <tr> <td style="text-align:left;"> Marie Curie </td> <td style="text-align:left;"> Physicist and Chemist </td> </tr> <tr> <td style="text-align:left;"> Janaki Ammal </td> <td style="text-align:left;"> Botanist </td> </tr> <tr> <td style="text-align:left;"> Chien-Shiung Wu </td> <td style="text-align:left;"> Physicist </td> </tr> <tr> <td style="text-align:left;"> Katherine Johnson </td> <td style="text-align:left;"> Mathematician </td> </tr> <tr> <td style="text-align:left;"> Rosalind Franklin </td> <td style="text-align:left;"> Chemist </td> </tr> <tr> <td style="text-align:left;"> Vera Rubin </td> <td style="text-align:left;"> Astronomer </td> </tr> <tr> <td style="text-align:left;"> Gladys West </td> <td style="text-align:left;"> Mathematician </td> </tr> <tr> <td style="text-align:left;"> Flossie Wong-Staal </td> <td style="text-align:left;"> Virologist and Molecular Biologist </td> </tr> <tr> <td style="text-align:left;"> Jennifer Doudna </td> <td style="text-align:left;"> Biochemist </td> </tr> </tbody> </table> .footnote[Example and gifs from [DataScience in a Box](https://datasciencebox.org/)] --- ## Inputs Information on women scientists .panelset[ .panel[.panel-name[professions] ```r professions ``` ``` ## # A tibble: 10 × 2 ## name profession ## <chr> <chr> ## 1 Ada Lovelace Mathematician ## 2 Marie Curie Physicist and Chemist ## 3 Janaki Ammal Botanist ## 4 Chien-Shiung Wu Physicist ## 5 Katherine Johnson Mathematician ## 6 Rosalind Franklin Chemist ## 7 Vera Rubin Astronomer ## 8 Gladys West Mathematician ## 9 Flossie Wong-Staal Virologist and Molecular Biologist ## 10 Jennifer Doudna Biochemist ``` ] 
.panel[.panel-name[dates] ```r dates ``` ``` ## # A tibble: 8 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Rosalind Franklin 1920 1958 ## 5 Vera Rubin 1928 2016 ## 6 Gladys West 1930 NA ## 7 Flossie Wong-Staal 1947 NA ## 8 Jennifer Doudna 1964 NA ``` ] .panel[.panel-name[works] ```r works ``` ``` ## # A tibble: 9 × 2 ## name known_for ## <chr> <chr> ## 1 Ada Lovelace first computer algorithm ## 2 Marie Curie theory of radioactivity, discovery of elements polonium a… ## 3 Janaki Ammal hybrid species, biodiversity protection ## 4 Chien-Shiung Wu confim and refine theory of radioactive beta decy, Wu expe… ## 5 Katherine Johnson calculations of orbital mechanics critical to sending the … ## 6 Vera Rubin existence of dark matter ## 7 Gladys West mathematical modeling of the shape of the Earth which serv… ## 8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes… ## 9 Jennifer Doudna one of the primary developers of CRISPR, a ground-breaking… ``` ] ] --- ## Desired output We'd like to put together the data to look like: ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematician NA NA first computer algo… ## 2 Marie Curie Physicist and … NA NA theory of radioacti… ## 3 Janaki Ammal Botanist 1897 1984 hybrid species, bio… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim and refine t… ## 5 Katherine Johnson Mathematician 1918 2020 calculations of orb… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existence of dark m… ## 8 Gladys West Mathematician 1930 NA mathematical modeli… ## 9 Flossie Wong-Staal Virologist and… 1947 NA first scientist to … ## 10 Jennifer Doudna Biochemist 1964 NA one of the primary … ``` --- ## Inputs, reminder .pull-left[ ```r nrow(professions) ``` ``` ## [1] 10 ``` ```r nrow(dates) ``` ``` ## [1] 8 ``` 
```r nrow(works) ``` ``` ## [1] 9 ``` ] .pull-right[ ```r names(professions) ``` ``` ## [1] "name" "profession" ``` ```r names(dates) ``` ``` ## [1] "name" "birth_year" "death_year" ``` ```r names(works) ``` ``` ## [1] "name" "known_for" ``` ] --- ## Setup For the next few slides... .pull-left[ ```r x ``` ``` ## # A tibble: 3 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ] .pull-right[ ```r y ``` ``` ## # A tibble: 3 × 2 ## id value_y ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] --- ## `left_join()` .pull-left[ <img src="../images/left-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r left_join(x, y, by = "id") ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] --- ## `left_join()` ```r professions %>% left_join(dates, by = "name") ``` ``` ## # A tibble: 10 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Ada Lovelace Mathematician NA NA ## 2 Marie Curie Physicist and Chemist NA NA ## 3 Janaki Ammal Botanist 1897 1984 ## 4 Chien-Shiung Wu Physicist 1912 1997 ## 5 Katherine Johnson Mathematician 1918 2020 ## 6 Rosalind Franklin Chemist 1920 1958 ## 7 Vera Rubin Astronomer 1928 2016 ## 8 Gladys West Mathematician 1930 NA ## 9 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA ## 10 Jennifer Doudna Biochemist 1964 NA ``` --- ## `right_join()` .pull-left[ <img src="../images/right-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r right_join(x, y, by = "id") ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] --- ## `right_join()` ```r professions %>% right_join(dates, by = "name") ``` ``` ## # A tibble: 8 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Janaki Ammal Botanist 1897 1984 ## 2 Chien-Shiung Wu Physicist 1912 1997 ## 3 Katherine Johnson Mathematician 1918 2020 ## 4 Rosalind 
Franklin Chemist 1920 1958 ## 5 Vera Rubin Astronomer 1928 2016 ## 6 Gladys West Mathematician 1930 NA ## 7 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA ## 8 Jennifer Doudna Biochemist 1964 NA ``` --- ## `full_join()` .pull-left[ <img src="../images/full-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r full_join(x, y, by = "id") ``` ``` ## # A tibble: 4 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] --- ## `full_join()` ```r dates %>% full_join(works, by = "name") ``` ``` ## # A tibble: 10 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec… ## 2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioac… ## 3 Katherine Johnson 1918 2020 calculations of orbital mechanics c… ## 4 Rosalind Franklin 1920 1958 <NA> ## 5 Vera Rubin 1928 2016 existence of dark matter ## 6 Gladys West 1930 NA mathematical modeling of the shape … ## 7 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cr… ## 8 Jennifer Doudna 1964 NA one of the primary developers of CR… ## 9 Ada Lovelace NA NA first computer algorithm ## 10 Marie Curie NA NA theory of radioactivity, discovery… ``` --- ## `inner_join()` .pull-left[ <img src="../images/inner-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r inner_join(x, y, by = "id") ``` ``` ## # A tibble: 2 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] --- ## `inner_join()` ```r dates %>% inner_join(works, by = "name") ``` ``` ## # A tibble: 7 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect… ## 2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioact… ## 3 Katherine Johnson 1918 2020 calculations of orbital mechanics cr… ## 4 Vera Rubin 1928 2016 existence of dark matter ## 5 Gladys West 1930 NA 
mathematical modeling of the shape o… ## 6 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cre… ## 7 Jennifer Doudna 1964 NA one of the primary developers of CRI… ``` --- ## `semi_join()` .pull-left[ <img src="../images/semi-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r semi_join(x, y, by = "id") ``` ``` ## # A tibble: 2 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] --- ## `semi_join()` ```r dates %>% semi_join(works, by = "name") ``` ``` ## # A tibble: 7 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Vera Rubin 1928 2016 ## 5 Gladys West 1930 NA ## 6 Flossie Wong-Staal 1947 NA ## 7 Jennifer Doudna 1964 NA ``` --- ## `anti_join()` .pull-left[ <img src="../images/anti-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r anti_join(x, y, by = "id") ``` ``` ## # A tibble: 1 × 2 ## id value_x ## <dbl> <chr> ## 1 3 x3 ``` ] --- ## `anti_join()` ```r dates %>% anti_join(works, by = "name") ``` ``` ## # A tibble: 1 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Rosalind Franklin 1920 1958 ``` --- ## Putting it all together ```r professions %>% left_join(dates, by = "name") %>% left_join(works, by = "name") ``` ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematician NA NA first computer algo… ## 2 Marie Curie Physicist and … NA NA theory of radioacti… ## 3 Janaki Ammal Botanist 1897 1984 hybrid species, bio… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim and refine t… ## 5 Katherine Johnson Mathematician 1918 2020 calculations of orb… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existence of dark m… ## 8 Gladys West Mathematician 1930 NA mathematical modeli… ## 9 Flossie Wong-Staal Virologist and… 1947 NA first scientist to … ## 10 Jennifer Doudna 
Biochemist 1964 NA one of the primary … ``` --- ### From wide to long and long to wide * <mark>`pivot_longer()`</mark> makes the dataframe "longer" -- many columns into a few columns (more rows): `pivot_longer(data, cols, names_to = , values_to = )` * <mark>`pivot_wider()`</mark> makes the dataframe "wider" -- a few columns into many columns (fewer rows): `pivot_wider(data, names_from = , values_from = )` <img src="../images/pivotlonger.png" width="47%" style="display: block; margin: auto;" /><img src="../images/pivotwider.png" width="47%" style="display: block; margin: auto;" /> --- ### `pivot_longer` `pivot_longer` will be demonstrated using datasets that come from Gapminder. The first represents country, year, and <mark>female literacy rate</mark>. ```r library(googlesheets4) gs4_deauth() litF <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0") head(litF) ``` ``` ## # A tibble: 6 × 38 ## `Adult (15+) literacy… `1975` `1976` `1977` `1978` `1979` `1980` `1981` `1982` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanistan NA NA NA NA 4.99 NA NA NA ## 2 Albania NA NA NA NA NA NA NA NA ## 3 Algeria NA NA NA NA NA NA NA NA ## 4 Andorra NA NA NA NA NA NA NA NA ## 5 Angola NA NA NA NA NA NA NA NA ## 6 Anguilla NA NA NA NA NA NA NA NA ## # … with 29 more variables: 1983 <dbl>, 1984 <dbl>, 1985 <dbl>, 1986 <dbl>, ## # 1987 <dbl>, 1988 <dbl>, 1989 <dbl>, 1990 <dbl>, 1991 <dbl>, 1992 <dbl>, ## # 1993 <dbl>, 1994 <dbl>, 1995 <dbl>, 1996 <dbl>, 1997 <dbl>, 1998 <dbl>, ## # 1999 <dbl>, 2000 <dbl>, 2001 <dbl>, 2002 <dbl>, 2003 <dbl>, 2004 <dbl>, ## # 2005 <dbl>, 2006 <dbl>, 2007 <dbl>, 2008 <dbl>, 2009 <dbl>, 2010 <dbl>, ## # 2011 <dbl> ``` --- ### `pivot_longer` ```r litF <- litF %>% select(country=starts_with("Adult"), starts_with("1"), starts_with("2")) %>% pivot_longer(cols = -country, names_to = "year", values_to = "litRateF") %>% filter(!is.na(litRateF)) litF ``` ``` ## # A tibble: 571 × 3 ## country 
year litRateF ## <chr> <chr> <dbl> ## 1 Afghanistan 1979 4.99 ## 2 Afghanistan 2011 13 ## 3 Albania 2001 98.3 ## 4 Albania 2008 94.7 ## 5 Albania 2011 95.7 ## 6 Algeria 1987 35.8 ## 7 Algeria 2002 60.1 ## 8 Algeria 2006 63.9 ## 9 Angola 2001 54.2 ## 10 Angola 2011 58.6 ## # … with 561 more rows ``` --- ### `pivot_longer` The second dataset (GDP) represents country, year, and <mark>gdp</mark> (in fixed 2000 US$). ```r GDP <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0") GDP <- GDP %>% select(country = starts_with("Income"), starts_with("1"), starts_with("2")) %>% pivot_longer(cols = -country, names_to = "year", values_to = "gdp") %>% filter(!is.na(gdp)) GDP ``` ``` ## # A tibble: 7,988 × 3 ## country year gdp ## <chr> <chr> <dbl> ## 1 Albania 1980 1061. ## 2 Albania 1981 1100. ## 3 Albania 1982 1111. ## 4 Albania 1983 1101. ## 5 Albania 1984 1065. ## 6 Albania 1985 1060. ## 7 Albania 1986 1092. ## 8 Albania 1987 1054. ## 9 Albania 1988 1014. ## 10 Albania 1989 1092. ## # … with 7,978 more rows ``` --- ### `pivot_wider` `pivot_wider` will be demonstrated using the babynames dataset. 
```r library(babynames) babynames ``` ``` ## # A tibble: 1,924,665 × 5 ## year sex name n prop ## <dbl> <chr> <chr> <int> <dbl> ## 1 1880 F Mary 7065 0.0724 ## 2 1880 F Anna 2604 0.0267 ## 3 1880 F Emma 2003 0.0205 ## 4 1880 F Elizabeth 1939 0.0199 ## 5 1880 F Minnie 1746 0.0179 ## 6 1880 F Margaret 1578 0.0162 ## 7 1880 F Ida 1472 0.0151 ## 8 1880 F Alice 1414 0.0145 ## 9 1880 F Bertha 1320 0.0135 ## 10 1880 F Sarah 1288 0.0132 ## # … with 1,924,655 more rows ``` --- ```r babynames %>% select(-prop) %>% pivot_wider(names_from = sex, values_from = n) ``` ``` ## # A tibble: 1,756,284 × 4 ## year name F M ## <dbl> <chr> <int> <int> ## 1 1880 Mary 7065 27 ## 2 1880 Anna 2604 12 ## 3 1880 Emma 2003 10 ## 4 1880 Elizabeth 1939 9 ## 5 1880 Minnie 1746 9 ## 6 1880 Margaret 1578 NA ## 7 1880 Ida 1472 8 ## 8 1880 Alice 1414 NA ## 9 1880 Bertha 1320 NA ## 10 1880 Sarah 1288 NA ## # … with 1,756,274 more rows ``` --- ```r babynames %>% select(-prop) %>% pivot_wider(names_from = sex, values_from = n) %>% filter(!is.na(F), !is.na(M)) %>% arrange(desc(year), desc(M)) ``` ``` ## # A tibble: 168,381 × 4 ## year name F M ## <dbl> <chr> <int> <int> ## 1 2017 Liam 36 18728 ## 2 2017 Noah 170 18326 ## 3 2017 William 18 14904 ## 4 2017 James 77 14232 ## 5 2017 Logan 1103 13974 ## 6 2017 Benjamin 8 13733 ## 7 2017 Mason 58 13502 ## 8 2017 Elijah 26 13268 ## 9 2017 Oliver 15 13141 ## 10 2017 Jacob 16 13106 ## # … with 168,371 more rows ``` --- ```r babynames %>% pivot_wider(names_from = sex, values_from = n) %>% mutate(maxcount = pmax(F,M, na.rm = TRUE)) %>% arrange(desc(maxcount)) ``` ``` ## # A tibble: 1,924,653 × 6 ## year name prop F M maxcount ## <dbl> <chr> <dbl> <int> <int> <int> ## 1 1947 Linda 0.0548 99686 NA 99686 ## 2 1948 Linda 0.0552 96209 NA 96209 ## 3 1947 James 0.0510 NA 94756 94756 ## 4 1957 Michael 0.0424 NA 92695 92695 ## 5 1947 Robert 0.0493 NA 91642 91642 ## 6 1949 Linda 0.0518 91016 NA 91016 ## 7 1956 Michael 0.0423 NA 90620 90620 ## 8 1958 Michael 0.0420 NA 90520 
90520
##  9  1948 James   0.0497    NA 88588    88588
## 10  1954 Michael 0.0428    NA 88514    88514
## # … with 1,924,643 more rows
```

---

## Practice

`litF` and `GDP` from Gapminder.

### left

```r
litGDPleft <- left_join(litF, GDP, by=c("country", "year"))
dim(litGDPleft)
```

```
## [1] 571   4
```

```r
litGDPleft
```

```
## # A tibble: 571 × 4
##    country     year  litRateF   gdp
##    <chr>       <chr>    <dbl> <dbl>
##  1 Afghanistan 1979      4.99    NA
##  2 Afghanistan 2011     13       NA
##  3 Albania     2001     98.3  1282.
##  4 Albania     2008     94.7  1804.
##  5 Albania     2011     95.7  1966.
##  6 Algeria     1987     35.8  1902.
##  7 Algeria     2002     60.1  1872.
##  8 Algeria     2006     63.9  2125.
##  9 Angola      2001     54.2   298.
## 10 Angola      2011     58.6   630.
## # … with 561 more rows
```

---

### right

```r
litGDPright <- right_join(litF, GDP, by=c("country", "year"))
dim(litGDPright)
```

```
## [1] 7988    4
```

```r
litGDPright
```

```
## # A tibble: 7,988 × 4
##    country             year  litRateF   gdp
##    <chr>               <chr>    <dbl> <dbl>
##  1 Albania             2001      98.3 1282.
##  2 Albania             2008      94.7 1804.
##  3 Albania             2011      95.7 1966.
##  4 Algeria             1987      35.8 1902.
##  5 Algeria             2002      60.1 1872.
##  6 Algeria             2006      63.9 2125.
##  7 Angola              2001      54.2  298.
##  8 Angola              2011      58.6  630.
##  9 Antigua and Barbuda 2001      99.4 9640.
## 10 Antigua and Barbuda 2011      99.4 9978.
## # … with 7,978 more rows
```

---

### inner

```r
litGDPinner <- inner_join(litF, GDP, by=c("country", "year"))
dim(litGDPinner)
```

```
## [1] 505   4
```

```r
litGDPinner
```

```
## # A tibble: 505 × 4
##    country             year  litRateF   gdp
##    <chr>               <chr>    <dbl> <dbl>
##  1 Albania             2001      98.3 1282.
##  2 Albania             2008      94.7 1804.
##  3 Albania             2011      95.7 1966.
##  4 Algeria             1987      35.8 1902.
##  5 Algeria             2002      60.1 1872.
##  6 Algeria             2006      63.9 2125.
##  7 Angola              2001      54.2  298.
##  8 Angola              2011      58.6  630.
##  9 Antigua and Barbuda 2001      99.4 9640.
## 10 Antigua and Barbuda 2011      99.4 9978.
## # … with 495 more rows
```

---

### full

```r
litGDPfull <- full_join(litF, GDP, by=c("country", "year"))
dim(litGDPfull)
```

```
## [1] 8054    4
```

```r
litGDPfull
```

```
## # A tibble: 8,054 × 4
##    country     year  litRateF   gdp
##    <chr>       <chr>    <dbl> <dbl>
##  1 Afghanistan 1979      4.99    NA
##  2 Afghanistan 2011     13       NA
##  3 Albania     2001     98.3  1282.
##  4 Albania     2008     94.7  1804.
##  5 Albania     2011     95.7  1966.
##  6 Algeria     1987     35.8  1902.
##  7 Algeria     2002     60.1  1872.
##  8 Algeria     2006     63.9  2125.
##  9 Angola      2001     54.2   298.
## 10 Angola      2011     58.6   630.
## # … with 8,044 more rows
```

---

## `join` to **merge** two datasets

<img src="../images/join.png" width="90%" style="display: block; margin: auto;" />

If you ever need to understand which join is the right join for you, try to find an image that lays out what the function is doing. This one, from the Statistics Globe blog, is quite good: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

---

## **purrr** for functional programming

.pull-left[
The `map` functions are *named* by the **output** they produce. For example:

* `map(.x, .f)` is the main mapping function and returns a list
* `map_df(.x, .f)` returns a data frame
* `map_dbl(.x, .f)` returns a numeric (double) vector
* `map_chr(.x, .f)` returns a character vector
* `map_lgl(.x, .f)` returns a logical vector
]

.pull-right[
<img src="../images/purrr_map.png" width="90%" style="display: block; margin: auto;" />
]

Note that the first argument is always the data object and the second argument is always the function you want to iteratively apply to each element in the input object.

---

### Input

The **input** to a `map` function is always either a *vector* (like a column), a *list* (which can be non-rectangular), or a *dataframe* (like a rectangle).
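As a rough illustration of the naming convention, the sketch below applies several `map` variants to the same small vector, and one variant to a list. The input values and the object names (`as_list`, `as_dbl`, and so on) are made up for this example.

```r
library(purrr)

x <- c(2, 5, 10)

# Same input, different output types depending on the suffix
as_list <- map(x, ~ .x + 10)                   # a list of length 3
as_dbl  <- map_dbl(x, ~ .x + 10)               # a double vector
as_chr  <- map_chr(x, ~ paste0("value_", .x))  # a character vector
as_lgl  <- map_lgl(x, ~ .x > 4)                # a logical vector

# A list input whose elements differ in length works the same way
lens <- map_int(list(a = 1:3, b = c("x", "y")), length)

as_dbl
as_lgl
lens
```

Each suffixed variant requires `.f` to return exactly one value of the stated type per element; when that is not the case, `map()` with its list output is the safer choice.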
A list is a way to hold things which might be very different in shape:

```r
a_list <- list(a_number = 5,
               a_vector = c("a", "b", "c"),
               a_dataframe = data.frame(a = 1:3,
                                        b = c("q", "b", "z"),
                                        c = c("bananas", "are", "so very great")))

a_list
```

```
## $a_number
## [1] 5
##
## $a_vector
## [1] "a" "b" "c"
##
## $a_dataframe
##   a b             c
## 1 1 q       bananas
## 2 2 b           are
## 3 3 z so very great
```

---

### Output

```r
add_ten <- function(x) {
  return(x + 10)
}
```

We can `map()` the `add_ten()` function across a vector. Note that the output is a list (the default).

```r
library(tidyverse)
map(.x = c(2, 5, 10),
    .f = add_ten)
```

```
## [[1]]
## [1] 12
##
## [[2]]
## [1] 15
##
## [[3]]
## [1] 20
```

---

### Output & Input

What if we use a different type of input? The default behavior is to still return a list!

```r
data.frame(a = 2, b = 5, c = 10) %>%
  map(add_ten)
```

```
## $a
## [1] 12
##
## $b
## [1] 15
##
## $c
## [1] 20
```

What if we want a different type of output? We use a different `map()` function, `map_df()`, for example.

```r
data.frame(a = 2, b = 5, c = 10) %>%
  map_df(add_ten)
```

```
## # A tibble: 1 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1    12    15    20
```

---

### Shorthand

Shorthand lets us get away from pre-defining the function (which will be useful). Use the tilde `~` to indicate that you have a function:

```r
data.frame(a = 2, b = 5, c = 10) %>%
  map_df(~{.x + 10})
```

```
## # A tibble: 1 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1    12    15    20
```

---

### Shorthand in more complex settings

```r
library(palmerpenguins)
library(broom)

penguins %>%
  split(.$species) %>%
  map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
  map_df(tidy)  # map(tidy)
```

```
## # A tibble: 6 × 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        -2536.     965.       -2.63 9.48e- 3
## 2 flipper_length_mm     32.8      5.08      6.47 1.34e- 9
## 3 (Intercept)        -3037.     997.       -3.05 3.33e- 3
## 4 flipper_length_mm     34.6      5.09      6.79 3.75e- 9
## 5 (Intercept)        -6787.    1093.
-6.21 7.65e- 9
## 6 flipper_length_mm     54.6      5.03     10.9  1.33e-19
```

```r
penguins %>%
  group_by(species) %>%
  group_map(~lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
  map(tidy)  # map_df(tidy)
```

```
## [[1]]
## # A tibble: 2 × 5
##   term              estimate std.error statistic       p.value
##   <chr>                <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)        -2536.     965.       -2.63 0.00948
## 2 flipper_length_mm     32.8      5.08      6.47 0.00000000134
##
## [[2]]
## # A tibble: 2 × 5
##   term              estimate std.error statistic       p.value
##   <chr>                <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)        -3037.     997.       -3.05 0.00333
## 2 flipper_length_mm     34.6      5.09      6.79 0.00000000375
##
## [[3]]
## # A tibble: 2 × 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
## 2 flipper_length_mm     54.6      5.03     10.9  1.33e-19
```

---

## `lubridate`

`lubridate` is another R package meant for data wrangling! In particular, `lubridate` makes it very easy to work with days, times, and dates. The basic idea is to start with dates in a `ymd` (year month day) format and transform the information into whatever you want.

Example from https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

---

### If anyone drove a time machine, they would crash

The lengths of months and years change so often that doing arithmetic with them can be unintuitive. <mark>Consider a simple operation, January 31st + one month.</mark>

Should the answer be:

1. February 31st (which doesn't exist)
2. March 3rd (31 days after January 31 in a non-leap year), or
3. February 28th (assuming it's not a leap year)

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date.
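The candidate answers can be checked directly in R. This is a minimal sketch, assuming the non-leap year 2021; the object names are made up for illustration, and `%m+%` / `%m-%` are lubridate's month-arithmetic operators that roll back to the last valid day of the month.

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

# Fixed-length arithmetic with days is reversible: a + b - b == a
plus31 <- jan31 + days(31)      # "2021-03-03"
back   <- plus31 - days(31)     # "2021-01-31" again

# Adding a calendar month would land on the non-existent February 31st
plus_month <- jan31 + months(1) # NA

# Rolling back to the end of February gives a valid date,
# but the round trip no longer returns to the starting point
feb        <- jan31 %m+% months(1)  # "2021-02-28"
round_trip <- feb %m-% months(1)    # "2021-01-28", not January 31st

plus31
plus_month
round_trip
```

The last two lines show why the rollback answer, although convenient, breaks the a + b - b = a property.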
Wickham wants to make lubridate as consistent as possible by invoking the following rule: <mark>if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.</mark>

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special `%m+%` and `%m-%` operators. `%m+%` and `%m-%` automatically roll dates back to the last day of the month, should that be necessary.

---

## basics in `lubridate`

```r
library(lubridate)
rightnow <- now()

day(rightnow)
```

```
## [1] 18
```

```r
week(rightnow)
```

```
## [1] 38
```

```r
month(rightnow, label=FALSE)
```

```
## [1] 9
```

```r
month(rightnow, label=TRUE)
```

```
## [1] Sep
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

```r
year(rightnow)
```

```
## [1] 2021
```

---

## basics in `lubridate`

```r
minute(rightnow)
```

```
## [1] 9
```

```r
hour(rightnow)
```

```
## [1] 14
```

```r
yday(rightnow)
```

```
## [1] 261
```

```r
mday(rightnow)
```

```
## [1] 18
```

```r
wday(rightnow, label=FALSE)
```

```
## [1] 7
```

```r
wday(rightnow, label=TRUE)
```

```
## [1] Sat
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

---

## But how do I create a date object?
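One way to answer the question in the heading: lubridate's parsing functions are named after the order of the components in the input string. The sketch below uses made-up input strings and object names; `make_date()` is shown as an alternative for building a date from separate numeric pieces.

```r
library(lubridate)

# The function name gives the order of year (y), month (m), and day (d)
d1 <- ymd("2021-09-14")
d2 <- mdy("September 16, 2021")
d3 <- dmy("14/09/2021")

# Date-times add hour, minute, and second to the name
dt <- ymd_hms("2021-09-14 14:09:00", tz = "UTC")

# Build a date from separate numeric year, month, and day values
d4 <- make_date(year = 2013, month = 1, day = 1)

d1 == d3   # both strings describe the same day
d2 - d1    # a time difference of 2 days
d4
```

All of `d1` to `d4` are ordinary `Date` objects, so the accessor functions from the previous slides (`day()`, `month()`, `wday()`, ...) apply to them as well.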
```r
jan31 <- ymd("2021-01-31")
jan31 + months(0:11)
```

```
##  [1] "2021-01-31" NA           "2021-03-31" NA           "2021-05-31"
##  [6] NA           "2021-07-31" "2021-08-31" NA           "2021-10-31"
## [11] NA           "2021-12-31"
```

```r
floor_date(jan31, "month") + months(0:11) + days(31)
```

```
##  [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
##  [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
## [11] "2021-12-02" "2022-01-01"
```

```r
jan31 + months(0:11) + days(31)
```

```
##  [1] "2021-03-03" NA           "2021-05-01" NA           "2021-07-01"
##  [6] NA           "2021-08-31" "2021-10-01" NA           "2021-12-01"
## [11] NA           "2022-01-31"
```

```r
jan31 %m+% months(0:11)
```

```
##  [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
##  [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
## [11] "2021-11-30" "2021-12-31"
```

---

## NYC flights

```r
library(nycflights13)
names(flights)
```

```
##  [1] "year"           "month"          "day"            "dep_time"
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"
## [13] "origin"         "dest"           "air_time"       "distance"
## [17] "hour"           "minute"         "time_hour"
```

---

## NYC flights

```r
flightsWK <- flights %>%
  mutate(ymdday = ymd(paste(year, month, day, sep="-"))) %>%
  mutate(weekdy = wday(ymdday, label=TRUE), whichweek = week(ymdday))

flightsWK %>% select(year, month, day, ymdday, weekdy, whichweek, dep_time,
                     arr_time, air_time)
```

```
## # A tibble: 336,776 × 9
##     year month   day ymdday     weekdy whichweek dep_time arr_time air_time
##    <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
##  1  2013     1     1 2013-01-01 Tue            1      517      830      227
##  2  2013     1     1 2013-01-01 Tue            1      533      850      227
##  3  2013     1     1 2013-01-01 Tue            1      542      923      160
##  4  2013     1     1 2013-01-01 Tue            1      544     1004      183
##  5  2013     1     1 2013-01-01 Tue            1      554      812      116
##  6  2013     1     1 2013-01-01 Tue            1      554      740      150
##  7  2013     1     1 2013-01-01 Tue            1      555      913      158
##  8  2013     1     1 2013-01-01 Tue            1      557      709       53
##  9  2013     1     1 2013-01-01 Tue            1      557      838      140
## 10  2013     1     1 2013-01-01 Tue            1      558
753 138
## # … with 336,766 more rows
```

---

## `reprex`

> Help me help you

`repr`oducible `ex`ample ...

Step 1. Copy code onto the clipboard

Step 2. Type `reprex()` into the Console

Step 3. Look at the Viewer to the right. Copy the Viewer output into GitHub, Piazza, Discord, an email, stackexchange, etc.

---

## `reprex` demo

```
reprex(
  jan31 + months(0:11) + days(31)
)
```

multiple lines of code:

```
reprex({
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})
```

```
reprex({
  library(lubridate)
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})
```