Working with Data in R

class: right, top, my-title, title-slide

# Working with Data in R
### Jo Hardin
### September 14 & 16, 2021

---

Much of this material can be found at:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

---

# Agenda  9/14/21

1. Tidy data
2. Data verbs

---

# The data

## What does a data set *look like*?

- **Observations** down the rows
- **Variables** across the columns
- Flat file versus relational database.

---

## Active Duty Military

The Active Duty data are not tidy!  What are the cases?  How are the data not tidy?  What might the data look like in tidy form?  Suppose that the case was "an individual in the armed forces."  What variables would you use to capture the information in the following table?

https://docs.google.com/spreadsheets/d/1Ow6Cm4z-Z1Yybk3i352msulYCEDOUaOghmo9ALajyHo/edit#gid=1811988794

---

## Tidy packages: the tidyverse

<div class="figure">
<img src="../images/tidyverse.png" alt="Image of hex stickers for the eight core tidyverse packages including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats." width="4107" />
<p class="caption">image credit: https://www.tidyverse.org/.</p>
</div>

---

## Reading in data from a file

Hosted online:

```r
movies <- read_csv("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/movies2.csv")
```

Hosted locally:

```r
movies <- read_csv("movies2.csv")
```

Things to note:

- The assign arrow is used to create **objects** in R, which are stored in your
environment.
- Object names don't have to correspond to file names.
- Be sure R knows where to look for the file!

---

## Viewing data - the viewer / Environment

- `View()` can be used in RStudio to bring up an excel-style spreadsheet. Only
for viewing, not editing!
- The dimensions of the data can be found in the environment pane.
- The names of the variables can be seen at the top of the viewer.
- `View()` has a capital letter `V`
- `View()` should not be used in a Markdown or Sweave document

---
## Viewing data - inside .Rmd / the console

- `head()` can be used to print the first several lines of the dataset to the
console.
- `dim()` can be used to find the dimensions (rows then columns).
- `names()` can be used to find the names of the variables.

---

## Practice

Running in to problems? Ask your neighbor or try google!

1. What are the dimensions of the data set?
2. What appears to be the unit of observation?
3. What are the variables?

```r
dim(movies)
```

```
## [1] 134   5
```

```r
head(movies,3)
```

```
## # A tibble: 3 × 5
##   ...1             score2 rating2 genre2     `box office2`
##   <chr>             <dbl> <chr>   <chr>              <dbl>
## 1 2 Fast 2 Furious   48.9 PG-13   action             127. 
## 2 28 Days Later      78.2 R       horror              45.1
## 3 A Guy Thing        39.5 PG-13   rom comedy          15.5
```

```r
names(movies)
```

```
## [1] "...1"        "score2"      "rating2"     "genre2"      "box office2"
```

---
## Reading in data from a package

For now, we'll work with all flights out of the three NYC airports in 2013.

1. Download and install the package from CRAN (done in the Console, only once).
    
    ```r
    install.packages("nycflights13")
    ```
2. Load the package (in the .Rmd file, need it for the .Rmd file to compile appropriately).
    
    ```r
    library(nycflights13)
    ```
3. Make the data set visible.
    
    ```r
    data(flights)
    ```
4. Get help.
    
    ```r
    ?flights
    ```

---

# Slice and dice with dplyr (a package within the tidyverse)

## **dplyr**

> Whenever you're learning a new tool, for a long time you're going to suck ...
> but the good news is that is typical, that's something that happens to everyone, 
> and it's only temporary.

-Hadley Wickham

---

## Why dplyr?

Data sets are often of high *volume* (lots of rows) and high *variety* (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also  often need to perform operations to organize and clean our data.

This is all possible in base R, but with `dplyr`, it is simple, readible, and fast.

---

## Some Basic Verbs

- `filter()`
- `arrange()`
- `select()`
- `distinct()`
- `mutate()`
- `summarize()`
- `sample_n()`

---

## `filter()`

Allows you to select a subset of the **rows** of a data frame. The first
argument is the name of the data frame, the following arguments are the
filters that you'd like to apply

For all flights on January 1st:

```r
filter(flights, month == 1, day == 1)
```

```
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

## Constructing filters

Filters are constructed of **logical operators**: `<`, `>`, `<=`, `>=`, `==`,
`!=` (and some others).

Adding them one by one to `filter()` is akin to saying "this AND that". To say
"this OR that OR both", use |.

```r
filter(flights, month == 1 | month == 2)
```

```
## # A tibble: 51,955 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 51,945 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---
## Practice

Construct filters to isolate:

1. Flights that left on St. Patrick's Day.
2. Flights that were destined for Chicago's primary airport.
3. Flights that were destined for Chicago's primary airport and were operated by United Airlines.
4. Flights with flight times more than 2000 miles or that were in the air more than 5 hours.

---

1. Flights that left on St. Patrick's Day.
2. Flights that were destined for Chicago's primary airport.
3. Flights that were destined for Chicago's primary airport and were operated by
United Airlines.
4. Flights with flight times more than 2000 miles or that were in the air more
than 5 hours.

```r
filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5*60)
```

---

## `arrange()`

`arrange()` reorders the rows: It takes a data frame, and a set of column 
names (or more complicated expressions) to order by. If you provide more than 
one column name, each additional column will be used to break ties in the values
of preceding columns:

```r
arrange(flights, year, month, day)
```

Use `desc()` to sort in descending order.

```r
arrange(flights, desc(arr_delay))
```

---

## `select()`

Often you work with large datasets with many columns where only a few are 
actually of interest to you. `select()` allows you to rapidly zoom in on a useful
subset using operations that usually only work on numeric variable positions:

```r
select(flights, year, month, day)
```

You can exclude columns using `-` and specify a range using `:`.

```r
select(flights, -(year:day))
```

---

## `distinct()`

A common use of `select()` is to find out which values a set of variables takes.  This is particularly useful in conjunction with the `distinct()` verb which only  returns the unique values in a table.

What do the following data correspond to?

```r
distinct(select(flights, origin, dest))
```

```
## # A tibble: 224 × 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    IAH  
##  2 LGA    IAH  
##  3 JFK    MIA  
##  4 JFK    BQN  
##  5 LGA    ATL  
##  6 EWR    ORD  
##  7 EWR    FLL  
##  8 LGA    IAD  
##  9 JFK    MCO  
## 10 LGA    ORD  
## # … with 214 more rows
```

---

## `mutate()`

As well as selecting from the set of existing columns, it's often useful to add 
new columns that are functions of existing columns. This is the job of `mutate()`:

```r
select(mutate(flights, gain = dep_delay - arr_delay), 
       flight, dep_delay, arr_delay, gain)
```

```
## # A tibble: 336,776 × 4
##    flight dep_delay arr_delay  gain
##     <int>     <dbl>     <dbl> <dbl>
##  1   1545         2        11    -9
##  2   1714         4        20   -16
##  3   1141         2        33   -31
##  4    725        -1       -18    17
##  5    461        -6       -25    19
##  6   1696        -4        12   -16
##  7    507        -5        19   -24
##  8   5708        -3       -14    11
##  9     79        -3        -8     5
## 10    301        -2         8   -10
## # … with 336,766 more rows
```

---

## `summarize()` and `sample_n()`

`summarize()` collapses a data frame to a single row. It's not very useful yet. 
`sample_n()` provides you with a random sample of the rows.

```r
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6
```

```r
sample_n(flights, 10)
```

```
## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     2     5     1749           1753        -4     2057           2105
##  2  2013     8    27     1004            955         9     1246           1310
##  3  2013     7     9     1149           1200       -11     1308           1333
##  4  2013     5     3       NA            729        NA       NA           1020
##  5  2013     2     8       NA           2035        NA       NA           2142
##  6  2013     8     7      642            645        -3      913            934
##  7  2013    10    27      759            800        -1     1138           1155
##  8  2013     2    25      558            600        -2      655            708
##  9  2013     1     4      754            755        -1      956           1030
## 10  2013     3    20     1155           1135        20     1516           1449
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

---
## Practice

**Mutate** the data to create a new column that contains the average speed traveled by the plane for each flight.

**Select** the new variable and save it, along with tailnum, as a new data frame object.

---
## Practice

**Mutate** the data to create a new column that contains the average speed traveled by the plane for each flight.

**Select** the new variable and save it, along with tailnum, as a new data frame object.

```r
flights2 <- mutate(flights, speed = distance/(air_time/60))
speed_data <- select(flights2, tailnum, speed)
```

---

## `group_by()`

`summarize()` and `sample_n()` are even more powerful when combined with the idea of  "group by", repeating the operation separately on groups of observations within the dataset. The `group_by()` function describes how to break a dataset down into groups of rows.

---

## `group_by()`

Find the fastest airplanes in the bunch, measured as the average speed per airplane.

```r
by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum, 
                       count = n(), 
                       avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))
```

```
## # A tibble: 4,044 × 3
##    tailnum count avg_speed
##    <chr>   <int>     <dbl>
##  1 N228UA      1      501.
##  2 N315AS      1      499.
##  3 N654UA      1      499.
##  4 N819AW      1      490.
##  5 N382HA     26      486.
##  6 N388HA     36      484.
##  7 N391HA     21      484.
##  8 N777UA      1      483.
##  9 N385HA     28      483.
## 10 N392HA     13      482.
## # … with 4,034 more rows
```

---

## Chaining

Instead of applying each verb step-by-step, we can chain them into a single
data pipeline, connected with the `%>%` operator. You start the pipeline with
a data frame and then pass it to each function in turn.

The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function.  The mapping goes to the first available argument in the function.  For example:

`x %>% f(y)` is the same as `f(x, y)`

` y %>% f(x, ., z)` is the same as `f(x,y,z)`

---

## Mornings

```
step1 <- dress(me, what = sports)  
step2 <- exercise(step1, how = running)  
step3 <- eat(step2, choice = cereal)  
step4 <- dress(step3, what = school)  
step5 <- commute(step4, transportation = bike)  
```
---

## Mornings

```
commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)
```

---

## Morning

(better??)

```
commute(
  dress(
    eat(
      exercise(
        dress(me, 
              what = sports), 
        how = running), 
      choice = cereal), 
    what = school), 
  transportation = bike)
```

---

## Mornings

```
me %>%
  dress(what = sports) %>%
  exercise(how = running) %>%
  eat(choice = cereal) %>%
  dress(what = school) %>%
  commute(transportation = bike)
```

---

## Mornings

```
me %>%
  dress(what = sports) %>%
  exercise(how = running) %>%
  eat(choice = cereal) %>%
  dress(what = school) %>%
  commute(transportation = bike)
```

The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function.  The mapping goes to the first available argument in the function.  For example:

`x %>% f(y)` is the same as `f(x, y)`

` y %>% f(x, ., z)` is the same as `f(x,y,z)`

---

#### Little Bunny Foo Foo

From Hadley Wickham, how to think about tidy data.

> Little bunny Foo Foo

> Went hopping through the forest

> Scooping up the field mice

> And bopping them on the head

---

#### Little Bunny Foo Foo

The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.

```
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_2, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
---

#### Little Bunny Foo Foo

Another approach is to concatenate the functions so that there is only one output.

```
bop(
   scoop(
      hop(foo_foo, through = forest),
      up = field_mice),
   on = head)
```
---

#### Little Bunny Foo Foo

Or even worse, as one line:

```
bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)))
```
---
#### Little Bunny Foo Foo

Instead, the code can be written using the pipe in the **order** in which the function is evaluated:

```
foo_foo %>%
   hop(through = forest) %>%
       scoop(up = field_mice) %>%
           bop(on = head)
```

---

#### Little Bunny Foo Foo

Instead, the code can be written using the pipe in the **order** in which the function is evaluated:

```
foo_foo %>%
   hop(through = forest) %>%
       scoop(up = field_mice) %>%
           bop(on = head)
```

The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function.  The mapping goes to the first available argument in the function.  For example:

`x %>% f(y)` is the same as `f(x, y)`

` y %>% f(x, ., z)` is the same as `f(x,y,z)`
---

```r
flights2 %>%
  select(tailnum, speed) %>%
  group_by(tailnum) %>% 
  summarize(number = n(), avg_speed = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(avg_speed))
```

```
## # A tibble: 4,044 × 3
##    tailnum number avg_speed
##    <chr>    <int>     <dbl>
##  1 N228UA       1      501.
##  2 N315AS       1      499.
##  3 N654UA       1      499.
##  4 N819AW       1      490.
##  5 N382HA      26      486.
##  6 N388HA      36      484.
##  7 N391HA      21      484.
##  8 N777UA       1      483.
##  9 N385HA      28      483.
## 10 N392HA      13      482.
## # … with 4,034 more rows
```

---

## Practice

Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?

---

## Practice

Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?

```r
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
```

```
## # A tibble: 16 × 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9          20.2 
##  2 EV          20.0 
##  3 YV          19.0 
##  4 FL          18.7 
##  5 WN          17.7 
##  6 9E          16.7 
##  7 B6          13.0 
##  8 VX          12.9 
##  9 OO          12.6 
## 10 UA          12.1 
## 11 MQ          10.6 
## 12 DL           9.26
## 13 AA           8.59
## 14 AS           5.80
## 15 HA           4.90
## 16 US           3.78
```

---

## Practice again

Say you're curious about the relationship between the number of flights each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.

---

## Practice again

```r
delay_data <- flights %>%
  group_by(tailnum) %>%
  summarize(number = n(),
            dist = mean(distance, na.rm = TRUE), 
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(number > 20, dist < 2000)
```

## Visualizing the data

.pull-left[ 
![](2021-09-14-wrangling_files/figure-html/unnamed-chunk-25-1.png)
]

.pull-right[

```r
delay_data %>%
 ggplot(aes(dist, delay)) +
 geom_point(aes(size = number), 
            alpha = 1/2) +
 geom_smooth() +
 scale_size_area()
```
When `scale_size_area` is used, the default behavior is to scale the area of points to be proportional to the value.
]

---

# Agenda  9/16/21

1. Relational data (`_join`)
2. `pivot`ing
3. `map`ping
4. **lubridate**

---

## Relational data (multiple data frames)

See the [RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/)

---
  
  
## Joining two (or more) dataframes:
  
* <mark>`left_join`</mark> returns all rows from the left table, and any rows with matching keys from the right table.
* <mark>`inner_join`</mark> returns only the rows in which the left table have matching keys in the right table (i.e., matching rows in both sets).
* <mark>`full_join`</mark> returns all rows from both tables, join records from the left which have matching keys in the right table.

Good practice:  always specify the `by` argument when joining data frames.

---

## Women in Science

10 women in science who changed the world (source: Discover Magazine)

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> profession </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Ada Lovelace </td>
   <td style="text-align:left;"> Mathematician </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Marie Curie </td>
   <td style="text-align:left;"> Physicist and Chemist </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Janaki Ammal </td>
   <td style="text-align:left;"> Botanist </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Chien-Shiung Wu </td>
   <td style="text-align:left;"> Physicist </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Katherine Johnson </td>
   <td style="text-align:left;"> Mathematician </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Rosalind Franklin </td>
   <td style="text-align:left;"> Chemist </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Vera Rubin </td>
   <td style="text-align:left;"> Astronomer </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gladys West </td>
   <td style="text-align:left;"> Mathematician </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Flossie Wong-Staal </td>
   <td style="text-align:left;"> Virologist and Molecular Biologist </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Jennifer Doudna </td>
   <td style="text-align:left;"> Biochemist </td>
  </tr>
</tbody>
</table>

.footnote[Example and gifs from [DataScience in a Box](https://datasciencebox.org/)]
---

## Inputs

Information on women scientists

.panelset[

.panel[.panel-name[professions]

```r
professions
```

```
## # A tibble: 10 × 2
##    name               profession                        
##    <chr>              <chr>                             
##  1 Ada Lovelace       Mathematician                     
##  2 Marie Curie        Physicist and Chemist             
##  3 Janaki Ammal       Botanist                          
##  4 Chien-Shiung Wu    Physicist                         
##  5 Katherine Johnson  Mathematician                     
##  6 Rosalind Franklin  Chemist                           
##  7 Vera Rubin         Astronomer                        
##  8 Gladys West        Mathematician                     
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist
```
]

.panel[.panel-name[dates]

```r
dates
```

```
## # A tibble: 8 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947         NA
## 8 Jennifer Doudna          1964         NA
```
]

.panel[.panel-name[works]

```r
works
```

```
## # A tibble: 9 × 2
##   name               known_for                                                  
##   <chr>              <chr>                                                      
## 1 Ada Lovelace       first computer algorithm                                   
## 2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
## 3 Janaki Ammal       hybrid species, biodiversity protection                    
## 4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
## 5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
## 6 Vera Rubin         existence of dark matter                                   
## 7 Gladys West        mathematical modeling of the shape of the Earth which serv…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
## 9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…
```
]

]

---

## Desired output

We'd like to put together the data to look like:

```
## # A tibble: 10 × 5
##    name               profession      birth_year death_year known_for           
##    <chr>              <chr>                <dbl>      <dbl> <chr>               
##  1 Ada Lovelace       Mathematician           NA         NA first computer algo…
##  2 Marie Curie        Physicist and …         NA         NA theory of radioacti…
##  3 Janaki Ammal       Botanist              1897       1984 hybrid species, bio…
##  4 Chien-Shiung Wu    Physicist             1912       1997 confim and refine t…
##  5 Katherine Johnson  Mathematician         1918       2020 calculations of orb…
##  6 Rosalind Franklin  Chemist               1920       1958 <NA>                
##  7 Vera Rubin         Astronomer            1928       2016 existence of dark m…
##  8 Gladys West        Mathematician         1930         NA mathematical modeli…
##  9 Flossie Wong-Staal Virologist and…       1947         NA first scientist to …
## 10 Jennifer Doudna    Biochemist            1964         NA one of the primary …
```

---

## Inputs, reminder

.pull-left[

```r
nrow(professions)
```

```
## [1] 10
```

```r
nrow(dates)
```

```
## [1] 8
```

```r
nrow(works)
```

```
## [1] 9
```
]
.pull-right[

```r
names(professions)
```

```
## [1] "name"       "profession"
```

```r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

```r
names(works)
```

```
## [1] "name"      "known_for"
```
]

---

## Setup

For the next few slides...

.pull-left[

```r
x
```

```
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
```
]
.pull-right[

```r
y
```

```
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="../images/left-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
left_join(x, y, by = "id")
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```
]

---

## `left_join()`

```r
professions %>%
  left_join(dates, by = "name") 
```

```
## # A tibble: 10 × 4
##    name               profession                         birth_year death_year
##    <chr>              <chr>                                   <dbl>      <dbl>
##  1 Ada Lovelace       Mathematician                              NA         NA
##  2 Marie Curie        Physicist and Chemist                      NA         NA
##  3 Janaki Ammal       Botanist                                 1897       1984
##  4 Chien-Shiung Wu    Physicist                                1912       1997
##  5 Katherine Johnson  Mathematician                            1918       2020
##  6 Rosalind Franklin  Chemist                                  1920       1958
##  7 Vera Rubin         Astronomer                               1928       2016
##  8 Gladys West        Mathematician                            1930         NA
##  9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
## 10 Jennifer Doudna    Biochemist                               1964         NA
```

---

## `right_join()`

.pull-left[
<img src="../images/right-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
right_join(x, y, by = "id")
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```
]

---

## `right_join()`

```r
professions %>%
  right_join(dates, by = "name") 
```

```
## # A tibble: 8 × 4
##   name               profession                         birth_year death_year
##   <chr>              <chr>                                   <dbl>      <dbl>
## 1 Janaki Ammal       Botanist                                 1897       1984
## 2 Chien-Shiung Wu    Physicist                                1912       1997
## 3 Katherine Johnson  Mathematician                            1918       2020
## 4 Rosalind Franklin  Chemist                                  1920       1958
## 5 Vera Rubin         Astronomer                               1928       2016
## 6 Gladys West        Mathematician                            1930         NA
## 7 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
## 8 Jennifer Doudna    Biochemist                               1964         NA
```

---

## `full_join()`

.pull-left[
<img src="../images/full-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
full_join(x, y, by = "id")
```

```
## # A tibble: 4 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>   
## 4     4 <NA>    y4
```
]

---

## `full_join()`

```r
dates %>%
  full_join(works, by = "name") 
```

```
## # A tibble: 10 × 4
##    name               birth_year death_year known_for                           
##    <chr>                   <dbl>      <dbl> <chr>                               
##  1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
##  2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioac…
##  3 Katherine Johnson        1918       2020 calculations of orbital mechanics c…
##  4 Rosalind Franklin        1920       1958 <NA>                                
##  5 Vera Rubin               1928       2016 existence of dark matter            
##  6 Gladys West              1930         NA mathematical modeling of the shape …
##  7 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cr…
##  8 Jennifer Doudna          1964         NA one of the primary developers of CR…
##  9 Ada Lovelace               NA         NA first computer algorithm            
## 10 Marie Curie                NA         NA theory of radioactivity,  discovery…
```

---

## `inner_join()`

.pull-left[
<img src="../images/inner-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
inner_join(x, y, by = "id")
```

```
## # A tibble: 2 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2
```
]

---

## `inner_join()`

```r
dates %>%
  inner_join(works, by = "name") 
```

```
## # A tibble: 7 × 4
##   name               birth_year death_year known_for                            
##   <chr>                   <dbl>      <dbl> <chr>                                
## 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
## 2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioact…
## 3 Katherine Johnson        1918       2020 calculations of orbital mechanics cr…
## 4 Vera Rubin               1928       2016 existence of dark matter             
## 5 Gladys West              1930         NA mathematical modeling of the shape o…
## 6 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cre…
## 7 Jennifer Doudna          1964         NA one of the primary developers of CRI…
```

---

## `semi_join()`

.pull-left[
<img src="../images/semi-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
semi_join(x, y, by = "id")
```

```
## # A tibble: 2 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2
```
]

---

## `semi_join()`

```r
dates %>%
  semi_join(works, by = "name") 
```

```
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947         NA
## 7 Jennifer Doudna          1964         NA
```

---

## `anti_join()`

.pull-left[
<img src="../images/anti-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
anti_join(x, y, by = "id")
```

```
## # A tibble: 1 × 2
##      id value_x
##   <dbl> <chr>  
## 1     3 x3
```
]

---

## `anti_join()`

```r
dates %>%
  anti_join(works, by = "name") 
```

```
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958
```

---

## Putting it altogether

```r
professions %>%
  left_join(dates, by = "name") %>%
  left_join(works, by = "name")
```

---

### From wide to long and long to wide

* <mark>`pivot_longer()`</mark> makes the dataframe "longer" -- many columns into a few columns (more rows): `pivot_longer(data, cols,  names_to = , value_to = )`
* <mark>`pivot_wider()`</mark> makes the dataframe "wider" -- a few columns into many columns (fewer rows): `pivot_wider(data, names_from = , values_from = )`

---
  
### `pivot_longer`
  
`pivot_longer` will be demonstrated using datasets come from GapMinder.  The first represents country, year, and <mark>female literacy rate</mark>.

```r
library(googlesheets4)
gs4_deauth()
litF <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")

head(litF)
```

```
## # A tibble: 6 × 38
##   `Adult (15+) literacy… `1975` `1976` `1977` `1978` `1979` `1980` `1981` `1982`
##   <chr>                   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Afghanistan                NA     NA     NA     NA   4.99     NA     NA     NA
## 2 Albania                    NA     NA     NA     NA  NA        NA     NA     NA
## 3 Algeria                    NA     NA     NA     NA  NA        NA     NA     NA
## 4 Andorra                    NA     NA     NA     NA  NA        NA     NA     NA
## 5 Angola                     NA     NA     NA     NA  NA        NA     NA     NA
## 6 Anguilla                   NA     NA     NA     NA  NA        NA     NA     NA
## # … with 29 more variables: 1983 <dbl>, 1984 <dbl>, 1985 <dbl>, 1986 <dbl>,
## #   1987 <dbl>, 1988 <dbl>, 1989 <dbl>, 1990 <dbl>, 1991 <dbl>, 1992 <dbl>,
## #   1993 <dbl>, 1994 <dbl>, 1995 <dbl>, 1996 <dbl>, 1997 <dbl>, 1998 <dbl>,
## #   1999 <dbl>, 2000 <dbl>, 2001 <dbl>, 2002 <dbl>, 2003 <dbl>, 2004 <dbl>,
## #   2005 <dbl>, 2006 <dbl>, 2007 <dbl>, 2008 <dbl>, 2009 <dbl>, 2010 <dbl>,
## #   2011 <dbl>
```

---
### `pivot_longer`

```r
litF <- litF %>% select(country=starts_with("Adult"), 
                              starts_with("1"), starts_with("2")) %>%
  pivot_longer(cols = -country, 
                      names_to = "year", values_to = "litRateF") %>%
  filter(!is.na(litRateF))

litF
```

```
## # A tibble: 571 × 3
##    country     year  litRateF
##    <chr>       <chr>    <dbl>
##  1 Afghanistan 1979      4.99
##  2 Afghanistan 2011     13   
##  3 Albania     2001     98.3 
##  4 Albania     2008     94.7 
##  5 Albania     2011     95.7 
##  6 Algeria     1987     35.8 
##  7 Algeria     2002     60.1 
##  8 Algeria     2006     63.9 
##  9 Angola      2001     54.2 
## 10 Angola      2011     58.6 
## # … with 561 more rows
```

---
  
### `pivot_longer`
  
The second dataset (GDP) represents country, year, and <mark>gdp</mark> (in fixed 2000 US$).

```r
GDP <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

GDP <- GDP %>% select(country = starts_with("Income"), 
                            starts_with("1"), starts_with("2")) %>%
  pivot_longer(cols = -country, names_to = "year", 
                      values_to = "gdp") %>%
  filter(!is.na(gdp))

GDP
```

```
## # A tibble: 7,988 × 3
##    country year    gdp
##    <chr>   <chr> <dbl>
##  1 Albania 1980  1061.
##  2 Albania 1981  1100.
##  3 Albania 1982  1111.
##  4 Albania 1983  1101.
##  5 Albania 1984  1065.
##  6 Albania 1985  1060.
##  7 Albania 1986  1092.
##  8 Albania 1987  1054.
##  9 Albania 1988  1014.
## 10 Albania 1989  1092.
## # … with 7,978 more rows
```

---
  
### `pivot_wider`

`pivot_wider`  will be demonstrated using the babynames dataset.

```r
library(babynames)

babynames
```

```
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows
```

---

```r
babynames %>% select(-prop) %>%
  pivot_wider(names_from = sex, values_from = n) 
```

```
## # A tibble: 1,756,284 × 4
##     year name          F     M
##    <dbl> <chr>     <int> <int>
##  1  1880 Mary       7065    27
##  2  1880 Anna       2604    12
##  3  1880 Emma       2003    10
##  4  1880 Elizabeth  1939     9
##  5  1880 Minnie     1746     9
##  6  1880 Margaret   1578    NA
##  7  1880 Ida        1472     8
##  8  1880 Alice      1414    NA
##  9  1880 Bertha     1320    NA
## 10  1880 Sarah      1288    NA
## # … with 1,756,274 more rows
```

---

```r
babynames %>% 
  select(-prop) %>% 
  pivot_wider(names_from = sex, values_from = n) %>%
  filter(!is.na(F), !is.na(M)) %>%
  arrange(desc(year), desc(M))
```

```
## # A tibble: 168,381 × 4
##     year name         F     M
##    <dbl> <chr>    <int> <int>
##  1  2017 Liam        36 18728
##  2  2017 Noah       170 18326
##  3  2017 William     18 14904
##  4  2017 James       77 14232
##  5  2017 Logan     1103 13974
##  6  2017 Benjamin     8 13733
##  7  2017 Mason       58 13502
##  8  2017 Elijah      26 13268
##  9  2017 Oliver      15 13141
## 10  2017 Jacob       16 13106
## # … with 168,371 more rows
```

---

```r
babynames %>% 
  pivot_wider(names_from = sex, values_from = n) %>%
  mutate(maxcount = pmax(F,M, na.rm = TRUE)) %>%
  arrange(desc(maxcount))
```

```
## # A tibble: 1,924,653 × 6
##     year name      prop     F     M maxcount
##    <dbl> <chr>    <dbl> <int> <int>    <int>
##  1  1947 Linda   0.0548 99686    NA    99686
##  2  1948 Linda   0.0552 96209    NA    96209
##  3  1947 James   0.0510    NA 94756    94756
##  4  1957 Michael 0.0424    NA 92695    92695
##  5  1947 Robert  0.0493    NA 91642    91642
##  6  1949 Linda   0.0518 91016    NA    91016
##  7  1956 Michael 0.0423    NA 90620    90620
##  8  1958 Michael 0.0420    NA 90520    90520
##  9  1948 James   0.0497    NA 88588    88588
## 10  1954 Michael 0.0428    NA 88514    88514
## # … with 1,924,643 more rows
```

---

## Practice `litF` and `GDP` from Gapminder.

### left

```r
litGDPleft <- left_join(litF, GDP, by=c("country", "year"))
dim(litGDPleft)
```

```
## [1] 571   4
```

```r
litGDPleft
```

```
## # A tibble: 571 × 4
##    country     year  litRateF   gdp
##    <chr>       <chr>    <dbl> <dbl>
##  1 Afghanistan 1979      4.99   NA 
##  2 Afghanistan 2011     13      NA 
##  3 Albania     2001     98.3  1282.
##  4 Albania     2008     94.7  1804.
##  5 Albania     2011     95.7  1966.
##  6 Algeria     1987     35.8  1902.
##  7 Algeria     2002     60.1  1872.
##  8 Algeria     2006     63.9  2125.
##  9 Angola      2001     54.2   298.
## 10 Angola      2011     58.6   630.
## # … with 561 more rows
```

---

### right

```r
litGDPright <- right_join(litF, GDP, by=c("country", "year"))
dim(litGDPright)
```

```
## [1] 7988    4
```

```r
litGDPright
```

```
## # A tibble: 7,988 × 4
##    country             year  litRateF   gdp
##    <chr>               <chr>    <dbl> <dbl>
##  1 Albania             2001      98.3 1282.
##  2 Albania             2008      94.7 1804.
##  3 Albania             2011      95.7 1966.
##  4 Algeria             1987      35.8 1902.
##  5 Algeria             2002      60.1 1872.
##  6 Algeria             2006      63.9 2125.
##  7 Angola              2001      54.2  298.
##  8 Angola              2011      58.6  630.
##  9 Antigua and Barbuda 2001      99.4 9640.
## 10 Antigua and Barbuda 2011      99.4 9978.
## # … with 7,978 more rows
```

---

### inner

```r
litGDPinner <- inner_join(litF, GDP, by=c("country", "year"))
dim(litGDPinner)
```

```
## [1] 505   4
```

```r
litGDPinner
```

```
## # A tibble: 505 × 4
##    country             year  litRateF   gdp
##    <chr>               <chr>    <dbl> <dbl>
##  1 Albania             2001      98.3 1282.
##  2 Albania             2008      94.7 1804.
##  3 Albania             2011      95.7 1966.
##  4 Algeria             1987      35.8 1902.
##  5 Algeria             2002      60.1 1872.
##  6 Algeria             2006      63.9 2125.
##  7 Angola              2001      54.2  298.
##  8 Angola              2011      58.6  630.
##  9 Antigua and Barbuda 2001      99.4 9640.
## 10 Antigua and Barbuda 2011      99.4 9978.
## # … with 495 more rows
```

---

### full

```r
litGDPfull <- full_join(litF, GDP, by=c("country", "year"))
dim(litGDPfull)
```

```
## [1] 8054    4
```

```r
litGDPfull
```

```
## # A tibble: 8,054 × 4
##    country     year  litRateF   gdp
##    <chr>       <chr>    <dbl> <dbl>
##  1 Afghanistan 1979      4.99   NA 
##  2 Afghanistan 2011     13      NA 
##  3 Albania     2001     98.3  1282.
##  4 Albania     2008     94.7  1804.
##  5 Albania     2011     95.7  1966.
##  6 Algeria     1987     35.8  1902.
##  7 Algeria     2002     60.1  1872.
##  8 Algeria     2006     63.9  2125.
##  9 Angola      2001     54.2   298.
## 10 Angola      2011     58.6   630.
## # … with 8,044 more rows
```
---

## `join` to **merge** two datasets

If you ever need to understand which join is the right join for you, try to find an image that will lay out what the function is doing.  I found this one that is quite good and is taken from Statistics Globe blog: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

---

## **purrr** for functional programming

.pull-left[
The `map` functions are *named* by the **output** the produce.  For example:

* `map(.x, .f)` is the main mapping function and returns a list

* `map_df(.x, .f)` returns a data frame

* `map_dbl(.x, .f)` returns a numeric (double) vector

* `map_chr(.x, .f)` returns a character vector

* `map_lgl(.x, .f)` returns a logical vector
]

.pull-right[
<img src="../images/purrr_map.png" width="90%" style="display: block; margin: auto;" />
]

Note that the first argument is always the data object and the second object is always the function you want to iteratively apply to each element in the input object.

---

### Input

The **input** to a `map` function is always either a *vector* (like a column), a *list* (which can be non-rectangular), or a *dataframe* (like a rectangle).

A list is a way to hold things which might be very different in shape:

```r
a_list <- list(a_number = 5,
                      a_vector = c("a", "b", "c"),
                      a_dataframe = data.frame(a = 1:3, 
                                               b = c("q", "b", "z"), 
                                               c = c("bananas", "are", "so very great")))

a_list
```

```
## $a_number
## [1] 5
## 
## $a_vector
## [1] "a" "b" "c"
## 
## $a_dataframe
##   a b             c
## 1 1 q       bananas
## 2 2 b           are
## 3 3 z so very great
```

---

###  Output

```r
add_ten <- function(x) {
  return(x + 10)
  }
```

We can `map()` the `add_ten()` function across a vector.  Note that the output is a list (the default).

```r
library(tidyverse)
map(.x = c(2, 5, 10),
    .f = add_ten)
```

```
## [[1]]
## [1] 12
## 
## [[2]]
## [1] 15
## 
## [[3]]
## [1] 20
```

---

###  Output & Input

What if we use a different type of input?  The default behavior is to still return a list!

```r
data.frame(a = 2, b = 5, c = 10) %>%
  map(add_ten)
```

```
## $a
## [1] 12
## 
## $b
## [1] 15
## 
## $c
## [1] 20
```

What if we want a different type of output?  We use a different `map()` function, `map_df()`, for example.

```r
data.frame(a = 2, b = 5, c = 10) %>%
  map_df(add_ten)
```

```
## # A tibble: 1 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1    12    15    20
```

---

### Shorthand

Shorthand lets us get away from pre-defining the function (which will be useful).  Use the tilde `~` to indicate that you have a function:

```r
data.frame(a = 2, b = 5, c = 10) %>%
  map_df(~{.x + 10})
```

```
## # A tibble: 1 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1    12    15    20
```

---
### Shorthand in more complex settings

```r
library(palmerpenguins)
library(broom)

penguins %>%
  split(.$species) %>%
  map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
  map_df(tidy)  # map(tidy)
```

```
## # A tibble: 6 × 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        -2536.     965.       -2.63 9.48e- 3
## 2 flipper_length_mm     32.8      5.08      6.47 1.34e- 9
## 3 (Intercept)        -3037.     997.       -3.05 3.33e- 3
## 4 flipper_length_mm     34.6      5.09      6.79 3.75e- 9
## 5 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
## 6 flipper_length_mm     54.6      5.03     10.9  1.33e-19
```

```r
penguins %>%
  group_by(species) %>%
  group_map(~lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
  map(tidy)  # map_df(tidy)
```

```
## [[1]]
## # A tibble: 2 × 5
##   term              estimate std.error statistic       p.value
##   <chr>                <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)        -2536.     965.       -2.63 0.00948      
## 2 flipper_length_mm     32.8      5.08      6.47 0.00000000134
## 
## [[2]]
## # A tibble: 2 × 5
##   term              estimate std.error statistic       p.value
##   <chr>                <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)        -3037.     997.       -3.05 0.00333      
## 2 flipper_length_mm     34.6      5.09      6.79 0.00000000375
## 
## [[3]]
## # A tibble: 2 × 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
## 2 flipper_length_mm     54.6      5.03     10.9  1.33e-19
```

---

## `lubridate`
  
`lubridate` is a another R package meant for data wrangling!

In particular, `lubridate` makes it very easy to work with days, times, and dates.  The base idea is to start with dates in a `ymd` (year month day) format and transform the information into whatever you want.

Example from  https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

---

### If anyone drove a time machine, they would crash

The length of months and years change so often that doing arithmetic with them can be unintuitive. <mark>Consider a simple operation, January 31st + one month.</mark>

---

### If anyone drove a time machine, they would crash

The length of months and years change so often that doing arithmetic with them can be unintuitive. <mark>Consider a simple operation, January 31st + one month.</mark>

Should the answer be:
  
1. February 31st (which doesn't exist)
2. March 4th (31 days after January 31), or
3. February 28th (assuming its not a leap year)

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to  make lubridate as consistent as possible by invoking the following rule: <mark>if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA. </mark>

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special `%m+%` and `%m-%` operators. `%m+%` and `%m-%` automatically roll dates back to the last day of the month, should that be necessary.
---

## basics in `lubridate`

```r
library(lubridate); rightnow <- now()

day(rightnow)
```

```
## [1] 18
```

```r
week(rightnow)
```

```
## [1] 38
```

```r
month(rightnow, label=FALSE)
```

```
## [1] 9
```

```r
month(rightnow, label=TRUE)
```

```
## [1] Sep
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

```r
year(rightnow)
```

```
## [1] 2021
```

---
## basics in `lubridate`

```r
minute(rightnow)
```

```
## [1] 9
```

```r
hour(rightnow)
```

```
## [1] 14
```

```r
yday(rightnow)
```

```
## [1] 261
```

```r
mday(rightnow)
```

```
## [1] 18
```

```r
wday(rightnow, label=FALSE)
```

```
## [1] 7
```

```r
wday(rightnow, label=TRUE)
```

```
## [1] Sat
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

---

## But how do I create a date object?

```r
jan31 <- ymd("2021-01-31")
jan31 + months(0:11)
```

```
##  [1] "2021-01-31" NA           "2021-03-31" NA           "2021-05-31"
##  [6] NA           "2021-07-31" "2021-08-31" NA           "2021-10-31"
## [11] NA           "2021-12-31"
```

```r
floor_date(jan31, "month") + months(0:11) + days(31)
```

```
##  [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
##  [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
## [11] "2021-12-02" "2022-01-01"
```

```r
jan31 + months(0:11) + days(31)
```

```
##  [1] "2021-03-03" NA           "2021-05-01" NA           "2021-07-01"
##  [6] NA           "2021-08-31" "2021-10-01" NA           "2021-12-01"
## [11] NA           "2022-01-31"
```

```r
jan31 %m+% months(0:11)
```

```
##  [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
##  [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
## [11] "2021-11-30" "2021-12-31"
```

---

## NYC flights

```r
library(nycflights13)
names(flights)
```

```
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"
```

---

## NYC flights

```r
flightsWK <- flights %>% 
   mutate(ymdday = ymd(paste(year, month,day, sep="-"))) %>%
   mutate(weekdy = wday(ymdday, label=TRUE), 
          whichweek = week(ymdday))

flightsWK %>% select(year, month, day, ymdday, weekdy, whichweek, 
                     dep_time, arr_time, air_time) 
```

```
## # A tibble: 336,776 × 9
##     year month   day ymdday     weekdy whichweek dep_time arr_time air_time
##    <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
##  1  2013     1     1 2013-01-01 Tue            1      517      830      227
##  2  2013     1     1 2013-01-01 Tue            1      533      850      227
##  3  2013     1     1 2013-01-01 Tue            1      542      923      160
##  4  2013     1     1 2013-01-01 Tue            1      544     1004      183
##  5  2013     1     1 2013-01-01 Tue            1      554      812      116
##  6  2013     1     1 2013-01-01 Tue            1      554      740      150
##  7  2013     1     1 2013-01-01 Tue            1      555      913      158
##  8  2013     1     1 2013-01-01 Tue            1      557      709       53
##  9  2013     1     1 2013-01-01 Tue            1      557      838      140
## 10  2013     1     1 2013-01-01 Tue            1      558      753      138
## # … with 336,766 more rows
```

---

## `reprex`

> Help me help you

`repr`oducible `ex`ample ...

Step 1.  Copy code onto the clipboard

Step 2.  Type `reprex()` into the Console

Step 3.  Look at the Viewer to the right.  Copy the Viewer output into GitHub, Piazza, Discord, an email, stackexchange, etc.

---

##  `reprex` demo

```
reprex(
  jan31 + months(0:11) + days(31)
)
```
multiple lines of code:

```
reprex({
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})
```

```
reprex({
  library(lubridate)
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})
```