Wrangling Data I Tidy Verbs

September 9 + 11, 2024

Jo Hardin

Much of this material can be found at the introduction to dplyr vignette.

Agenda 9/9/24

  1. Tidy data
  2. Data verbs

The data

What does a data set look like?

  • Observations down the rows
  • Variables across the columns
  • Flat file versus relational database.

Active Duty Military

The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?

Tidy packages: the tidyverse

image credit: https://www.tidyverse.org/.

Reading in data from a file

Hosted online:

movies <- read.csv("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/movies2.csv")

Hosted locally:

movies <- read_csv("movies2.csv")

Things to note:

  • The assign arrow is used to create objects in R, which are stored in your environment.
  • Object names don’t have to correspond to file names.
  • Be sure R knows where to look for the file!

Viewing data - the viewer / Environment

  • View() can be used in RStudio to bring up an excel-style spreadsheet. Only for viewing, not editing!
  • The dimensions of the data are found in the environment pane.
  • The names of the variables are seen at the top of the viewer.
  • View() has a capital letter V.
  • View() should not be used in the Quarto document.

Viewing data - inside .qmd / the console

  • dim() is used to find the dimensions (rows x columns).
  • names() is used to find the names of the variables.
  • head() is used to print the first several lines of the dataset to the console.

Practice

  1. What are the dimensions of the data set?
  2. What are the variables?
  3. What appears to be the unit of observation?
dim(movies)
[1] 134   5
names(movies)
[1] "X"           "score2"      "rating2"     "genre2"      "box.office2"
head(movies,3)
                 X score2 rating2     genre2 box.office2
1 2 Fast 2 Furious   48.9   PG-13     action     127.146
2    28 Days Later   78.2       R     horror      45.065
3      A Guy Thing   39.5   PG-13 rom comedy      15.545

Reading in data from a package

For now, we’ll work with all flights out of the three NYC airports in 2013.

  1. Download and install the package from CRAN (done in the Console, only once).
install.packages("nycflights13")
  1. Load the package (in the .qmd file, need it for the .qmd file to compile appropriately).
library(nycflights13)
  1. Make the data set visible.
data(flights)
  1. Get help.
?flights

Slice and dice with dplyr (a package within the tidyverse)

dplyr

Whenever you’re learning a new tool, for a long time you’re going to suck … but the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.

-Hadley Wickham

Why dplyr?

Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.

This is all possible in base R, but with dplyr, it is simple, readable, and fast.

Verbs

Most data wrangling happens with a set of data verbs. Verbs are functions that act on data frames.

The first argument of each data verb is the data frame.

Some Basic Verbs

  • filter()
  • arrange()
  • select()
  • distinct()
  • mutate()
  • summarize()
  • sample_n()

filter()

Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, the following arguments are the filters that you’d like to apply

For all flights on January 1st:

filter(flights, month == 1, day == 1)
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Constructing filters

Filters are constructed of logical operators: <, >, <=, >=, ==, != (and some others).

Adding them one by one to filter() is akin to saying “this AND that”. To say “this OR that OR both”, use |.

filter(flights, month == 1 | month == 2)
# A tibble: 51,955 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 51,945 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Practice

Construct filters to isolate:

  1. Flights that left on St. Patrick’s Day.
  2. Flights that were destined for Chicago’s primary airport.
  3. Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
  4. Flights with flight times more than 2000 miles or that were in the air more than 5 hours.

Solution

  1. Flights that left on St. Patrick’s Day.
  2. Flights that were destined for Chicago’s primary airport.
  3. Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
  4. Flights with flight times more than 2000 miles or that were in the air more than 5 hours.
filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5*60)

arrange()

arrange() reorders the rows: It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)

Use desc() to sort in descending order.

arrange(flights, desc(arr_delay))

select()

Often you work with large datasets with many columns where only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

select(flights, year, month, day)

You can exclude columns using - and specify a range using :.

select(flights, -(year:day))

distinct()

A common use of select() is to find out which values a set of variables takes. This is particularly useful in conjunction with the distinct() verb which only returns the unique values in a table.

What do the following data correspond to?

distinct(select(flights, origin, dest))
# A tibble: 224 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 JFK    MIA  
 4 JFK    BQN  
 5 LGA    ATL  
 6 EWR    ORD  
 7 EWR    FLL  
 8 LGA    IAD  
 9 JFK    MCO  
10 LGA    ORD  
# ℹ 214 more rows

mutate()

As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

select(mutate(flights, gain = dep_delay - arr_delay), 
       flight, dep_delay, arr_delay, gain)
# A tibble: 336,776 × 4
   flight dep_delay arr_delay  gain
    <int>     <dbl>     <dbl> <dbl>
 1   1545         2        11    -9
 2   1714         4        20   -16
 3   1141         2        33   -31
 4    725        -1       -18    17
 5    461        -6       -25    19
 6   1696        -4        12   -16
 7    507        -5        19   -24
 8   5708        -3       -14    11
 9     79        -3        -8     5
10    301        -2         8   -10
# ℹ 336,766 more rows

summarize() and sample_n()

summarize() collapses a data frame to a single row based on some function. It’s not very useful yet, but it will be.

sample_n() provides you with a random sample of rows.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6
sample_n(flights, 10)
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     3    28     1731           1738        -7     1943           2004
 2  2013    12    30      640            645        -5      836            858
 3  2013     5    16     1519           1525        -6     1703           1715
 4  2013     2    17     2344           2100       164      159           2346
 5  2013     8     9      748            752        -4      931            913
 6  2013     6    28      846            815        31     1026           1030
 7  2013     7    23     2215           2129        46     2400           2326
 8  2013     5    11       23           2245        98      126           2357
 9  2013    11     5     1548           1550        -2     1655           1710
10  2013     9    10     1125           1130        -5     1317           1334
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Practice

Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.

Select the new variable and save it, along with tailnum, as a new data frame object.

Practice

Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.

Select the new variable and save it, along with tailnum, as a new data frame object.

flights2 <- mutate(flights, speed = distance/(air_time/60))
speed_data <- select(flights2, tailnum, speed)

group_by()

summarize() and sample_n() are even more powerful when combined with the idea of “group by”, repeating the operation separately on groups of observations within the dataset.

The group_by() function describes how to break a dataset down into groups of rows.

group_by()

Find the fastest airplanes in the bunch, measured as the average speed per airplane.

by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum, 
                       count = n(), 
                       avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))
# A tibble: 4,044 × 3
   tailnum count avg_speed
   <chr>   <int>     <dbl>
 1 N228UA      1      501.
 2 N315AS      1      499.
 3 N654UA      1      499.
 4 N819AW      1      490.
 5 N382HA     26      486.
 6 N388HA     36      484.
 7 N391HA     21      484.
 8 N777UA      1      483.
 9 N385HA     28      483.
10 N392HA     13      482.
# ℹ 4,034 more rows

Agenda 9/11/24

  1. Data verbs
  2. Chaining

Chaining

Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the |> operator. You start the pipeline with a data frame and then pass it to each function in turn.

The pipe syntax (|>) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:

x |> f() is the same as f(x)

x |> f(y) is the same as f(x, y)

Mornings

me_step1 <- dress(me, what = sports)  
me_step2 <- exercise(me_step1, how = running)  
me_step3 <- eat(me_step2, choice = cereal)  
me_step4 <- dress(me_step3, what = school)  
me_step5 <- commute(me_step4, transportation = bike)  

Mornings

commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)

Morning

(better??)

commute(
  dress(
    eat(
      exercise(
        dress(me, 
              what = sports), 
        how = running), 
      choice = cereal), 
    what = school), 
  transportation = bike)

Mornings

me |> 
  dress(what = sports) |> 
  exercise(how = running) |> 
  eat(choice = cereal) |> 
  dress(what = school) |> 
  commute(transportation = bike)

Little Bunny Foo Foo

From Hadley Wickham, how to think about tidy data.

Little bunny Foo Foo

Went hopping through the forest

Scooping up the field mice

And bopping them on the head

Little Bunny Foo Foo

The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.

foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_2, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)

Little Bunny Foo Foo

Another approach is to concatenate the functions so that there is only one output.

bop(
   scoop(
      hop(foo_foo, through = forest),
      up = field_mice),
   on = head)

Little Bunny Foo Foo

Or even worse, as one line:

bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)))

Little Bunny Foo Foo

Instead, the code can be written using the pipe in the order in which the function is evaluated:

foo_foo |> 
   hop(through = forest) |> 
       scoop(up = field_mice) |> 
           bop(on = head)

Flights

flights2 <- mutate(flights, speed = distance/(air_time/60))
tail_speed <- select(flights2, tailnum, speed)
tail_speed_grp <- group_by(tail_speed, tailnum)
tail_ave <- summarize(tail_speed_grp, number = n(),
                      avg_speed = mean(speed, na.rm = TRUE))
arrange(tail_ave, desc(avg_speed))
# A tibble: 4,044 × 3
   tailnum number avg_speed
   <chr>    <int>     <dbl>
 1 N228UA       1      501.
 2 N315AS       1      499.
 3 N654UA       1      499.
 4 N819AW       1      490.
 5 N382HA      26      486.
 6 N388HA      36      484.
 7 N391HA      21      484.
 8 N777UA       1      483.
 9 N385HA      28      483.
10 N392HA      13      482.
# ℹ 4,034 more rows
flights |> 
  mutate(speed = distance / (air_time/60)) |> 
  select(tailnum, speed) |> 
  group_by(tailnum) |>  
  summarize(number = n(), 
            avg_speed = mean(speed, na.rm = TRUE)) |> 
  arrange(desc(avg_speed))
# A tibble: 4,044 × 3
   tailnum number avg_speed
   <chr>    <int>     <dbl>
 1 N228UA       1      501.
 2 N315AS       1      499.
 3 N654UA       1      499.
 4 N819AW       1      490.
 5 N382HA      26      486.
 6 N388HA      36      484.
 7 N391HA      21      484.
 8 N777UA       1      483.
 9 N385HA      28      483.
10 N392HA      13      482.
# ℹ 4,034 more rows

Practice

Form a chain that creates a data frame containing only carrier and each carrier’s mean departure delay time. Which carriers have the highest and lowest mean delays?

Practice

Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?

flights |> 
  group_by(carrier) |> 
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(desc(avg_delay))
# A tibble: 16 × 2
   carrier avg_delay
   <chr>       <dbl>
 1 F9          20.2 
 2 EV          20.0 
 3 YV          19.0 
 4 FL          18.7 
 5 WN          17.7 
 6 9E          16.7 
 7 B6          13.0 
 8 VX          12.9 
 9 OO          12.6 
10 UA          12.1 
11 MQ          10.6 
12 DL           9.26
13 AA           8.59
14 AS           5.80
15 HA           4.90
16 US           3.78

Practice again

Say you’re curious about the relationship between the number of flights that each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.

Solution

delay_data <- flights |> 
  group_by(tailnum) |> 
  summarize(number = n(),
            dist = mean(distance, na.rm = TRUE), 
            delay = mean(arr_delay, na.rm = TRUE)) |> 
  filter(number > 20, dist < 2000)

Visualizing the data

delay_data |> 
 ggplot(aes(dist, delay)) +
 geom_point(aes(size = number), 
            alpha = 1/2) +
 geom_smooth() +
 scale_size_area()

When scale_size_area is used, the default behavior is to scale the area of points to be proportional to the value.