September 9 + 11, 2024
Much of this material can be found at the introduction to dplyr vignette.
The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?
image credit: https://www.tidyverse.org/.
Hosted online:
Hosted locally:
Things to note:
View() can be used in RStudio to bring up an Excel-style spreadsheet. Only for viewing, not editing!
View() has a capital letter V.
View() should not be used in the Quarto document.
dim() is used to find the dimensions (rows x columns).
names() is used to find the names of the variables.
head() is used to print the first several lines of the dataset to the console.
[1] 134 5
[1] "...1" "score2" "rating2" "genre2" "box office2"
# A tibble: 3 × 5
...1 score2 rating2 genre2 `box office2`
<chr> <dbl> <chr> <chr> <dbl>
1 2 Fast 2 Furious 48.9 PG-13 action 127.
2 28 Days Later 78.2 R horror 45.1
3 A Guy Thing 39.5 PG-13 rom comedy 15.5
For now, we’ll work with all flights out of the three NYC airports in 2013.
Whenever you’re learning a new tool, for a long time you’re going to suck … but the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.
-Hadley Wickham
Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.
This is all possible in base R, but with dplyr, it is simple, readable, and fast.
Most data wrangling happens with a set of data verbs. Verbs are functions that act on dataframes.
The first argument of each data verb is the dataframe.
filter()
arrange()
select()
distinct()
mutate()
summarize()
sample_n()
filter()
Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame; the following arguments are the filters that you'd like to apply.
For all flights on January 1st:
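The output below is consistent with a call along these lines (a sketch, assuming the `flights` data frame from the nycflights13 package is loaded):

```r
library(nycflights13)
library(dplyr)

# All flights departing on January 1st
filter(flights, month == 1, day == 1)
```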
# A tibble: 842 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Filters are constructed of logical operators: <, >, <=, >=, ==, != (and some others).
Adding them one by one to filter() is akin to saying "this AND that". To say "this OR that OR both", use |.
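For example, the 51,955-row output below matches flights in January or February (a sketch, assuming `flights` is loaded):

```r
# Flights departing in January OR February (or both -- | allows either)
filter(flights, month == 1 | month == 2)
```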
# A tibble: 51,955 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 51,945 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Construct filters to isolate:
arrange()
arrange() reorders the rows: it takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Use desc() to sort in descending order.
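A sketch of typical calls (assuming `flights` is loaded):

```r
# Sort by year, then use month and day to break ties
arrange(flights, year, month, day)

# Longest departure delays first
arrange(flights, desc(dep_delay))
```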
select()
Often you work with large datasets with many columns where only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
You can exclude columns using - and specify a range using :.
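A sketch of each pattern (assuming `flights` is loaded):

```r
# Keep three columns by name
select(flights, year, month, day)

# Select a range of columns with :
select(flights, year:day)

# Exclude a range of columns with -
select(flights, -(year:day))
```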
distinct()
A common use of select() is to find out which values a set of variables takes. This is particularly useful in conjunction with the distinct() verb, which returns only the unique rows in a table.
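For example (a sketch, assuming `flights` is loaded):

```r
# Unique origin-destination pairs flown in 2013
distinct(select(flights, origin, dest))
```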
What do the following data correspond to?
mutate()
As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of mutate():
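The gain column in the output below equals dep_delay - arr_delay, so a call along these lines would produce it (assuming `flights` is loaded):

```r
# Minutes made up in the air: gain = dep_delay - arr_delay
mutate(select(flights, flight, dep_delay, arr_delay),
       gain = dep_delay - arr_delay)
```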
# A tibble: 336,776 × 4
flight dep_delay arr_delay gain
<int> <dbl> <dbl> <dbl>
1 1545 2 11 -9
2 1714 4 20 -16
3 1141 2 33 -31
4 725 -1 -18 17
5 461 -6 -25 19
6 1696 -4 12 -16
7 507 -5 19 -24
8 5708 -3 -14 11
9 79 -3 -8 5
10 301 -2 8 -10
# ℹ 336,766 more rows
summarize()
and sample_n()
summarize() collapses a data frame to a single row based on some function. It's not very useful yet, but it will be.
sample_n() provides you with a random sample of rows.
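The two outputs below are consistent with calls along these lines (assuming `flights` is loaded):

```r
# Mean departure delay across all flights (collapses to one row)
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

# Ten randomly sampled rows
sample_n(flights, 10)
```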
# A tibble: 1 × 1
delay
<dbl>
1 12.6
# A tibble: 10 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 2 4 2306 2250 16 2357 2353
2 2013 11 12 1221 1215 6 1409 1445
3 2013 9 21 939 940 -1 1211 1235
4 2013 11 19 727 730 -3 830 844
5 2013 8 15 651 655 -4 942 930
6 2013 7 8 1015 1025 -10 1204 1222
7 2013 7 30 1159 1200 -1 1318 1310
8 2013 7 11 1743 1735 8 2018 2030
9 2013 5 20 2114 2117 -3 2335 2351
10 2013 2 24 1443 1445 -2 1652 1710
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.
Select the new variable and save it, along with tailnum, as a new data frame object.
group_by()
summarize() and sample_n() are even more powerful when combined with the idea of "group by", repeating the operation separately on groups of observations within the dataset.
The group_by() function describes how to break a dataset down into groups of rows.
group_by()
Find the fastest airplanes in the bunch, measured as the average speed per airplane.
by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum,
count = n(),
avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))
# A tibble: 4,044 × 3
tailnum count avg_speed
<chr> <int> <dbl>
1 N228UA 1 501.
2 N315AS 1 499.
3 N654UA 1 499.
4 N819AW 1 490.
5 N382HA 26 486.
6 N388HA 36 484.
7 N391HA 21 484.
8 N777UA 1 483.
9 N385HA 28 483.
10 N392HA 13 482.
# ℹ 4,034 more rows
Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the |> operator. You start the pipeline with a data frame and then pass it to each function in turn.
The pipe syntax (|>) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:
x |> f() is the same as f(x)
x |> f(y) is the same as f(x, y)
(better??)
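A minimal base-R illustration of these equivalences:

```r
# x |> f() is f(x); x |> f(y) is f(x, y)
c(1, 4, 9) |> sqrt()           # same as sqrt(c(1, 4, 9))
c(1, 4, 9) |> sqrt() |> sum()  # same as sum(sqrt(c(1, 4, 9)))
```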
From Hadley Wickham, how to think about tidy data.
Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head
The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.
Another approach is to concatenate the functions so that there is only one output.
Or even worse, as one line:
Instead, the code can be written using the pipe in the order in which the function is evaluated:
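With hypothetical verbs hop(), scoop(), and bop() standing in for the rhyme's actions (these functions don't exist; they just illustrate the styles), the three approaches might look like:

```r
# Step-by-step: each result saved as an object
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)

# Concatenated: reads inside-out
bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)

# Piped: reads in the order of evaluation
foo_foo |>
  hop(through = forest) |>
  scoop(up = field_mice) |>
  bop(on = head)
```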
flights2 <- mutate(flights, speed = distance/(air_time/60))
tail_speed <- select(flights2, tailnum, speed)
tail_speed_grp <- group_by(tail_speed, tailnum)
tail_ave <- summarize(tail_speed_grp, number = n(),
avg_speed = mean(speed, na.rm = TRUE))
arrange(tail_ave, desc(avg_speed))
# A tibble: 4,044 × 3
tailnum number avg_speed
<chr> <int> <dbl>
1 N228UA 1 501.
2 N315AS 1 499.
3 N654UA 1 499.
4 N819AW 1 490.
5 N382HA 26 486.
6 N388HA 36 484.
7 N391HA 21 484.
8 N777UA 1 483.
9 N385HA 28 483.
10 N392HA 13 482.
# ℹ 4,034 more rows
flights |>
mutate(speed = distance / (air_time/60)) |>
select(tailnum, speed) |>
group_by(tailnum) |>
summarize(number = n(),
avg_speed = mean(speed, na.rm = TRUE)) |>
arrange(desc(avg_speed))
# A tibble: 4,044 × 3
tailnum number avg_speed
<chr> <int> <dbl>
1 N228UA 1 501.
2 N315AS 1 499.
3 N654UA 1 499.
4 N819AW 1 490.
5 N382HA 26 486.
6 N388HA 36 484.
7 N391HA 21 484.
8 N777UA 1 483.
9 N385HA 28 483.
10 N392HA 13 482.
# ℹ 4,034 more rows
Form a chain that creates a data frame containing only carrier and each carrier's mean departure delay time. Which carriers have the highest and lowest mean delays?
flights |>
group_by(carrier) |>
summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |>
arrange(desc(avg_delay))
# A tibble: 16 × 2
carrier avg_delay
<chr> <dbl>
1 F9 20.2
2 EV 20.0
3 YV 19.0
4 FL 18.7
5 WN 17.7
6 9E 16.7
7 B6 13.0
8 VX 12.9
9 OO 12.6
10 UA 12.1
11 MQ 10.6
12 DL 9.26
13 AA 8.59
14 AS 5.80
15 HA 4.90
16 US 3.78
Say you’re curious about the relationship between the number of flights that each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.
pivoting
image credit: https://www.garrickadenbuie.com/project/tidyexplain/
pivot_longer() makes the dataframe "longer" – many columns into a few columns (more rows): pivot_longer(data, cols, names_to = , values_to = )
pivot_wider() makes the dataframe "wider" – a few columns into many columns (fewer rows): pivot_wider(data, names_from = , values_from = )
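A toy sketch of both verbs (made-up data, not from the lecture):

```r
library(tidyr)

# Hypothetical wide data: one column per year
scores <- tibble::tibble(student = c("A", "B"),
                         `2023` = c(85, 90),
                         `2024` = c(88, 95))

# Wide -> long: the year columns become rows
long <- pivot_longer(scores, cols = -student,
                     names_to = "year", values_to = "score")

# Long -> wide: undoes the reshaping
wide <- pivot_wider(long, names_from = year, values_from = score)
```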
pivot_longer
pivot_longer will be demonstrated using datasets from Gapminder.
litF represents country, year, and female literacy rate.
library(googlesheets4)
gs4_deauth()
litF <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")
litF
# A tibble: 260 × 38
Adult (15+) literacy rate …¹ `1975` `1976` `1977` `1978` `1979` `1980` `1981`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan NA NA NA NA 4.99 NA NA
2 Albania NA NA NA NA NA NA NA
3 Algeria NA NA NA NA NA NA NA
4 Andorra NA NA NA NA NA NA NA
5 Angola NA NA NA NA NA NA NA
6 Anguilla NA NA NA NA NA NA NA
7 Antigua and Barbuda NA NA NA NA NA NA NA
8 Argentina NA NA NA NA NA 93.6 NA
9 Armenia NA NA NA NA NA NA NA
10 Aruba NA NA NA NA NA NA NA
# ℹ 250 more rows
# ℹ abbreviated name: ¹`Adult (15+) literacy rate (%). Female`
# ℹ 30 more variables: `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
# `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>,
# `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, …
pivot_longer
litF <- litF |>
select(country = starts_with("Adult"), everything()) |>
pivot_longer(cols = -country,
names_to = "year",
values_to = "litRateF") |>
filter(!is.na(litRateF))
litF
# A tibble: 571 × 3
country year litRateF
<chr> <chr> <dbl>
1 Afghanistan 1979 4.99
2 Afghanistan 2011 13
3 Albania 2001 98.3
4 Albania 2008 94.7
5 Albania 2011 95.7
6 Algeria 1987 35.8
7 Algeria 2002 60.1
8 Algeria 2006 63.9
9 Angola 2001 54.2
10 Angola 2011 58.6
# ℹ 561 more rows
pivot_longer
GDP represents country, year, and gdp (in fixed 2000 US$).
GDP <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
GDP <- GDP |>
select(country = starts_with("Income"), everything()) |>
pivot_longer(cols = -country,
names_to = "year",
values_to = "gdp") |>
filter(!is.na(gdp))
GDP
# A tibble: 7,988 × 3
country year gdp
<chr> <chr> <dbl>
1 Albania 1980 1061.
2 Albania 1981 1100.
3 Albania 1982 1111.
4 Albania 1983 1101.
5 Albania 1984 1065.
6 Albania 1985 1060.
7 Albania 1986 1092.
8 Albania 1987 1054.
9 Albania 1988 1014.
10 Albania 1989 1092.
# ℹ 7,978 more rows
pivot_wider
pivot_wider will be demonstrated using the babynames dataset.
# A tibble: 1,924,665 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
7 1880 F Ida 1472 0.0151
8 1880 F Alice 1414 0.0145
9 1880 F Bertha 1320 0.0135
10 1880 F Sarah 1288 0.0132
# ℹ 1,924,655 more rows
pivot_wider
babynames
# A tibble: 1,756,284 × 4
year name F M
<dbl> <chr> <int> <int>
1 1880 Mary 7065 27
2 1880 Anna 2604 12
3 1880 Emma 2003 10
4 1880 Elizabeth 1939 9
5 1880 Minnie 1746 9
6 1880 Margaret 1578 NA
7 1880 Ida 1472 8
8 1880 Alice 1414 NA
9 1880 Bertha 1320 NA
10 1880 Sarah 1288 NA
# ℹ 1,756,274 more rows
pivot_wider
babynames
babynames |>
select(-prop) |>
pivot_wider(names_from = sex, values_from = n) |>
filter(!is.na(F), !is.na(M)) |>
arrange(desc(year), desc(F))
# A tibble: 168,381 × 4
year name F M
<dbl> <chr> <int> <int>
1 2017 Emma 19738 14
2 2017 Olivia 18632 10
3 2017 Ava 15902 12
4 2017 Isabella 15100 12
5 2017 Sophia 14831 17
6 2017 Mia 13437 16
7 2017 Charlotte 12893 7
8 2017 Amelia 11800 11
9 2017 Evelyn 10675 11
10 2017 Abigail 10551 6
# ℹ 168,371 more rows
pivot_wider
babynames
babynames |>
pivot_wider(names_from = sex, values_from = n) |>
mutate(maxcount = pmax(F, M, na.rm = TRUE)) |>
arrange(desc(maxcount))
# A tibble: 1,924,653 × 6
year name prop F M maxcount
<dbl> <chr> <dbl> <int> <int> <int>
1 1947 Linda 0.0548 99686 NA 99686
2 1948 Linda 0.0552 96209 NA 96209
3 1947 James 0.0510 NA 94756 94756
4 1957 Michael 0.0424 NA 92695 92695
5 1947 Robert 0.0493 NA 91642 91642
6 1949 Linda 0.0518 91016 NA 91016
7 1956 Michael 0.0423 NA 90620 90620
8 1958 Michael 0.0420 NA 90520 90520
9 1948 James 0.0497 NA 88588 88588
10 1954 Michael 0.0428 NA 88514 88514
# ℹ 1,924,643 more rows
See the Posit cheatsheets on wrangling & joining and pivoting.
left_join returns all rows from the left table, and any rows with matching keys from the right table.
inner_join returns only the rows in which the left table has matching keys in the right table (i.e., matching rows in both sets).
full_join returns all rows from both tables, matching records where the keys agree and filling in NA otherwise.
Good practice: always specify the by argument when joining data frames.
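A toy sketch of the three joins (made-up tables, mirroring dplyr's band_members example):

```r
library(dplyr)

band <- tibble(name = c("Mick", "John", "Paul"),
               band = c("Stones", "Beatles", "Beatles"))
instrument <- tibble(name = c("John", "Paul", "Keith"),
                     plays = c("guitar", "bass", "guitar"))

left_join(band, instrument, by = "name")   # 3 rows; Mick's plays is NA
inner_join(band, instrument, by = "name")  # 2 rows: John and Paul
full_join(band, instrument, by = "name")   # 4 rows; NAs fill the gaps
```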
10 women in science who changed the world (source: Discover Magazine)1
| name | profession |
|---|---|
| Ada Lovelace | Mathematician |
| Marie Curie | Physicist and Chemist |
| Janaki Ammal | Botanist |
| Chien-Shiung Wu | Physicist |
| Katherine Johnson | Mathematician |
| Rosalind Franklin | Chemist |
| Vera Rubin | Astronomer |
| Gladys West | Mathematician |
| Flossie Wong-Staal | Virologist and Molecular Biologist |
| Jennifer Doudna | Biochemist |
# A tibble: 10 × 2
name profession
<chr> <chr>
1 Ada Lovelace Mathematician
2 Marie Curie Physicist and Chemist
3 Janaki Ammal Botanist
4 Chien-Shiung Wu Physicist
5 Katherine Johnson Mathematician
6 Rosalind Franklin Chemist
7 Vera Rubin Astronomer
8 Gladys West Mathematician
9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna Biochemist
# A tibble: 9 × 2
name known_for
<chr> <chr>
1 Ada Lovelace first computer algorithm
2 Marie Curie theory of radioactivity, discovery of elements polonium a…
3 Janaki Ammal hybrid species, biodiversity protection
4 Chien-Shiung Wu confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson calculations of orbital mechanics critical to sending the …
6 Vera Rubin existence of dark matter
7 Gladys West mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna one of the primary developers of CRISPR, a ground-breaking…
We’d like to put together the data to look like:
# A tibble: 10 × 5
name profession birth_year death_year known_for
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first co…
2 Marie Curie Physicist and Chemist NA NA theory o…
3 Janaki Ammal Botanist 1897 1984 hybrid s…
4 Chien-Shiung Wu Physicist 1912 1997 confim a…
5 Katherine Johnson Mathematician 1918 2020 calculat…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existenc…
8 Gladys West Mathematician 1930 NA mathemat…
9 Flossie Wong-Staal Virologist and Molecular … 1947 2020 first sc…
10 Jennifer Doudna Biochemist 1964 NA one of t…
For the next few slides…
left_join()
# A tibble: 10 × 4
name profession birth_year death_year
<chr> <chr> <dbl> <dbl>
1 Ada Lovelace Mathematician NA NA
2 Marie Curie Physicist and Chemist NA NA
3 Janaki Ammal Botanist 1897 1984
4 Chien-Shiung Wu Physicist 1912 1997
5 Katherine Johnson Mathematician 1918 2020
6 Rosalind Franklin Chemist 1920 1958
7 Vera Rubin Astronomer 1928 2016
8 Gladys West Mathematician 1930 NA
9 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
10 Jennifer Doudna Biochemist 1964 NA
right_join()
# A tibble: 8 × 4
name profession birth_year death_year
<chr> <chr> <dbl> <dbl>
1 Janaki Ammal Botanist 1897 1984
2 Chien-Shiung Wu Physicist 1912 1997
3 Katherine Johnson Mathematician 1918 2020
4 Rosalind Franklin Chemist 1920 1958
5 Vera Rubin Astronomer 1928 2016
6 Gladys West Mathematician 1930 NA
7 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
8 Jennifer Doudna Biochemist 1964 NA
full_join()
# A tibble: 10 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec…
2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioac…
3 Katherine Johnson 1918 2020 calculations of orbital mechanics c…
4 Rosalind Franklin 1920 1958 <NA>
5 Vera Rubin 1928 2016 existence of dark matter
6 Gladys West 1930 NA mathematical modeling of the shape …
7 Flossie Wong-Staal 1947 2020 first scientist to clone HIV and cr…
8 Jennifer Doudna 1964 NA one of the primary developers of CR…
9 Ada Lovelace NA NA first computer algorithm
10 Marie Curie NA NA theory of radioactivity, discovery…
inner_join()
# A tibble: 7 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect…
2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioact…
3 Katherine Johnson 1918 2020 calculations of orbital mechanics cr…
4 Vera Rubin 1928 2016 existence of dark matter
5 Gladys West 1930 NA mathematical modeling of the shape o…
6 Flossie Wong-Staal 1947 2020 first scientist to clone HIV and cre…
7 Jennifer Doudna 1964 NA one of the primary developers of CRI…
semi_join()
anti_join()
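semi_join() keeps the left-table rows that have a match in the right table (without adding its columns); anti_join() keeps the left-table rows that do not. A toy sketch with made-up tables:

```r
library(dplyr)

band <- tibble(name = c("Mick", "John", "Paul"),
               band = c("Stones", "Beatles", "Beatles"))
instrument <- tibble(name = c("John", "Paul", "Keith"),
                     plays = c("guitar", "bass", "guitar"))

semi_join(band, instrument, by = "name")  # John and Paul: they have a match
anti_join(band, instrument, by = "name")  # Mick: no match in instrument
```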
# A tibble: 10 × 5
name profession birth_year death_year known_for
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first co…
2 Marie Curie Physicist and Chemist NA NA theory o…
3 Janaki Ammal Botanist 1897 1984 hybrid s…
4 Chien-Shiung Wu Physicist 1912 1997 confim a…
5 Katherine Johnson Mathematician 1918 2020 calculat…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existenc…
8 Gladys West Mathematician 1930 NA mathemat…
9 Flossie Wong-Staal Virologist and Molecular … 1947 2020 first sc…
10 Jennifer Doudna Biochemist 1964 NA one of t…
litF and GDP from Gapminder.
[1] 571 4
# A tibble: 571 × 4
country year litRateF gdp
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1979 4.99 NA
2 Afghanistan 2011 13 NA
3 Albania 2001 98.3 1282.
4 Albania 2008 94.7 1804.
5 Albania 2011 95.7 1966.
6 Algeria 1987 35.8 1902.
7 Algeria 2002 60.1 1872.
8 Algeria 2006 63.9 2125.
9 Angola 2001 54.2 298.
10 Angola 2011 58.6 630.
# ℹ 561 more rows
[1] 7988 4
# A tibble: 7,988 × 4
country year litRateF gdp
<chr> <chr> <dbl> <dbl>
1 Albania 2001 98.3 1282.
2 Albania 2008 94.7 1804.
3 Albania 2011 95.7 1966.
4 Algeria 1987 35.8 1902.
5 Algeria 2002 60.1 1872.
6 Algeria 2006 63.9 2125.
7 Angola 2001 54.2 298.
8 Angola 2011 58.6 630.
9 Antigua and Barbuda 2001 99.4 9640.
10 Antigua and Barbuda 2011 99.4 9978.
# ℹ 7,978 more rows
[1] 505 4
# A tibble: 505 × 4
country year litRateF gdp
<chr> <chr> <dbl> <dbl>
1 Albania 2001 98.3 1282.
2 Albania 2008 94.7 1804.
3 Albania 2011 95.7 1966.
4 Algeria 1987 35.8 1902.
5 Algeria 2002 60.1 1872.
6 Algeria 2006 63.9 2125.
7 Angola 2001 54.2 298.
8 Angola 2011 58.6 630.
9 Antigua and Barbuda 2001 99.4 9640.
10 Antigua and Barbuda 2011 99.4 9978.
# ℹ 495 more rows
[1] 8054 4
# A tibble: 8,054 × 4
country year litRateF gdp
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1979 4.99 NA
2 Afghanistan 2011 13 NA
3 Albania 2001 98.3 1282.
4 Albania 2008 94.7 1804.
5 Albania 2011 95.7 1966.
6 Algeria 1987 35.8 1902.
7 Algeria 2002 60.1 1872.
8 Algeria 2006 63.9 2125.
9 Angola 2001 54.2 298.
10 Angola 2011 58.6 630.
# ℹ 8,044 more rows
join to merge two datasets
If you ever need to understand which join is the right join for you, try to find an image that lays out what the function is doing. I found one that is quite good, taken from the Statistics Globe blog.
lubridate is another R package meant for data wrangling!
In particular, lubridate makes it very easy to work with days, times, and dates. The basic idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want.
Example from the lubridate vignette.
The lengths of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation: January 31st + one month.
Should the answer be:
A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.
If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
lubridate
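The four output blocks below are consistent with calls along these lines (following the lubridate vignette):

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

jan31 + months(0:11)                  # NA whenever the result is an invalid date
floor_date(jan31, "month") + months(0:11) + days(31)
jan31 + months(0:11) + days(31)
jan31 %m+% months(0:11)               # rolls back to the last day of the month
```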
[1] "2021-01-31" NA "2021-03-31" NA "2021-05-31"
[6] NA "2021-07-31" "2021-08-31" NA "2021-10-31"
[11] NA "2021-12-31"
[1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
[6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
[11] "2021-12-02" "2022-01-01"
[1] "2021-03-03" NA "2021-05-01" NA "2021-07-01"
[6] NA "2021-08-31" "2021-10-01" NA "2021-12-01"
[11] NA "2022-01-31"
[1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
[6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
[11] "2021-11-30" "2021-12-31"
flightsWK <- flights |>
mutate(ymdday = ymd(paste(year, month,day, sep="-"))) |>
mutate(weekdy = wday(ymdday, label=TRUE),
whichweek = week(ymdday))
flightsWK |> select(year, month, day, ymdday, weekdy, whichweek,
dep_time, arr_time, air_time)
# A tibble: 336,776 × 9
year month day ymdday weekdy whichweek dep_time arr_time air_time
<int> <int> <int> <date> <ord> <dbl> <int> <int> <dbl>
1 2013 1 1 2013-01-01 Tue 1 517 830 227
2 2013 1 1 2013-01-01 Tue 1 533 850 227
3 2013 1 1 2013-01-01 Tue 1 542 923 160
4 2013 1 1 2013-01-01 Tue 1 544 1004 183
5 2013 1 1 2013-01-01 Tue 1 554 812 116
6 2013 1 1 2013-01-01 Tue 1 554 740 150
7 2013 1 1 2013-01-01 Tue 1 555 913 158
8 2013 1 1 2013-01-01 Tue 1 557 709 53
9 2013 1 1 2013-01-01 Tue 1 557 838 140
10 2013 1 1 2013-01-01 Tue 1 558 753 138
# ℹ 336,766 more rows
reprex
Help me help you
reproducible example …
Step 1. Copy code onto the clipboard
Step 2. Type reprex() into the Console
Step 3. Look at the Viewer to the right. Copy the Viewer output into GitHub, Piazza, Discord, an email, stackexchange, etc.
reprex demo
reprex(
jan31 + months(0:11) + days(31)
)
multiple lines of code:
reprex({
jan31 <- ymd("2021-01-31")
jan31 + months(0:11) + days(31)
})
reprex({
library(lubridate)
jan31 <- ymd("2021-01-31")
jan31 + months(0:11) + days(31)
})