to go along with
Modern Data Science with R, 3rd edition by Baumer, Kaplan, and Horton
Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani
geom_point()
aes() function
wday.
ggplot() function.
aes() function.
aes() function
=
Bakery should be upper case
type should not be in quotes
starbucks in wrong place

| year | Algeria | Brazil | Colombia |
|---|---|---|---|
| 2000 | 7 | 12 | 16 |
| 2001 | 9 | 14 | 18 |

| country | Y2000 | Y2001 |
|---|---|---|
| Algeria | 7 | 9 |
| Brazil | 12 | 14 |
| Colombia | 16 | 18 |

| country | year | value |
|---|---|---|
| Algeria | 2000 | 7 |
| Algeria | 2001 | 9 |
| Brazil | 2000 | 12 |
| Brazil | 2001 | 14 |
| Colombia | 2000 | 16 |
| Colombia | 2001 | 18 |
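
The three tables above hold the same numbers in different shapes. As a rough sketch of how one shape can be turned into another, the code below rebuilds a two-country subset of the `country | Y2000 | Y2001` table and reshapes it with `tidyr`. The object names `wide`, `long`, and `by_year` are made up for this illustration and are not part of the original questions.

```r
library(tidyr)

# Two-country subset of the "country | Y2000 | Y2001" table above
# (object names are hypothetical, chosen for this sketch)
wide <- data.frame(
  country = c("Algeria", "Brazil"),
  Y2000 = c(7, 12),
  Y2001 = c(9, 14)
)

# One row per country-year pair; names_prefix strips the leading "Y"
long <- pivot_longer(
  wide,
  cols = c(Y2000, Y2001),
  names_to = "year",
  names_prefix = "Y",
  values_to = "value"
)

# One row per year and one column per country
by_year <- pivot_wider(long, names_from = country, values_from = value)

long
by_year
```

Note that `year` comes back as a character column here, because it was built from column names.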
#(a)
starbucks |>
  group_by(type) |>
  summarize(average_fat = mean(fat))
#(b)
group_by(starbucks, type) |>
  summarize(average_fat = mean(fat))
#(c)
group_by(starbucks, type) |>
  summarize(average_fat = sum(fat))
#(d)
temp <- group_by(starbucks, type)
summarize(temp, average_fat = mean(fat))
#(e)
summarize(group_by(starbucks, type),
          average_fat = mean(fat))

filter()
arrange()
select()
mutate()
group_by()

(theme, price)
(theme, year)
(year, price)
(pieces, year)
(pieces, price)

n_distinct(pieces)
n_distinct(price)
sum(pieces)
sum(pages)
mean(pieces)

library(openintro)
lego_sample |>
  filter(!is.na(minifigures)) |>
  # keep only those with minifigures
  group_by(theme, year) |>
  # for each theme for each year
  summarize(ave_pieces = mean(pieces))
# A tibble: 9 × 3
# Groups: theme [3]
theme year ave_pieces
<chr> <dbl> <dbl>
1 City 2018 189.
2 City 2019 257.
3 City 2020 349
4 DUPLO® 2018 50.5
5 DUPLO® 2019 32.5
6 DUPLO® 2020 45.8
7 Friends 2018 354.
8 Friends 2019 259.
9 Friends 2020 250.
# A tibble: 7 × 2
type average_fat
<fct> <dbl>
1 bakery 14.6
2 bistro box 18.4
3 hot breakfast 13.7
4 parfait 6.5
5 petite 9.33
6 salad 0
7 sandwich 14.7
# A tibble: 7 × 2
type average_fat
<fct> <dbl>
1 bakery 14.6
2 bistro box 18.4
3 hot breakfast 13.7
4 parfait 6.5
5 petite 9.33
6 salad 0
7 sandwich 14.7
# A tibble: 7 × 2
type average_fat
<fct> <dbl>
1 bakery 597
2 bistro box 147
3 hot breakfast 110.
4 parfait 19.5
5 petite 84
6 salad 0
7 sandwich 103
# A tibble: 7 × 2
type average_fat
<fct> <dbl>
1 bakery 14.6
2 bistro box 18.4
3 hot breakfast 13.7
4 parfait 6.5
5 petite 9.33
6 salad 0
7 sandwich 14.7
# A tibble: 7 × 2
type average_fat
<fct> <dbl>
1 bakery 14.6
2 bistro box 18.4
3 hot breakfast 13.7
4 parfait 6.5
5 petite 9.33
6 salad 0
7 sandwich 14.7
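
The five outputs above suggest that the piped, nested, and temporary-variable forms are interchangeable, and that only the summary function changes the numbers. A minimal sketch of that comparison follows. It uses a small made-up data frame called `items` as a stand-in for `starbucks`, so the values are invented and it runs without `openintro`.

```r
library(dplyr)

# Hypothetical stand-in for the starbucks data; the values are made up
items <- data.frame(
  type = c("bakery", "bakery", "salad", "sandwich", "sandwich"),
  fat  = c(10, 20, 0, 12, 18)
)

# Piped form, as in (a)
piped <- items |>
  group_by(type) |>
  summarize(average_fat = mean(fat))

# Nested form, as in (e)
nested <- summarize(group_by(items, type), average_fat = mean(fat))

# Same grouping, but sum() instead of mean(), as in (c)
summed <- items |>
  group_by(type) |>
  summarize(average_fat = sum(fat))

all.equal(piped$average_fat, nested$average_fat)  # the two forms agree
summed  # group totals, even though the column is named average_fat
```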
gdp
year
gdpval
country
-country

gdp
year
gdpval
country
-country

gdp
year
gdpval
country
-country

ggplot() code. Which data frame should you use?
pivot_wider() on raw data
pivot_longer() on raw data
# A tibble: 18 × 11
Subject day_0 day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 308 250. 259. 251. 321. 357. 415. 382. 290. 431. 466.
2 309 223. 205. 203. 205. 208. 216. 214. 218. 224. 237.
3 310 199. 194. 234. 233. 229. 220. 235. 256. 261. 248.
4 330 322. 300. 284. 285. 286. 298. 280. 318. 305. 354.
5 331 288. 285 302. 320. 316. 293. 290. 335. 294. 372.
6 332 235. 243. 273. 310. 317. 310 454. 347. 330. 254.
7 333 284. 290. 277. 300. 297. 338. 332. 349. 333. 362.
8 334 265. 276. 243. 255. 279. 284. 306. 332. 336. 377.
9 335 242. 274. 254. 271. 251. 255. 245. 235. 236. 237.
10 337 312. 314. 292. 346. 366. 392. 404. 417. 456. 459.
11 349 236. 230. 239. 255. 251. 270. 282. 308. 336. 352.
12 350 256. 243. 256. 256. 269. 330. 379. 363. 394. 389.
13 351 251. 300. 270. 281. 272. 305. 288. 267. 322. 348.
14 352 222. 298. 327. 347. 349. 353. 354. 360. 376. 389.
15 369 272. 268. 257. 278. 315. 317. 298. 348. 340. 367.
16 370 225. 235. 239. 240. 268. 344. 281. 348. 365. 372.
17 371 270. 272. 278. 282. 279. 285. 259. 305. 351. 369.
18 372 269. 273. 298. 311. 287. 330. 334. 343. 369. 364.
sleep_long <- sleep_wide |>
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               values_to = "reaction_time")
sleep_long
# A tibble: 180 × 3
Subject day reaction_time
<dbl> <chr> <dbl>
1 308 0 250.
2 308 1 259.
3 308 2 251.
4 308 3 321.
5 308 4 357.
6 308 5 415.
7 308 6 382.
8 308 7 290.
9 308 8 431.
10 308 9 466.
# ℹ 170 more rows
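
In the output above, `day` is stored as character (`<chr>`) because it was built from column names. The sketch below shows one way this might be handled, using the `names_transform` argument of `pivot_longer()`, and also reverses the reshaping with `pivot_wider()`. The small `sleep_wide` here is a hypothetical two-subject, three-day cut of the table, with values rounded as printed. The name `sleep_back` is also made up for the example.

```r
library(tidyr)

# Hypothetical two-subject, three-day cut of the wide table (rounded values)
sleep_wide <- data.frame(
  Subject = c(308, 309),
  day_0 = c(250, 223),
  day_1 = c(259, 205),
  day_2 = c(251, 203)
)

# Same call as above, plus names_transform so that day is an integer
sleep_long <- sleep_wide |>
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               names_transform = list(day = as.integer),
               values_to = "reaction_time")

# Reverse the reshaping: one column per day again
sleep_back <- sleep_long |>
  pivot_wider(names_from = day,
              names_prefix = "day_",
              values_from = reaction_time)

sleep_long
sleep_back
```

An integer `day` is usually more convenient for a numeric x-axis in a plot than the character version.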
right_join()?
right_join()?
name
band
plays
plays variable in a full_join()?
NA
NULL
addTen() function. The following output is a result of which map_*() call?
map(c(1,4,7), addTen)
map_dbl(c(1,4,7), addTen)
map_chr(c(1,4,7), addTen)
map_lgl(c(1,4,7), addTen)
[1] "11.000000" "14.000000" "17.000000"

map(c(1, 4, 7), addTen)
map(list(1, 4, 7), addTen)
map(data.frame(a=1, b=4, c=7), addTen)

map(c(1, 4, 7), addTen)
map(c(1, 4, 7), ~addTen(.x))
map(c(1, 4, 7), ~addTen)
map(c(1, 4, 7), function(hi) (hi + 10))
map(c(1, 4, 7), ~(.x + 10))

ifelse() function takes the arguments:
set.seed() function
The group averages for the next few questions:
library(NHANES)
GM <- NHANES |>
  summarize(mean(HHIncomeMid, na.rm = TRUE)) |>
  pull()
NH.means <- NHANES |>
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid)) |>
  group_by(HealthGen) |>
  summarize(IncMean = mean(HHIncomeMid), count = n())
NH.means
# A tibble: 5 × 3
HealthGen IncMean count
<fct> <dbl> <int>
1 Excellent 69354. 817
2 Vgood 65011. 2342
3 Good 55662. 2744
4 Fair 44194. 899
5 Poor 37027. 164
group_by():
group_by() of:
group_by()
c(4, 10, 8, 1, 2, 4)
c(4, 4, 4, 4, 4, 4)
c(4, 10, 8, 1, 2, 4)
c(1, 2, 2, 4, 4, 2)
c(10, 8, 1, 1, 8, 10)
c(1, 2, 4, 3, 4, 10)
kknn method can use any distance measure.
k in k-NN refers to
k groups
k partitions
k neighbors
V in V-fold CV refers to
V groups
V partitions
V neighbors

wherever you are, make sure you are communicating with me when you have questions!
no right answer here!
Yes! All the responses are reasons to make a figure.
aes() function
wday.
aes() function
answers may vary. I’d say c. putting the work in context. Others might say b. facilitating comparison or d. simplifying the story. However, I don’t think a correct answer is a. making the data stand out.
starbucks in wrong place
sum() instead of the mean(). The other commands compute the average fat broken down by type of Starbucks item
filter()
(theme, year)
mean(pieces)
running the different code chunks with relevant output.
-country
year
gdpval (if possible, it is a good idea to name variables something different from the name of the data frame)
pivot_longer() on raw data. The reference to the study is: Gregory Belenky, Nancy J. Wesensten, David R. Thorne, Maria L. Thomas, Helen C. Sing, Daniel P. Redmond, Michael B. Russo and Thomas J. Balkin (2003) Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. Journal of Sleep Research 12, 1–12.
NA (it would be NULL in SQL)
map_chr(c(1,4,7), addTen) because the output is in quotes: the values are strings, not numbers.
map() function allows vectors, lists, and data frames as input.
map(c(1, 4, 7), ~addTen). The ~ acts on functions that do not have their own name or that are defined by function(...). By adding the argument (.x) we’ve expanded the addTen() function, and so it needs a ~. The addTen() function all alone does not use a ~.
we always need d. random sampling / random allocation for appropriate conclusions. The theory is derived from b. normal data. If c.
c(1, 2, 4, 3, 4, 10) because there is no 3 in the original dataset.
step_dummy().
step_dummy(..., one_hot = TRUE).
p. When p=2, Minkowski is the same as Euclidean.
k neighbors
V partitions
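
The Minkowski answer above can be checked numerically. This is a minimal sketch in base R. The helper `minkowski()` and the vectors `x` and `y` are made up for this illustration and are not taken from `kknn` or any other package. It computes the distance as `(sum(|x_i - y_i|^p))^(1/p)`.

```r
# Hypothetical helper: Minkowski distance of order p between two numeric vectors
minkowski <- function(x, y, p) {
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(1, 2, 3)
y <- c(4, 6, 3)

minkowski(x, y, p = 2)    # 5, since sqrt(3^2 + 4^2 + 0^2) = 5
sqrt(sum((x - y)^2))      # 5, the Euclidean distance computed directly
minkowski(x, y, p = 1)    # 7, the Manhattan distance 3 + 4 + 0

# Base R's dist() offers the same measure through method = "minkowski"
as.numeric(dist(rbind(x, y), method = "minkowski", p = 2))
```

Changing `p` changes how much weight large coordinate differences get, which is why it is treated as a tuning choice.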