Clicker Questions

to go along with

Modern Data Science with R, 3rd edition by Baumer, Kaplan, and Horton

Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani

The reason to take random samples is:¹

to make cause and effect conclusions
to get as many variables as possible
it’s easier to collect a large dataset
so that the data are a good representation of the population
I have no idea why one would take a random sample

The reason to allocate/assign explanatory variables is:²

to make cause and effect conclusions
to get as many variables as possible
it’s easier to collect a large dataset
so that the data are a good representation of the population
I have no idea what you mean by “allocate/assign” (or “explanatory variable” for that matter)

Approximately how big is a tweet?³
1. 0.01 KB
2. 0.1 KB
3. 1 KB
4. 100 KB
5. 1000 KB = 1 MB

\(R^2\) measures:⁴

the proportion of variability in vote margin as explained by tweet share.
the proportion of variability in tweet share as explained by vote margin.
how appropriate the linear part of the linear model is.
whether or not particular variables should be included in the model.

R / RStudio / Quarto⁵
1. all good
2. started, progress is slow and steady
3. started, very stuck
4. haven’t started yet
5. what do you mean by “R”?

Git / GitHub⁶
1. all good
2. started, progress is slow and steady
3. started, very stuck
4. haven’t started yet
5. what do you mean by “Git”?

Which of the following includes talking to the remote version of GitHub?⁷
1. changing your name (updating the YAML)
2. committing the file(s)
3. pushing the file(s)
4. some of the above
5. all of the above

What is the error?⁸
1. poor assignment operator
2. unmatched quotes
3. improper syntax for function argument
4. invalid object name
5. no mistake

shup2 <-- "Hello to you!"

What is the error?⁹
1. poor assignment operator
2. unmatched quotes
3. improper syntax for function argument
4. invalid object name
5. no mistake

3shup <-  "Hello to you!"

What is the error?¹⁰
1. poor assignment operator
2. unmatched quotes
3. improper syntax for function argument
4. invalid object name
5. no mistake

shup4 <-  "Hello to you!

What is the error?¹¹
1. poor assignment operator
2. unmatched quotes
3. improper syntax for function argument
4. invalid object name
5. no mistake

shup5 <-  date()

What is the error?¹²
1. poor assignment operator
2. unmatched quotes
3. improper syntax for function argument
4. invalid object name
5. no mistake

shup6 <-  sqrt 10
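One way to check answers to the five "What is the error?" questions is to let R try each line. The sketch below is only an illustration: `runs_ok()` is a hypothetical helper (not part of any package) that reports whether a line both parses and evaluates without error. It does not say which kind of mistake was made; that part is still up to you.

```r
# Each string is one of the assignment attempts from the questions above.
attempts <- c(
  'shup2 <-- "Hello to you!"',
  '3shup <-  "Hello to you!"',
  'shup4 <-  "Hello to you!',
  'shup5 <-  date()',
  'shup6 <-  sqrt 10'
)

# runs_ok() is a hypothetical helper: TRUE only if the code parses AND evaluates.
runs_ok <- function(code) {
  tryCatch({
    eval(parse(text = code), envir = new.env())
    TRUE
  }, error = function(e) FALSE)
}

results <- vapply(attempts, runs_ok, logical(1), USE.NAMES = FALSE)
data.frame(attempt = attempts, runs = results)
```

Evaluating inside `new.env()` keeps the test objects out of your workspace.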

Do you keep a calendar / schedule / planner?¹³
1. Yes
2. No

Do you keep a calendar / schedule / planner? If you answered “Yes” …¹⁴
1. Yes, on Google Calendar
2. Yes, on Calendar for macOS
3. Yes, on Outlook for Windows
4. Yes, in some other app
5. Yes, by hand

Where should I put things I’ve created for the HW (e.g., data, .ics file, etc.)?¹⁵
1. Upload into remote GitHub directory
2. In the local folder which also has the R project
3. In my Downloads
4. Somewhere on my Desktop
5. In my Home directory

The goal of making a figure is…¹⁶
1. To draw attention to your work.
2. To facilitate comparisons.
3. To provide as much information as possible.

A good reason to make a particular choice of a graph is:¹⁷
1. Because the journal / field has particular expectations for how the data are presented.
2. Because some variables naturally fit better on some graphs (e.g., numbers on scatter plots).
3. Because that graphic displays the message you want as optimally as possible.

Why are the points orange?¹⁸
1. R translates “navy” into orange.
2. color must be specified in geom_point()
3. color must be specified outside the aes() function
4. the default plot color is orange

ggplot(data = Births78, 
       aes(x = date, y = births, color = "navy")) + 
  geom_point() +          
  labs(title = "US Births in 1978")

Why are the dots blue and the lines colored?¹⁹
1. dot color is given as “navy”, line color is given as wday.
2. both colors are specified in the ggplot() function.
3. dot coloring takes precedence over line coloring.
4. line coloring takes precedence over dot coloring.

Setting vs. Mapping. If I want the same information to be passed to all data points (not varying with a variable in the data):²⁰
1. map the information inside the aes() function.
2. set the information outside the aes() function
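The difference between mapping and setting can be seen by building the same plot both ways. This is a small sketch that assumes the built-in `mtcars` data as a stand-in for `Births78`; the object names `p_mapped` and `p_set` are made up for the illustration.

```r
library(ggplot2)

# Mapping: inside aes(), "navy" is treated as a data value (a one-level variable),
# so ggplot2 picks a color from its default scale and adds a legend.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = "navy")) +
  geom_point()

# Setting: outside aes(), "navy" is used literally as the color of every point.
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "navy")

p_mapped
p_set
```

Comparing the two plots side by side shows which one actually draws navy points.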

The Snow figure was most successful at:²¹
1. making the data stand out
2. facilitating comparison
3. putting the work in context
4. simplifying the story

The Challenger figure(s) was(were) least successful at:²²
1. making the data stand out
2. facilitating comparison
3. putting the work in context
4. simplifying the story

The biggest difference between Snow and the Challenger was:²³
1. The amount of information portrayed.
2. One was better at displaying cause.
3. One showed the relevant comparison better.
4. One was more artistic.

Caffeine and Calories. What was the biggest concern over the average value axes?²⁴
1. It isn’t at the origin.
2. They should have used all the data possible to find averages.
3. There wasn’t a random sample.
4. There wasn’t a label explaining why the axes were where they were.

What is wrong with the following code?²⁵
1. should only be one =
2. Bakery should be upper case
3. type should not be in quotes
4. use mutate instead of filter
5. starbucks in wrong place

Result <- |> filter(starbucks,
        type == "bakery")

Which data represents the ideal format for ggplot2 and dplyr?²⁶

table a
year	Algeria	Brazil	Colombia
2000	7	12	16
2001	9	14	18

table b
country	Y2000	Y2001
Algeria	7	9
Brazil	12	14
Colombia	16	18

table c
country	year	value
Algeria	2000	7
Algeria	2001	9
Brazil	2000	12
Brazil	2001	14
Colombia	2000	16
Colombia	2001	18

Each of the statements except one will accomplish the same calculation. Which one does not match?²⁷

#(a) 
starbucks |> 
  group_by(type) |> 
  summarize(average_fat = mean(fat))

#(b) 
group_by(starbucks, type) |> 
  summarize(average_fat = mean(fat))

#(c)
group_by(starbucks, type) |> 
  summarize(average_fat = sum(fat))

#(d)
temp <- group_by(starbucks, type)

summarize(temp, average_fat = mean(fat))

#(e)
summarize(group_by(starbucks, type), 
          average_fat = mean(fat))

Fill in Q1.²⁸
1. filter()
2. arrange()
3. select()
4. mutate()
5. group_by()

result <- lego_sample |>
  Q1(!is.na(minifigures)) |> 
  # keep only those with minifigures
  group_by(Q2, Q2) |> 
  summarize(total = Q3)

Fill in Q2.²⁹
1. (theme, price)
2. (theme, year)
3. (year, price)
4. (pieces, year)
5. (pieces, price)

result <- lego_sample |>
  Q1(!is.na(minifigures)) |> 
  group_by(Q2, Q2) |> 
  # for each theme and year
  summarize(total = Q3)

Fill in Q3.³⁰
1. n_distinct(pieces)
2. n_distinct(price)
3. sum(pieces)
4. sum(pages)
5. mean(pieces)

result <- lego_sample |>
  Q1(!is.na(minifigures)) |> 
  group_by(Q2, Q2) |> 
  summarize(ave_pieces = Q3)
  # average number of pieces (each theme, each year)

Running the code.³¹

library(openintro)
lego_sample |>
  filter(!is.na(minifigures)) |> 
  # keep only those with minifigures
  group_by(theme, year) |> 
  # for each theme for each year
  summarize(ave_pieces = mean(pieces))

# A tibble: 9 × 3
# Groups:   theme [3]
  theme    year ave_pieces
  <chr>   <dbl>      <dbl>
1 City     2018      189. 
2 City     2019      257. 
3 City     2020      349  
4 DUPLO®   2018       50.5
5 DUPLO®   2019       32.5
6 DUPLO®   2020       45.8
7 Friends  2018      354. 
8 Friends  2019      259. 
9 Friends  2020      250.

#(a)
starbucks |> 
  group_by(type) |> 
  summarize(average_fat = mean(fat))

# A tibble: 7 × 2
  type          average_fat
  <fct>               <dbl>
1 bakery              14.6 
2 bistro box          18.4 
3 hot breakfast       13.7 
4 parfait              6.5 
5 petite               9.33
6 salad                0   
7 sandwich            14.7

#(b) 
group_by(starbucks, type) |> 
  summarize(average_fat = mean(fat))

# A tibble: 7 × 2
  type          average_fat
  <fct>               <dbl>
1 bakery              14.6 
2 bistro box          18.4 
3 hot breakfast       13.7 
4 parfait              6.5 
5 petite               9.33
6 salad                0   
7 sandwich            14.7

#(c)
group_by(starbucks, type) |> 
  summarize(average_fat = sum(fat))

# A tibble: 7 × 2
  type          average_fat
  <fct>               <dbl>
1 bakery              597  
2 bistro box          147  
3 hot breakfast       110. 
4 parfait              19.5
5 petite               84  
6 salad                 0  
7 sandwich            103

#(d)
temp <- group_by(starbucks, type)

summarize(temp, average_fat = mean(fat))

# A tibble: 7 × 2
  type          average_fat
  <fct>               <dbl>
1 bakery              14.6 
2 bistro box          18.4 
3 hot breakfast       13.7 
4 parfait              6.5 
5 petite               9.33
6 salad                0   
7 sandwich            14.7

#(e)
summarize(group_by(starbucks, type), 
          average_fat = mean(fat))

# A tibble: 7 × 2
  type          average_fat
  <fct>               <dbl>
1 bakery              14.6 
2 bistro box          18.4 
3 hot breakfast       13.7 
4 parfait              6.5 
5 petite               9.33
6 salad                0   
7 sandwich            14.7

Fill in Q1.³²
1. gdp
2. year
3. gdpval
4. country
5. -country

GDP |>  
  select(country = starts_with("Income"), everything()) |> 
       pivot_longer(cols = Q1, 
                    names_to = Q2, 
                    values_to = Q3)

Fill in Q2.³³
1. gdp
2. year
3. gdpval
4. country
5. -country

GDP |>  
  select(country = starts_with("Income"), everything()) |> 
       pivot_longer(cols = Q1, 
                    names_to = Q2, 
                    values_to = Q3)

Fill in Q3.³⁴
1. gdp
2. year
3. gdpval
4. country
5. -country

GDP |>  
  select(country = starts_with("Income"), everything()) |> 
       pivot_longer(cols = Q1, 
                    names_to = Q2, 
                    values_to = Q3)

Response to stimulus (in ms) after only 3 hrs of sleep for 9 days. You want to make a plot with the subject’s reaction time (y-axis) vs the number of days of sleep restriction (x-axis) using the following ggplot() code. Which data frame should you use?³⁵
1. use raw data
2. use pivot_wider() on raw data
3. use pivot_longer() on raw data

ggplot(___, aes(x = ___, y = ___, color = ___)) + 
  geom_line()

# A tibble: 18 × 11
   Subject day_0 day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     308  250.  259.  251.  321.  357.  415.  382.  290.  431.  466.
 2     309  223.  205.  203.  205.  208.  216.  214.  218.  224.  237.
 3     310  199.  194.  234.  233.  229.  220.  235.  256.  261.  248.
 4     330  322.  300.  284.  285.  286.  298.  280.  318.  305.  354.
 5     331  288.  285   302.  320.  316.  293.  290.  335.  294.  372.
 6     332  235.  243.  273.  310.  317.  310   454.  347.  330.  254.
 7     333  284.  290.  277.  300.  297.  338.  332.  349.  333.  362.
 8     334  265.  276.  243.  255.  279.  284.  306.  332.  336.  377.
 9     335  242.  274.  254.  271.  251.  255.  245.  235.  236.  237.
10     337  312.  314.  292.  346.  366.  392.  404.  417.  456.  459.
11     349  236.  230.  239.  255.  251.  270.  282.  308.  336.  352.
12     350  256.  243.  256.  256.  269.  330.  379.  363.  394.  389.
13     351  251.  300.  270.  281.  272.  305.  288.  267.  322.  348.
14     352  222.  298.  327.  347.  349.  353.  354.  360.  376.  389.
15     369  272.  268.  257.  278.  315.  317.  298.  348.  340.  367.
16     370  225.  235.  239.  240.  268.  344.  281.  348.  365.  372.
17     371  270.  272.  278.  282.  279.  285.  259.  305.  351.  369.
18     372  269.  273.  298.  311.  287.  330.  334.  343.  369.  364.

sleep_long <- sleep_wide |>
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               values_to = "reaction_time")

sleep_long

# A tibble: 180 × 3
   Subject day   reaction_time
     <dbl> <chr>         <dbl>
 1     308 0              250.
 2     308 1              259.
 3     308 2              251.
 4     308 3              321.
 5     308 4              357.
 6     308 5              415.
 7     308 6              382.
 8     308 7              290.
 9     308 8              431.
10     308 9              466.
# ℹ 170 more rows
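Note that `day` comes out of `pivot_longer()` as a character column. A possible refinement, sketched here on a tiny illustrative subset (two subjects, three days, values rounded from the table above), is to convert it to an integer during the pivot with `names_transform` so that the x-axis is numeric when plotting.

```r
library(tidyr)
library(ggplot2)

# Illustrative subset only: two subjects, three days, rounded values.
sleep_wide <- data.frame(
  Subject = c(308, 309),
  day_0 = c(250, 223),
  day_1 = c(259, 205),
  day_2 = c(251, 203)
)

sleep_long <- sleep_wide |>
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               names_transform = list(day = as.integer),  # "0" -> 0L, etc.
               values_to = "reaction_time")

# One line per subject; factor() keeps Subject from being read as a continuous color.
p <- ggplot(sleep_long, aes(x = day, y = reaction_time, color = factor(Subject))) +
  geom_line()
p
```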

Consider band members from the Beatles and the Rolling Stones. Who is removed in a right_join()?³⁶

Mick
John
Paul
Keith
Impossible to know

band_members |> 
  right_join(band_instruments, by = "name")

band_members

# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

band_instruments

# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Consider band members from the Beatles and the Rolling Stones. Which variables are removed in a right_join()?³⁷

name
band
plays
none of them

band_members |> 
  right_join(band_instruments, by = "name")

band_members

# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

band_instruments

# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

What happens to Mick’s plays variable in a full_join()?³⁸

Mick is removed
changes to guitar
changes to bass
NA
NULL

band_members |> 
  full_join(band_instruments, by = "name")

band_members

# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

band_instruments

# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar
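The `band_members` and `band_instruments` tibbles ship with dplyr, so the join questions can be checked directly. The sketch below just runs the joins and stores the results; the names `right_result` and `full_result` are chosen for the illustration.

```r
library(dplyr)

# right_join(): keeps every row of the second (right) table.
right_result <- band_members |>
  right_join(band_instruments, by = "name")

# full_join(): keeps every row of both tables; unmatched cells are filled in.
full_result <- band_members |>
  full_join(band_instruments, by = "name")

right_result
full_result
```

Looking at which names and columns appear in each result answers the three questions above.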

Consider the addTen() function. The following output is a result of which map_*() call?³⁹

map(c(1,4,7), addTen)
map_dbl(c(1,4,7), addTen)
map_chr(c(1,4,7), addTen)
map_lgl(c(1,4,7), addTen)

addTen <- function(wow) {
  return(wow + 10)
}

[1] "11.000000" "14.000000" "17.000000"

Which of the following inputs is allowed?⁴⁰
1. map(c(1, 4, 7), addTen)
2. map(list(1, 4, 7), addTen)
3. map(data.frame(a=1, b=4, c=7), addTen)
4. some of the above
5. all of the above

Which of the following produces a different output?⁴¹
1. map(c(1, 4, 7), addTen)
2. map(c(1, 4, 7), ~addTen(.x))
3. map(c(1, 4, 7), ~addTen)
4. map(c(1, 4, 7), function(hi) (hi + 10))
5. map(c(1, 4, 7), ~(.x + 10))
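The `map()` variants can be compared by running them and inspecting the type of what comes back. This is a minimal sketch using the `addTen()` function defined above; the result names are made up for the illustration.

```r
library(purrr)

addTen <- function(wow) {
  return(wow + 10)
}

as_list  <- map(c(1, 4, 7), addTen)                         # always returns a list
as_dbl   <- map_dbl(c(1, 4, 7), addTen)                     # returns a double vector
from_df  <- map(data.frame(a = 1, b = 4, c = 7), addTen)    # iterates over columns
no_call  <- map(c(1, 4, 7), ~addTen)                        # formula never calls addTen()

str(as_list)
str(as_dbl)
str(from_df)
str(no_call)
```

`str()` is used rather than printing because it shows the container type and the element types at a glance.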

What will the following code output?⁴²
1. 3 random normals
2. 6 random normals
3. 18 random normals

input

# A tibble: 3 × 3
      n  mean    sd
  <dbl> <dbl> <dbl>
1     1     1     3
2     2     3     1
3     3    47    10

input |> 
  pmap(rnorm)
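To see how `pmap()` walks across the rows of `input`, it may help to count what it returns. The sketch below rebuilds `input` as a plain data frame (an assumption; the original is a tibble) and checks the length of each list element.

```r
library(purrr)

set.seed(47)

# Same values as the tibble above; column names match rnorm()'s arguments.
input <- data.frame(n = c(1, 2, 3), mean = c(1, 3, 47), sd = c(3, 1, 10))

# One call per row: rnorm(n = 1, mean = 1, sd = 3), rnorm(n = 2, mean = 3, sd = 1), ...
draws <- input |>
  pmap(rnorm)

lengths(draws)       # number of draws produced by each row
sum(lengths(draws))  # total number of random normals
```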

In R the ifelse() function takes the arguments:⁴³

question, yes, no
question, no, yes
statement, yes, no
statement, no, yes
option1, option2, option3

What is the output of the following:⁴⁴
1. “cat”, 30, “cat”, “cat”, 6
2. “cat”, “30”, “cat”, “cat”, “6”
3. 1, “cat”, 5, “cat”, “cat”
4. 1, “cat”, 5, NA, “cat”
5. “1”, “cat”, “5”, NA, “cat”

data <- c(1, 30, 5, NA, 6)

ifelse(data > 5, "cat", data)
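When checking an answer to the `ifelse()` question, it is worth looking at the type of the result as well as its values, because a vector can hold only one type. A minimal sketch (the name `result` is chosen for the illustration):

```r
data <- c(1, 30, 5, NA, 6)

result <- ifelse(data > 5, "cat", data)

result          # the values, element by element
class(result)   # the single type shared by every element
is.na(result)   # where the missing value in `data` ended up
```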

In R, the set.seed() function⁴⁵

makes your computations go faster
keeps track of your computation time
provides an important parameter
repeats the function
makes your results reproducible
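The effect of `set.seed()` is easy to see by drawing random numbers twice. In this sketch the seed value 47 is arbitrary.

```r
set.seed(47)
first_draw <- rnorm(3)

set.seed(47)          # same seed again
second_draw <- rnorm(3)

third_draw <- rnorm(3)  # no reset: continues the random number stream

identical(first_draw, second_draw)
identical(first_draw, third_draw)
```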

If I run a hypothesis test with a type I error cut off of \(\alpha = 0.05\) and the null hypothesis is true, what is the probability of rejecting \(H_0\)?⁴⁶

0.01
0.05
0.1
I don’t know.
No one knows.

If I run a hypothesis test with a type I error cut off of \(\alpha = 0.05\), the null hypothesis is true, and the technical conditions do not hold, what is the probability of rejecting \(H_0\)?⁴⁷

0.01
0.05
0.1
I don’t know.
No one knows.

If I run a hypothesis test with a type I error cut off of \(\alpha = 0.05\) and the null hypothesis is false, what is the probability of rejecting \(H_0\)?⁴⁸

0.01
0.05
0.1
I don’t know.
No one knows.
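The three rejection-probability questions can be explored by simulation. The sketch below assumes a one-sample t-test of \(H_0: \mu = 0\) on normal data (so the technical conditions hold); `reject_rate()` is a hypothetical helper and `delta` is the true mean, with `delta = 0` meaning the null hypothesis is true.

```r
set.seed(47)

# Proportion of simulated datasets in which H0: mu = 0 is rejected at level alpha.
reject_rate <- function(delta, n = 30, reps = 4000, alpha = 0.05) {
  p_values <- replicate(reps, t.test(rnorm(n, mean = delta))$p.value)
  mean(p_values < alpha)
}

rate_null_true    <- reject_rate(delta = 0)    # H0 true
rate_small_effect <- reject_rate(delta = 0.2)  # H0 false, small effect
rate_large_effect <- reject_rate(delta = 0.8)  # H0 false, large effect

c(rate_null_true, rate_small_effect, rate_large_effect)
```

When the null hypothesis is false, the rejection rate depends on how false it is, which is worth keeping in mind for the third question.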

Which is the best conclusion to the following? Changes in the technical conditions change the…⁴⁹
1. choice of test statistic
2. distribution of the test statistic (the sampling distribution)
3. computation of the p-value using standard software
4. hypotheses

If I aim to create a 95% confidence interval, and the technical conditions hold, what is the probability that the CI will contain the true value of the parameter?⁵⁰

0.90
0.95
0.99
I don’t know.
No one knows.

If I aim to create a 95% confidence interval, and the technical conditions do not hold, what is the probability that the CI will contain the true value of the parameter?⁵¹

0.90
0.95
0.99
I don’t know.
No one knows.
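Coverage can also be estimated by simulation. This sketch assumes a t-interval for a mean with n = 10, once with normal data and once with strongly skewed (exponential) data; `coverage()` is a hypothetical helper, and the exponential case is only one example of the technical conditions failing.

```r
set.seed(47)

# Proportion of simulated 95% t-intervals that contain the true mean.
coverage <- function(rdist, true_mean, n = 10, reps = 4000) {
  covered <- replicate(reps, {
    ci <- t.test(rdist(n))$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(covered)
}

cover_normal <- coverage(function(n) rnorm(n, mean = 1), true_mean = 1)
cover_skewed <- coverage(function(n) rexp(n, rate = 1), true_mean = 1)

c(normal = cover_normal, skewed = cover_skewed)
```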

We typically compare means (across two groups) instead of medians because:⁵²

we don’t know the SE of the difference of medians
means are inherently more interesting than medians
permutation tests don’t work with medians
the Central Limit Theorem doesn’t apply for medians.

What are the technical conditions for a t-test?⁵³

none
normal data
\(n \geq 30\)
random sampling / random allocation for appropriate conclusions

What are the technical conditions for permutation tests?⁵⁴

none
normal data
\(n \geq 30\)
random sampling / random allocation for appropriate conclusions

Follow up to permutation test: the technical conditions change based on whether the statistic used is the mean, median, proportion, etc.⁵⁵

TRUE
FALSE

Why care about the distribution of the test statistic?⁵⁶

Better estimator
So we can find rejection region
So we can control power
Because we love the CLT

Given statistic T = r(X), how do we find a (sensible) test?⁵⁷

Maximize power
Minimize type I error
Control type I error
Minimize type II error
Control type II error

The group averages for the next few questions:

library(NHANES)
GM <- NHANES  |> summarize(mean(HHIncomeMid, na.rm=TRUE))  |> pull()

NH.means <- NHANES  |> 
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())
NH.means

# A tibble: 5 × 3
  HealthGen IncMean count
  <fct>       <dbl> <int>
1 Excellent  69354.   817
2 Vgood      65011.  2342
3 Good       55662.  2744
4 Fair       44194.   899
5 Poor       37027.   164

The following code calculates which part of the test statistic?⁵⁸
1. \(\overline{X}\)
2. \((\overline{X}_{i\cdot} - \overline{X})\)
3. \((\overline{X}_{i\cdot} - \overline{X})^2\)
4. \(n_i\)
5. \(n_i \cdot (\overline{X}_{i\cdot} - \overline{X})^2\)

NH.means  |> select(IncMean)  |> pull() - GM

[1]  12148.175   7804.504  -1543.816 -13012.622 -20178.731

The following code calculates which part of the test statistic?⁵⁹
1. \(\overline{X}\)
2. \((\overline{X}_{i\cdot} - \overline{X})\)
3. \((\overline{X}_{i\cdot} - \overline{X})^2\)
4. \(n_i\)
5. \(n_i \cdot (\overline{X}_{i\cdot} - \overline{X})^2\)

(NH.means  |> select(IncMean)  |> pull() - GM)^2

[1] 147578150  60910286   2383368 169328332 407181201

The following code calculates which part of the test statistic?⁶⁰
1. \(\overline{X}\)
2. \((\overline{X}_{i\cdot} - \overline{X})\)
3. \((\overline{X}_{i\cdot} - \overline{X})^2\)
4. \(n_i\)
5. \(n_i \cdot (\overline{X}_{i\cdot} - \overline{X})^2\)

NH.means  |> select(count)  |> pull()

[1]  817 2342 2744  899  164

The following code calculates which part of the test statistic?⁶¹
1. \(\overline{X}\)
2. \((\overline{X}_{i\cdot} - \overline{X})\)
3. \((\overline{X}_{i\cdot} - \overline{X})^2\)
4. \(n_i\)
5. \(n_i \cdot (\overline{X}_{i\cdot} - \overline{X})^2\)

NH.means  |> select(count)  |> pull() * 
  (NH.means  |> select(IncMean)  |> pull() - GM)^2

[1] 120571348234 142651889943   6539963000 152226170649  66777716928
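The pieces computed above are the building blocks of the one-way ANOVA F statistic, \(F = \frac{\sum_i n_i (\overline{X}_{i\cdot} - \overline{X})^2 / (k-1)}{\sum_i \sum_j (X_{ij} - \overline{X}_{i\cdot})^2 / (N-k)}\). As a sketch of how they fit together, the code below assembles F by hand on the built-in `PlantGrowth` data (a stand-in, not the NHANES data) and compares it with `anova()`.

```r
# Group means, group sizes, and grand mean for weight by treatment group.
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_sizes <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
grand_mean  <- mean(PlantGrowth$weight)

k <- length(group_means)   # number of groups
N <- nrow(PlantGrowth)     # total number of observations

# Between-group sum of squares: sum of n_i * (group mean - grand mean)^2.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: ave() returns each observation's own group mean.
ss_within <- sum((PlantGrowth$weight -
                  ave(PlantGrowth$weight, PlantGrowth$group))^2)

F_by_hand <- (ss_between / (k - 1)) / (ss_within / (N - k))
F_anova   <- anova(lm(weight ~ group, data = PlantGrowth))[["F value"]][1]

c(by_hand = F_by_hand, anova = F_anova)
```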

Type I error is⁶²

We give him a raise when he deserves it.
We don’t give him a raise when he deserves it.
We give him a raise when he doesn’t deserve it.
We don’t give him a raise when he doesn’t deserve it.

Type II error is⁶³

We give him a raise when he deserves it.
We don’t give him a raise when he deserves it.
We give him a raise when he doesn’t deserve it.
We don’t give him a raise when he doesn’t deserve it.

Power is the probability that:⁶⁴

We give him a raise when he deserves it.
We don’t give him a raise when he deserves it.
We give him a raise when he doesn’t deserve it.
We don’t give him a raise when he doesn’t deserve it.

Why don’t we always reject \(H_0\)?⁶⁵
1. type I error too high
2. type II error too high
3. level of sig too high
4. power too high

The player is more worried about⁶⁶
1. A type I error
2. A type II error

The coach is more worried about⁶⁷
1. A type I error
2. A type II error

Increasing your sample size⁶⁸
1. Increases your power
2. Decreases your power

Making your significance level more stringent (\(\alpha\) smaller)⁶⁹

Increases your power
Decreases your power

A more extreme alternative⁷⁰
1. Increases your power
2. Decreases your power

For the MacNell study, how should the data be permuted to address the question about perceived gender?⁷¹
1. Permute the identity variable (perceived gender)
2. Permute the gender variable (actual gender)
3. Permute the gender variable after grouping by the identity variable
4. Permute the identity variable after grouping by the gender variable

In order to “Permute the identity variable after grouping by the gender variable” we should group_by():⁷²
1. the identity variable (perceived gender)
2. the gender variable (actual gender)
3. both the gender variable and the identity variable
4. neither the identity variable nor the gender variable

In order to create a null sampling distribution, we need to calculate the statistic (difference in average score), which requires a group_by() of:⁷³
1. the identity variable (perceived gender)
2. the gender variable (actual gender)
3. both the gender variable and the permuted identity variable
4. the permuted identity variable
5. we don’t need to use group_by()
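The mechanics of permuting one variable within the levels of another can be sketched with `group_by()` and `mutate()`. The data frame below is entirely hypothetical (it is not the MacNell data), and the column names `gender`, `identity`, and `score` are assumptions made for the illustration.

```r
library(dplyr)

set.seed(4747)

# Hypothetical ratings: `gender` is the instructor's actual gender,
# `identity` is the gender the students were told.
ratings <- data.frame(
  gender   = rep(c("female", "male"), each = 6),
  identity = rep(rep(c("female", "male"), each = 3), times = 2),
  score    = c(4, 3, 4, 5, 4, 5, 3, 4, 3, 5, 5, 4)
)

# Shuffle identity separately within each actual-gender group.
permuted <- ratings |>
  group_by(gender) |>
  mutate(perm_identity = sample(identity)) |>
  ungroup()

# One permuted statistic: difference in average score (male minus female label).
perm_diff <- permuted |>
  group_by(perm_identity) |>
  summarize(ave_score = mean(score)) |>
  summarize(diff_score = diff(ave_score)) |>
  pull(diff_score)

perm_diff
```

Repeating the last two steps many times would give one way to build a null sampling distribution.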

What is the primary reason to use a permutation test (instead of a test built on calculus)?⁷⁴

more power
lower type I error
more resistant to outliers
can be done on statistics with unknown sampling distributions

What is the primary reason to bootstrap a CI (instead of creating a CI from calculus)?⁷⁵

larger coverage probabilities
narrower intervals
more resistant to outliers
can be done on statistics with unknown sampling distributions

Which of the following could not possibly be a bootstrap sample from the vector: c(4, 10, 8, 1, 2, 4)⁷⁶
1. c(4, 4, 4, 4, 4, 4)
2. c(4, 10, 8, 1, 2, 4)
3. c(1, 2, 2, 4, 4, 2)
4. c(10, 8, 1, 1, 8, 10)
5. c(1, 2, 4, 3, 4, 10)
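A bootstrap sample is drawn from the original values with replacement and has the same length as the original. The sketch below encodes those two requirements as a check on the five candidates; the name `possible` is chosen for the illustration.

```r
original <- c(4, 10, 8, 1, 2, 4)

candidates <- list(
  c(4, 4, 4, 4, 4, 4),
  c(4, 10, 8, 1, 2, 4),
  c(1, 2, 2, 4, 4, 2),
  c(10, 8, 1, 1, 8, 10),
  c(1, 2, 4, 3, 4, 10)
)

# A candidate is possible if it has the right length and uses only original values.
possible <- vapply(
  candidates,
  function(s) length(s) == length(original) && all(s %in% original),
  logical(1)
)
possible

# Drawing an actual bootstrap sample:
one_resample <- sample(original, replace = TRUE)
one_resample
```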

You have a sample of size n = 50. You sample with replacement 1000 times to get 1000 bootstrap samples. What is the sample size of each bootstrap sample?⁷⁷

50
1000

You have a sample of size n = 50. You sample with replacement 1000 times to get 1000 bootstrap samples. How many bootstrap statistics will you have?⁷⁸

50
1000

The bootstrap distribution is centered around the⁷⁹

population parameter
sample statistic
bootstrap statistic
bootstrap parameter
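The sizes and the center of a bootstrap distribution can be checked on simulated data. In this sketch the original sample of n = 50 is drawn from an exponential distribution purely for illustration, and the statistic is the mean.

```r
set.seed(47)

x <- rexp(50, rate = 1 / 10)   # hypothetical original sample, n = 50
B <- 1000                      # number of bootstrap samples

# Each bootstrap sample has length(x) values; each yields one bootstrap statistic.
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

length(boot_means)   # how many bootstrap statistics
mean(x)              # the sample statistic
mean(boot_means)     # the center of the bootstrap distribution
```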

In \(B\) different random samples, how many values for \(\overline{X}\) do you have? How many values for \(SE(\overline{X}) = s / \sqrt{n}\) do you have?⁸⁰
1. 1, 1
2. 1, B
3. B, 1
4. B, B

In \(B\) bootstrap re-samples, how many values for \(\hat{\theta}^*(b)\) do you have? How many values for \(\hat{SE}^*\) do you have?⁸¹
1. 1, 1
2. 1, B
3. B, 1
4. B, B

What is wrong with using \(\frac{\hat{\theta}^*(b) - \hat{\theta}}{\hat{SE}^*}\) to estimate \(\frac{\hat{\theta} - \theta}{SE(\hat{\theta})}?\)⁸²
1. bootstrap version is too variable
2. bootstrap version is not variable enough
3. bootstrap version is biased
4. bootstrap version measures the wrong thing entirely

Using our bootstrap information, we can estimate the \(\alpha/2\) percentile of the distribution of \(\frac{\hat{\theta}-\theta}{SE(\hat{\theta})}\) as \(\hat{t}^*_{\alpha/2}\) using:⁸³
1. \(\frac{\# \bigg\{\frac{\hat{\theta}-\theta}{SE(\hat{\theta})} \leq \ \hat{t}^*_{\alpha/2} \bigg\} }{B} = \alpha/2\)
2. \(\frac{\# \bigg\{\frac{\hat{\theta}-\theta}{SE(\hat{\theta})} \leq \ \hat{t}^*_{\alpha/2} \bigg\} }{B} = \alpha\)
3. \(\frac{\# \bigg\{\frac{\hat{\theta}^*(b)-\hat{\theta}}{\hat{SE}^*(b)} \ \leq \hat{t}^*_{\alpha/2} \bigg\} }{B} = \alpha/2\)
4. \(\frac{\# \bigg\{\frac{\hat{\theta}^*(b)-\hat{\theta}}{\hat{SE}^*(b)} \ \leq \hat{t}^*_{\alpha/2} \bigg\} }{B} = \alpha\)

Is the bootstrap sampling distribution of \(\frac{\hat{\theta}^*(b)-\hat{\theta}}{\hat{SE}^*(b)}\) symmetric? That is, does \(\hat{t}^*_{\alpha/2} = -\hat{t}^*_{1 - \alpha/2}\)?⁸⁴
1. TRUE
2. FALSE
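The bootstrap-t quantities above can be sketched for the special case where \(\hat{\theta}\) is a sample mean, so that \(\hat{SE}^*(b) = s^*_b / \sqrt{n}\) has a formula. That choice is an assumption for the illustration; for other statistics \(\hat{SE}^*(b)\) would itself need to be estimated, for example with a second level of resampling. The data are simulated.

```r
set.seed(47)

x <- rexp(30, rate = 1)        # hypothetical skewed sample
n <- length(x)
theta_hat <- mean(x)
se_hat <- sd(x) / sqrt(n)

B <- 2000
# One studentized value per bootstrap re-sample, using that re-sample's own SE.
t_star <- replicate(B, {
  x_star <- sample(x, replace = TRUE)
  (mean(x_star) - theta_hat) / (sd(x_star) / sqrt(n))
})

alpha <- 0.05
t_lo <- unname(quantile(t_star, alpha / 2))
t_hi <- unname(quantile(t_star, 1 - alpha / 2))

# Bootstrap-t interval: the upper quantile sets the lower endpoint and vice versa.
boot_t_ci <- c(theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat)

c(t_lo = t_lo, t_hi = t_hi)
boot_t_ci
```

Comparing `t_lo` with `-t_hi` is one way to look at the symmetry question.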

95% CI for the difference in proportions:⁸⁵
1. (0.15, 0.173)
2. (0.025, 0.975)
3. (0.72, 0.87)
4. (0.70, 0.873)
5. (0.12, 0.179)

https://www.lock5stat.com/StatKey/bootstrap_2_cat/bootstrap_2_cat.html

Suppose a 95% bootstrap CI for the difference in trimmed means was (3, 9). Would you reject \(H_0\)?⁸⁶ (uh… What is the null hypothesis here???)

yes
no
not enough information to know

Suppose \(H_a\) is TRUE. Consider 100 CIs (for the true difference in means, where each of the 100 CIs is created using a different dataset). The power of the test can be approximated by:⁸⁷

The proportion that contain the true difference in means.
The proportion that do not contain the true difference in means.
The proportion that contain zero.
The proportion that do not contain zero.

Which of the following best describes feature engineering?⁸⁸

The process of choosing the best machine learning algorithm for a dataset
The process of designing new variables or transforming existing ones to improve model performance
The process of tuning hyperparameters for a model
The process of splitting the data into training and testing sets

You have a categorical variable color with three levels: “red”, “blue”, and “green”. With dummy coding (with a reference level), how many new variables are created?⁸⁹

You have a categorical variable color with three levels: “red”, “blue”, and “green”. With one-hot encoding (without a reference level), how many new variables are created?⁹⁰
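Both encodings can be produced with base R's `model.matrix()`, which makes it possible to count the indicator columns directly. This is a sketch with a made-up four-value `color` vector; note that the default (reference-level) coding also includes an intercept column, which is not one of the new indicator variables.

```r
color <- factor(c("red", "blue", "green", "blue"))

# Reference-level (dummy) coding: an intercept plus indicators for non-reference levels.
dummy <- model.matrix(~ color)

# No reference level: dropping the intercept gives one indicator per level.
one_hot <- model.matrix(~ color - 1)

colnames(dummy)
colnames(one_hot)
```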

Which of the following best describes data leakage in feature engineering?⁹¹

Using too many features in the model
Including features that are highly correlated
Using information from the test set when creating features or training the model
Not standardizing numerical variables before training

Small k in k-NN will help reduce the risk of overfitting.⁹²
1. TRUE
2. FALSE

The training error for 1-NN classifier is zero.⁹³
1. TRUE
2. FALSE
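Training error for different values of k can be checked empirically. The sketch below uses `knn()` from the class package (not kknn, which is mentioned elsewhere in these questions) on hypothetical data whose labels are pure noise, and predicts the training points themselves.

```r
library(class)  # provides knn()

set.seed(47)

# Hypothetical training data: two continuous predictors, labels assigned at random.
x_train <- matrix(rnorm(200), ncol = 2)
y_train <- factor(sample(c("a", "b"), size = 100, replace = TRUE))

# Predict the training points themselves with k = 1 and k = 15.
pred_k1  <- knn(train = x_train, test = x_train, cl = y_train, k = 1)
pred_k15 <- knn(train = x_train, test = x_train, cl = y_train, k = 15)

train_error_k1  <- mean(pred_k1 != y_train)
train_error_k15 <- mean(pred_k15 != y_train)

c(k1 = train_error_k1, k15 = train_error_k15)
```

Because the labels here carry no signal, a very low training error says something about flexibility rather than about predictive skill.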

Generally, the k-NN algorithm can take any distance measure.⁹⁴

TRUE
FALSE

In R, the kknn method can use any distance measure.⁹⁵

TRUE
FALSE

The k in k-NN refers to⁹⁶
1. k groups
2. k partitions
3. k neighbors

The V in V-fold CV refers to⁹⁷
1. V groups
2. V partitions
3. V neighbors
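V-fold cross validation can be written out by hand in a few lines. This sketch assumes the built-in `mtcars` data, V = 4, and a simple linear model of `mpg` on `wt`; all of those choices are for illustration only.

```r
set.seed(47)

V <- 4
# Randomly assign each row to one of V folds of (nearly) equal size.
fold <- sample(rep(seq_len(V), length.out = nrow(mtcars)))
table(fold)

# For each fold: fit on the other folds, then predict the held-out fold.
cv_mse <- vapply(seq_len(V), function(v) {
  fit <- lm(mpg ~ wt, data = mtcars[fold != v, ])
  held_out <- mtcars[fold == v, ]
  mean((held_out$mpg - predict(fit, newdata = held_out))^2)
}, numeric(1))

cv_mse        # one error estimate per fold
mean(cv_mse)  # cross-validated MSE
```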

All of the following are TRUE for the use of CART except for:⁹⁸

Can deal with missing data
Require the technical conditions of statistical models
Variable selection is automatic
Produce rules that are easy to interpret and implement

Simplifying the decision tree by pruning peripheral branches will cause overfitting.⁹⁹

TRUE
FALSE

All are true with regards to PRUNING except:¹⁰⁰

Multiple (sequential) trees are possible to create by pruning
CART lets tree grow to full extent, then prunes it
Pruning generates successively smaller trees by pruning leaves
Pruning is only beneficial when purity improvement is statistically significant

Regression trees are invariant to monotonic transformations of:¹⁰¹

the explanatory (predictor) variables
the response variable
both types of variables
none of the variables

CART suffers from¹⁰²
1. high variance
2. high bias

Bagging uses bootstrapping on:¹⁰³
1. the variables
2. the observations
3. both
4. neither

oob samples¹⁰⁴

are in the test data
are in the training data and provide independent predictions
are in the training data but do not provide independent predictions

oob samples are great because¹⁰⁵

oob is “boo” spelled backwards
oob samples allow for independent predictions
oob samples allow for more predictions than a “test group”
oob data frame is always bigger than the test sample data frame
some of the above
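How many observations end up out of bag for a single tree can be estimated by resampling row indices. The sketch assumes n = 80 training observations (as in a later question) and 500 bootstrap samples; no trees are actually fit.

```r
set.seed(47)

n <- 80
rows <- seq_len(n)

# For each bootstrap sample, the fraction of rows that were never drawn (out of bag).
oob_fraction <- replicate(500, {
  in_bag <- sample(rows, replace = TRUE)
  length(setdiff(rows, in_bag)) / n
})

mean(oob_fraction)   # simulated average OOB fraction
(1 - 1 / n)^n        # probability a given row is left out of one bootstrap sample
```

Each observation is out of bag for only some of the trees, so its OOB prediction uses just those trees.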

Bagging is random forests with:¹⁰⁶

m = # predictor variables
all the observations
the most important predictor variables isolated
cross validation to choose m

We have 80 training observations and 20 test observations. To get the test MSE, we need¹⁰⁷

20 predictions from all trees
20 predictions from oob trees
80 predictions from all trees
80 predictions from oob trees

With random forests, the value for m is chosen¹⁰⁸

using OOB error rate
as p/3
as sqrt(p)
using cross validation

A tuning parameter:¹⁰⁹

makes the model fit the training data as well as possible.
makes the model fit the test data as well as possible.
allows for a good model that does not overfit the data.

With binary response and X1 and X2 continuous, kNN (k=1) creates a linear decision boundary.¹¹⁰

TRUE
FALSE

With binary response and X1 and X2 continuous, a classification tree with one split creates a linear decision boundary.¹¹¹

TRUE
FALSE

With binary response and X1 and X2 continuous, a classification tree with one split creates the best linear decision boundary.¹¹²

TRUE
FALSE

If the data are linearly separable, there exists a “widest street” in an SVM classifier.¹¹³

yes
no
up to a constant
with the alpha values “tuned” appropriately

In the case of linearly separable data, the SVM:¹¹⁴

has a tuning parameter of \(\alpha\)
has a tuning parameter of dimension
has no tuning parameters

Linear models are similar to SVM with linearly separable data in that they optimize the model instead of tuning it.¹¹⁵
1. TRUE
2. FALSE

If the data have a complex boundary, the value of gamma in RBF kernel should be:¹¹⁶
1. Big
2. Small

If the data have a simple boundary, the value of gamma in RBF kernel should be:¹¹⁷
1. Big
2. Small
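To build intuition for gamma, it can help to compute the RBF kernel directly. The sketch assumes the common parameterization \(K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)\); some software parameterizes the kernel differently, so check the documentation of the package you use. `rbf_kernel()` is a hypothetical helper.

```r
# RBF kernel under the assumed parameterization exp(-gamma * squared distance).
rbf_kernel <- function(x, z, gamma) {
  exp(-gamma * sum((x - z)^2))
}

x <- c(0, 0)
z <- c(1, 1)   # squared distance from x is 2

k_small_gamma <- rbf_kernel(x, z, gamma = 0.1)
k_big_gamma   <- rbf_kernel(x, z, gamma = 10)

c(small_gamma = k_small_gamma, big_gamma = k_big_gamma)
```

The faster the kernel value drops with distance, the more each training point influences only its immediate neighborhood.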

If I am using all features of my dataset and I achieve 100% accuracy on my training set, but ~70% on the cross validation set, what should I look out for?¹¹⁸
1. Underfitting
2. Nothing, the model is perfect
3. Overfitting

For a large value of C in an SVM classifier, the model is expected to¹¹⁹
1. overfit the training data more as compared to a small C
2. overfit the training data less as compared to a small C
3. not related to overfitting the data

Suppose you have trained an SVM with a linear decision boundary. You correctly infer that your trained SVM model is underfitting. Which of the following should you consider?¹²⁰
1. increase number of observations
2. decrease number of observations
3. calculate more variables
4. reduce the number of features

Suppose you have trained an SVM with a linear decision boundary. You correctly infer that your trained SVM model is underfitting. Suppose you gave the correct answer in the previous question. What do you think is actually happening (when you take that action)?¹²¹

i. We are lowering the bias
ii. We are lowering the variance
iii. We are increasing the bias
iv. We are increasing the variance
1. i. and ii.
2. i. and iii.
3. i. and iv.
4. i. and iv.

Suppose you are using SVM with polynomial kernel of degree 2. Your model perfectly predicts! That is, training and testing accuracy are both 100%. You increase the complexity (degree of polynomial). What will happen?¹²²

Increasing the complexity will overfit the data (increase variance)
Increasing the complexity will underfit the data (increase bias)
Nothing will happen since your model was already 100% accurate
None of these

Building on the previous question, after increasing the complexity, you found that training accuracy was still 100%. What do you think is the reason?¹²³
1. Since data are fixed and we are fitting more polynomial terms, the algorithm starts memorizing everything in the data
2. Since data are fixed, SVM doesn’t need to search in high dimensional space
3. 1. and 2.
4. none of these

The cost parameter in the SVM means:¹²⁴
1. The number of cross-validations to be made
2. The kernel to be used
3. The trade-off between misclassification and simplicity of the model
4. None of the above

Suppose you have trained an SVM classifier with an RBF kernel, and it learned the following decision boundary on the training set. You suspect that the SVM is underfitting your dataset. What should you try?¹²⁵
1. decrease C and/or decrease gamma
2. decrease C and/or increase gamma
3. increase C and/or decrease gamma
4. increase C and/or increase gamma

Suppose you have trained an SVM classifier with an RBF kernel, and it learned the following decision boundary on the training set. When you measure the SVM’s performance on a cross validation set, it does poorly. What should you try?¹²⁶
1. decrease C and/or decrease gamma
2. decrease C and/or increase gamma
3. increase C and/or decrease gamma
4. increase C and/or increase gamma
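
To build some intuition for what gamma does in the two RBF questions above, here is a minimal sketch in base R. It assumes the common parameterization \(K(x, z) = \exp(-\gamma \|x - z\|^2)\); the helper `rbf_kernel()` and the toy points are made up for illustration and are not part of any package.

```r
# Radial basis function (RBF) kernel: similarity decays with squared distance.
# Assumption: the parameterization exp(-gamma * ||x - z||^2).
# `rbf_kernel` is a hypothetical helper, not a function from any package.
rbf_kernel <- function(x, z, gamma) {
  exp(-gamma * sum((x - z)^2))
}

x <- c(0, 0)
z <- c(1, 1)   # squared distance from x is 2

gammas <- c(0.01, 0.1, 1, 10)
similarity <- sapply(gammas, function(g) rbf_kernel(x, z, g))

# Larger gamma -> similarity drops off faster -> each training point
# influences a smaller neighborhood, so the boundary can be more flexible.
data.frame(gamma = gammas, similarity = similarity)
```

With a small gamma the two points still look similar to the kernel; with a large gamma they look unrelated, which is one way to think about why gamma acts as a complexity knob.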

Cross validation will guarantee that the model does not overfit.¹²⁷
1. TRUE
2. FALSE

The biggest problem with missing data is the resulting small sample size.¹²⁸
1. TRUE
2. FALSE

Which statement is not true about cluster analysis?¹²⁹
1. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
2. Cluster analysis is a type of unsupervised learning.
3. Groups or clusters are suggested by the data, not defined a priori.
4. Cluster analysis is a technique for analyzing data when the response variable is categorical and the predictor variables are continuous in nature.

A _____ or tree graph is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distance at which clusters were joined.¹³⁰
1. dendrogram
2. scatterplot
3. scree plot
4. segment plot

_____ is a clustering procedure characterized by the development of a tree-like structure.¹³¹
1. Partitioning clustering
2. Hierarchical clustering
3. Divisive clustering
4. Agglomerative clustering

_____ is a clustering procedure where all objects start out in one giant cluster. Clusters are formed by dividing this cluster into smaller and smaller clusters.¹³²
1. Non-hierarchical clustering
2. Hierarchical clustering
3. Divisive clustering
4. Agglomerative clustering

The _____ method uses information on all pairs of distances, not merely the minimum or maximum distances.¹³³
1. single linkage
2. medium linkage
3. complete linkage
4. average linkage
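
The linkage rules listed above can be compared on a toy example. This is only a sketch in base R; the two clusters (`cluster_a`, `cluster_b`) are made-up values on a line, and "medium linkage" is left out.

```r
# Two small, made-up clusters on a line (hypothetical toy data).
cluster_a <- c(1, 2)
cluster_b <- c(5, 9)

# All pairwise distances between a point in A and a point in B.
pair_dist <- abs(outer(cluster_a, cluster_b, "-"))

linkage <- c(
  single   = min(pair_dist),   # distance between the closest pair
  complete = max(pair_dist),   # distance between the farthest pair
  average  = mean(pair_dist)   # mean over every A-to-B pair
)
linkage
```

In `hclust()`, the same choice is made through the `method` argument (for example `method = "single"`, `"complete"`, or `"average"`).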

Hierarchical clustering is deterministic, but k-means clustering is not.¹³⁴
1. TRUE
2. FALSE
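
One way to probe the statement above is to rerun each method on the same data. The sketch below uses base R only; the simulated data and the seeds are arbitrary choices for illustration.

```r
set.seed(47)  # only used to simulate the toy data
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

# Hierarchical clustering has no random step, so repeated runs agree exactly.
h1 <- cutree(hclust(dist(x), method = "average"), k = 2)
h2 <- cutree(hclust(dist(x), method = "average"), k = 2)
identical(h1, h2)

# kmeans() draws its starting centers at random, so results can depend on
# the seed; fixing the seed makes a particular run reproducible.
set.seed(1); km1 <- kmeans(x, centers = 2)
set.seed(1); km2 <- kmeans(x, centers = 2)
identical(km1$cluster, km2$cluster)
```

Trying `kmeans()` with several different seeds (or a larger `nstart`) is a quick check of how sensitive a particular clustering is to the starting centers.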

k-means is a clustering procedure referred to as ________.¹³⁵
1. Partitioning clustering
2. Hierarchical clustering
3. Divisive clustering
4. Agglomerative clustering

One method of assessing reliability and validity of clustering is to use different methods of clustering and compare the results.¹³⁶
1. TRUE
2. FALSE

The choice of k, the number of clusters to partition a set of data into,…¹³⁷
1. is a personal choice that shouldn’t be discussed in public
2. depends on why you are clustering the data
3. should always be as large as your computer system can handle
4. has a maximum of 10

Which of the following is required by k-means clustering?¹³⁸
1. defined distance metric
2. number of clusters
3. initial guess as to cluster centroids
4. all of the above
5. some of the above

For which of the following tasks might clustering be a suitable approach?¹³⁹
1. Given sales data from many products in a supermarket, estimate future sales for each of these products.
2. Given a database of information about your users, automatically group them into different market segments.
3. From the user’s usage patterns on a website, identify different user groups.
4. Given historical weather records, predict if tomorrow’s weather will be sunny or rainy.

k-means is an iterative algorithm, and two of the following steps are repeatedly carried out. Which two?¹⁴⁰

i. Assign each point to its nearest cluster
ii. Test on the cross-validation set
iii. Update the cluster centroids based on the current assignment
iv. Use the elbow method to choose k
1. i & ii
2. i & iii
3. i & iv
4. ii & iii
5. ii & iv
6. iii & iv

Footnotes

1. so that the data are a good representation of the population
1. to make cause and effect conclusions
1. about 0.1Kb. Turns out that 3.5 billion tweets * 0.1Kb = 350Gb (0.35 Tb). My laptop is pretty good, and it has 36 Gb of memory (RAM) and 4 Tb of storage. It would not be able to work with 3.5 billion tweets.
1. the proportion of variability in vote margin as explained by tweet share.
wherever you are, make sure you are communicating with me when you have questions!
wherever you are, make sure you are communicating with me when you have questions!
1. pushing the file(s)
1. poor assignment operator
1. invalid object name
1. unmatched quotes
1. no mistake
1. improper syntax for a function argument
1. I mean, the right answer has to be Yes, right!??!
no right answer here!
1. In the local folder which also has the R project. It could be on the Desktop or the Home directory, but it must be in the same place as the R project. Do not upload files to the remote GitHub directory or you will find yourself with two different copies of the files.
Yes! All the responses are reasons to make a figure.
1. Because that graphic displays the message you want as optimally as possible.
1. color must be specified outside the aes() function
1. dot color is specified as “navy”, line color is specified as wday.
1. set the information outside the aes() function
answers may vary. I’d say c. putting the work in context. Others might say b. facilitating comparison or d. simplifying the story. However, I don’t think a correct answer is a. making the data stand out.
1. making the data stand out
1. One showed the relevant comparison better.
1. It isn’t at the origin, in combination with d. There wasn’t a label explaining why the axes were where they were. The story associated with the average value axes is not clear to the reader.
1. starbucks in wrong place
1. Table c is best because the columns allow us to work with each of the variables separately.
1. does something different because it takes the sum() instead of the mean(). The other commands compute the average fat broken down by type of Starbucks item
1. filter()
1. (theme, year)
1. mean(pieces)
running the different code chunks with relevant output.
1. -country
1. year
1. gdpval (if possible, good idea to name variables something different from the name of the data frame)
1. use pivot_longer() on raw data. The reference to the study is: Gregory Belenky, Nancy J. Wesensten, David R. Thorne, Maria L. Thomas, Helen C. Sing, Daniel P. Redmond, Michael B. Russo and Thomas J. Balkin (2003) Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. Journal of Sleep Research 12, 1–12.
1. Mick
1. none of them (the default is to retain all the variables)
1. NA (it would be NULL in SQL)
1. map_chr(c(1,4,7), addTen). Because the output is in quotes, the values are strings, not numbers.
1. all of the above. The map() function allows vectors, lists, and data frames as input.
1. map(c(1, 4, 7), ~addTen). The ~ acts on functions that do not have their own name or that are defined by function(...). By adding the argument (.x) we’ve expanded the addTen() function, and so it needs a ~. The addTen() function all alone does not use a ~.
1. 6 random normals (1 with mean 1, sd 3; 2 with mean 3, sd 1; 3 with mean 47, sd 10)
1. question, yes, no
1. “1”, “cat”, “5”, NA, “cat” (Note that the numbers were converted to character strings!)
1. makes your results reproducible
1. 0.05 If the null hypothesis is true and the technical conditions hold, then we should reject the null hypothesis \(\alpha \cdot 100\)% of the time.
1. No one knows. It totally depends on how and how much the technical conditions are violated and how resistant the test is to the technical conditions.
1. No one knows. It totally depends on the degree to which the null hypothesis is false.
1. distribution of the test statistic (the sampling distribution)
1. 0.95 If the technical conditions hold, 95% of all confidence intervals should contain the true parameter.
1. No one knows. If the technical conditions do not hold, the CI may or may not contain the true value of the parameter at the given confidence level (i.e., 95%).
1. the Central Limit Theorem doesn’t apply for medians.
we always need d. random sampling / random allocation for appropriate conclusions. The theory is derived from b. normal data. If c. \(n \geq 30\), then the theory holds really well, regardless of whether the data are normal.
1. random sampling / random allocation for appropriate conclusions
1. FALSE
1. So we can find rejection region
1. Control type I error
1. \((\overline{X}_{i\cdot} - \overline{X})\)
1. \((\overline{X}_{i\cdot} - \overline{X})^2\)
1. \(n_i\)
1. \(n_i \cdot (\overline{X}_{i\cdot} - \overline{X})^2\)
1. We give him a raise when he doesn’t deserve it.
1. We don’t give him a raise when he deserves it.
1. We give him a raise when he deserves it.
1. type I error too high
1. A type II error
1. A type I error
1. Increases your power
1. Decreases your power
1. Increases your power
1. Permute the identity variable after grouping by the gender variable
1. the gender variable (actual gender)
1. the permuted identity variable
1. can be done on statistics with unknown sampling distributions
1. can be done on statistics with unknown sampling distributions
1. c(1, 2, 4, 3, 4, 10) because there is no 3 in the original dataset.
1. 50
1. 1000
1. sample statistic
1. B, B
1. B, 1
1. bootstrap version is not variable enough
1. \(\frac{\# \{\frac{\hat{\theta}^*(b)-\hat{\theta}}{\widehat{SE}^*(b)} \leq \hat{t}^*_{\alpha/2} \} }{B} = \alpha/2\)
1. FALSE. There is no reason to assume it would necessarily be symmetric.
1. (0.12, 0.179)
1. yes (because the interval for the true difference in population trimmed means does not overlap zero.)
1. The proportion that do not contain zero.
1. The process of designing new variables or transforming existing ones to improve model performance
1. 2 Let’s say “red” is the reference level, then there are two binary variables (blue: yes/no and green: yes/no). Dummy variables are used in models like linear and logistic regression where there is a reference value and you want to be able to interpret the coefficients. Use step_dummy().
1. 3 There are three binary variables (red: yes/no; blue: yes/no; green: yes/no). One-hot encoding is used in models where there is no reference variable and you want all categories to be represented. Use step_dummy(..., one_hot = TRUE).
1. Using information from the test set when creating features or training the model
1. FALSE
1. TRUE
1. TRUE
1. FALSE, it uses Minkowski(p) distance, with a user-specified choice of p. When p=2, Minkowski is the same as Euclidean.
1. k neighbors
1. V partitions
1. Require the technical conditions of statistical models
1. FALSE. If you don’t prune, you will overfit.
1. Pruning is only beneficial when purity improvement is statistically significant (we don’t do hypothesis testing on trees)
1. the explanatory (predictor) variables
1. high variance
1. the observations
1. are in the training data and provide independent predictions
1. some of the above (both of b. and c. are great!) The oob data frame is exactly the same size as the training data, and it may or may not be bigger than the test data.
1. m = # predictor variables
1. 20 predictions from all trees
1. using cross validation (but is also often used as p/3 or sqrt(p))
1. allows for a good model that does not overfit the data.
1. FALSE
1. TRUE
1. FALSE (what is “best” ????)
1. yes
1. has no tuning parameters
1. TRUE
1. Big
1. Small
1. Overfitting
1. overfit the training data more than a small C would (because the act of misclassifying is heavily penalized)
1. calculate more variables (use feature engineering to see if you can get more information out of the variables)
1. i. and iv. The model is too simple (i.e., biased), so we need more information to make it more complex. If we make the model more complex it will have lower bias but higher variance.
1. Increasing the complexity will overfit the data (increase variance). Even though you already perfectly fit the data, the model could potentially draw boundaries that were even more singular, thus increasing the variance and producing a worse model.
1. 1. Effectively. The polynomial boundary becomes more wiggly as the degree of the polynomial increases.
1. The trade-off between misclassification and simplicity of the model (In the SVM derivation, we think of it as the trade-off between the misclassifications and the width of the street. But that width determines how complicated the model is.)
1. increase C (to discourage misclassifications) and/or increase gamma (to encourage more complicated models)
1. decrease C (to allow more misclassifications) and/or decrease gamma (to make a simpler model)
1. FALSE. Nothing in statistics (or life) is guaranteed. Tuning parameters, however, does help the model avoid overfitting as much as possible.
1. FALSE. The biggest problem with missing data is that missingness is almost always non-random, so the missing values are systematically different from the non-missing data. That makes conclusions difficult.
1. Cluster analysis is a technique for analyzing data when the response variable is categorical and the predictor variables are continuous in nature.
1. dendrogram
1. Hierarchical clustering
1. Divisive clustering
1. average linkage
1. TRUE (k-means starts with a random set of centers)
1. Partitioning clustering
1. TRUE
1. depends on why you are clustering the data
1. all of the above
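1. From the user’s usage patterns on a website, identify different user groups. Or maybe b. In both, there is no pre-defined response variable.
1. i & iii (assign each point to its nearest cluster; update the cluster centroids based on the current assignment)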
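
The two repeated k-means steps from the last answer (assign, then update) can be written out as a short loop. This is a bare-bones sketch in base R, not how `kmeans()` is implemented; the data `x` and the starting `centers` are made up, and the loop assumes no cluster ever ends up empty.

```r
# A bare-bones k-means loop on made-up 1-D data (hypothetical values).
x <- c(1, 2, 3, 10, 11, 12)
centers <- c(1, 2)   # deliberately poor starting guess, k = 2

repeat {
  # Assignment step: send each point to its nearest center.
  dists <- abs(outer(x, centers, "-"))
  assignment <- apply(dists, 1, which.min)

  # Update step: move each center to the mean of its assigned points.
  # (Assumes every cluster keeps at least one point.)
  new_centers <- as.numeric(tapply(x, assignment, mean))

  if (isTRUE(all.equal(new_centers, centers))) break  # nothing moved: stop
  centers <- new_centers
}

centers
assignment
```

Neither cross-validation nor the elbow method appears inside the loop; choosing k happens outside of it, by running the algorithm for several values of k.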