Simulating

September 8 - 17, 2025

Jo Hardin

Agenda 9/8/25

  1. Simulating to understand bias-variance trade-off
  2. map()

Bias-Variance Trade-off

What is variance? What is bias?

Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training set.

Bias refers to the error that is introduced by approximating the “truth” by a model which is too simple.

Mathematically

We can measure how well a model does by looking at how close the fit of the model (\(\hat{f}\)) is to the observed data (\(y\)).

\[MSE = E\bigg[(Y - \hat{f}(x))^2\bigg]\]

Because \(Y = f(x) + \epsilon\), we know that:

\[\begin{eqnarray*} MSE &=& E\bigg[(Y - \hat{f}(x))^2\bigg]\\ &=& E\bigg[(f(x) + \epsilon - \hat{f}(x))^2\bigg]\\ &=& E\bigg[(f(x) - \hat{f}(x))^2\bigg] + 2 E\bigg[(f(x) - \hat{f}(x)) \cdot \epsilon \bigg] + E[\epsilon^2] \end{eqnarray*}\]

Term 3: The third term is \(\sigma^2\), the irreducible error (the part that even the perfect model can’t predict).

Term 2: Turns out to be zero because the error \(\epsilon\) of the new observation is independent of \(\hat{f}(x)\) and has mean zero, so the expectation factors into a product of expectations and the term vanishes.

Term 1: Takes a little bit of work to expand (something you’d do in, say, Math 152).

Simplified Term 1

\[E\bigg[(f(x) - \hat{f}(x))^2\bigg] = \bigg( f(x) - E[\hat{f}(x)] \bigg)^2 + E\bigg[ (E[\hat{f}(x)] - \hat{f}(x))^2 \bigg]\]

Where:

\(\big[ \mbox{ Bias of } \hat{f}(x) \big]^2 = \bigg( f(x) - E[\hat{f}(x)] \bigg)^2\)

\(\mbox{Variance of } \hat{f}(x) = E\bigg[ (E[\hat{f}(x)] - \hat{f}(x))^2 \bigg]\)
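
For reference, the expansion adds and subtracts \(E[\hat{f}(x)]\):

\[\begin{eqnarray*} E\bigg[(f(x) - \hat{f}(x))^2\bigg] &=& E\bigg[\Big( \big(f(x) - E[\hat{f}(x)]\big) + \big(E[\hat{f}(x)] - \hat{f}(x)\big) \Big)^2\bigg]\\ &=& \big(f(x) - E[\hat{f}(x)]\big)^2 + 2\big(f(x) - E[\hat{f}(x)]\big)\, E\big[E[\hat{f}(x)] - \hat{f}(x)\big] + E\bigg[\big(E[\hat{f}(x)] - \hat{f}(x)\big)^2\bigg]\\ &=& \big(f(x) - E[\hat{f}(x)]\big)^2 + E\bigg[\big(E[\hat{f}(x)] - \hat{f}(x)\big)^2\bigg] \end{eqnarray*}\]

where the cross term vanishes because \(f(x) - E[\hat{f}(x)]\) is a constant and \(E\big[E[\hat{f}(x)] - \hat{f}(x)\big] = 0\).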

MSE

\[\begin{eqnarray*} MSE &=& E\bigg[(Y - \hat{f}(x))^2\bigg]\\ &=& \bigg[ \mbox{ Bias of } \hat{f}(x) \bigg]^2 + \mbox{ Variance of } \hat{f}(x) + \sigma^2 \end{eqnarray*}\]

Simulation

Let’s consider a model: \(Y = f(x) + \epsilon\)

library(tidyverse)   # tibble, dplyr, tidyr, purrr, ggplot2

num_x <- length(seq(0, 10, by = 0.5))
set.seed(4774)
bias_var_data <- tibble(ex = seq(0, 10, by = 0.5),
                        eps = rnorm(num_x, mean = 0, sd = 5),
                        why = ex^2 + eps) |> 
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) + I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10) )$fitted) |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction") |> 
  mutate(fit = factor(fit, 
                        levels = c("underfit", "good fit", "overfit") ))

bias_var_data |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point() + 
  geom_line(aes(y = prediction)) + 
  facet_grid(~ fit)

Errors

Which is worse? Underfitting or overfitting? How can we measure the error?

bias_var_data |> 
  group_by(fit) |> 
  summarize(MSE_OG = mean((why - prediction)^2))
# A tibble: 3 × 2
  fit      MSE_OG
  <fct>     <dbl>
1 underfit   74.8
2 good fit   21.9
3 overfit    10.5

Fit to new data

WAIT! We don’t want the error on the data that built the model, we want the error on the population.

set.seed(470)
new_data <- tibble(ex = seq(0, 10, by = 0.5),
                   eps = rnorm(num_x, mean = 0, sd = 5),
                   why = ex^2 + eps)

bias_var_data |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(data = new_data, aes(x = ex, y = why)) + 
  geom_line(aes(y = prediction)) + 
  facet_grid(~ fit)

lm1 <- bias_var_data |> 
  filter(fit == "underfit") |> 
  lm(why ~ ex, data = _)

lm2 <- bias_var_data |> 
  filter(fit == "good fit") |> 
  lm(why ~ ex + I(ex^2), data = _)

lm3 <- bias_var_data |> 
  filter(fit == "overfit") |> 
  lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) + 
       I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10), data = _)

new_data |> 
  mutate(underfit = predict(lm1, newdata = new_data),
         `good fit` = predict(lm2, newdata = new_data),
         overfit = predict(lm3, newdata = new_data)) |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction") |> 
  mutate(fit = factor(fit, 
                        levels = c("underfit", "good fit", "overfit") )) |> 
  group_by(fit) |> 
  summarize(MSE_newdata = mean((why - prediction)^2))
# A tibble: 3 × 2
  fit      MSE_newdata
  <fct>          <dbl>
1 underfit        98.6
2 good fit        27.3
3 overfit         35.8

Back to bias and variance

set.seed(47)
datasets <- 1:9 |> 
  map_dfr(\(i) {
    tibble(
      ex  = seq(0, 10, by = 0.5),
      eps = rnorm(num_x, mean = 0, sd = 5),
      why = ex^2 + eps,
      dataset = i
    )
  }) |> 
  group_by(dataset) |> 
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) + I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10) )$fitted) |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction") |> 
  mutate(fit = factor(fit, 
                        levels = c("underfit", "good fit", "overfit") ))

datasets
datasets |> 
  filter(fit == "underfit") |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(size = 0.25) + 
  geom_line(aes(y = prediction)) + 
  facet_wrap(~ dataset)
datasets |> 
  filter(fit == "good fit") |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(size = 0.25) + 
  geom_line(aes(y = prediction)) + 
  facet_wrap(~ dataset)
datasets |> 
  filter(fit == "overfit") |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(size = 0.25) + 
  geom_line(aes(y = prediction)) + 
  facet_wrap(~ dataset)

What we know about bias-variance trade-off

  • The underfit model looks the same for every dataset! That is, the model has low variance.

  • The underfit model doesn’t fit the data very well. The model has high bias.

  • The overfit model is very flexible, and looks like it fits the data very well. It has low bias.

  • The fitted models look quite different across the different datasets. The overfit model has high variance. (See the sketch below for a numerical check of these claims.)
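
To put numbers on these claims, here is a minimal sketch (not from the original slides; the names n_sims and bv_sims are new) that estimates the squared bias and the variance of each fit at every value of \(x\) by simulating many training sets from the same model:

set.seed(47)
n_sims <- 500   # number of simulated training sets

bv_sims <- 1:n_sims |> 
  map_dfr(\(i) {
    tibble(
      ex  = seq(0, 10, by = 0.5),
      eps = rnorm(num_x, mean = 0, sd = 5),
      why = ex^2 + eps,
      dataset = i
    )
  }) |> 
  group_by(dataset) |> 
  mutate(underfit   = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ poly(ex, 2))$fitted,    # same fit as why ~ ex + I(ex^2)
         overfit    = lm(why ~ poly(ex, 10))$fitted) |>  # degree-10 polynomial (orthogonal basis), analogous to the overfit model above
  ungroup() |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction")

# squared bias and variance of f-hat at each x, then averaged over x
bv_sims |> 
  group_by(fit, ex) |> 
  summarize(bias2    = (mean(prediction) - unique(ex)^2)^2,   # (E[f-hat(x)] - f(x))^2
            variance = var(prediction),
            .groups = "drop") |> 
  group_by(fit) |> 
  summarize(avg_bias2 = mean(bias2), avg_variance = mean(variance))

The underfit model should show a large average squared bias with a small variance, and the pattern should reverse for the overfit model.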

Agenda 9/15/25

  1. Monte Carlo sampling
  2. Simulating to estimate probabilities
  3. Simulating to consider statistical inference with fewer conditions.

Simulating: the big picture.

Why?

Three simulation methods are used for different purposes:

  1. Monte Carlo methods - use repeated sampling from a population with known characteristics.

  2. Randomization / Permutation methods - use shuffling (sampling without replacement from a sample) to test hypotheses of “no effect”.

  3. Bootstrap methods - use resampling (sampling with replacement from a sample) to establish confidence intervals.

Monte Carlo sample from a population to…

  • …simulate to understand complicated models.
  • …simulate as easier than calculus for estimating probabilities.
  • …simulate to consider statistical inference with fewer conditions.

Goals of simulating complicated models

The goal of simulating a complicated model is not only to create a program which will provide the desired results. We also hope to be able to write code such that:

  1. The problem is broken down into small pieces.
  2. The problem has checks in it to see what works (run the lines inside the functions!).
  3. The code stays simple (as often as possible).

Examples

  • bias-variance trade-off
  • see class notes for more examples

Aside: a few R functions (ifelse())

data.frame(value = c(-2:2)) |>
  mutate(abs_value = ifelse(value >=0, value, -value))  # abs val
  value abs_value
1    -2         2
2    -1         1
3     0         0
4     1         1
5     2         2

Aside: a few R functions (case_when())

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium", 
    TRUE ~ "inexpensive"))
# A tibble: 20 × 5
   carat cut       color price price_cat  
   <dbl> <ord>     <ord> <int> <chr>      
 1  1.23 Very Good F     10276 expensive  
 2  0.35 Premium   H       706 inexpensive
 3  0.7  Good      E      2782 medium     
 4  0.4  Ideal     D      1637 medium     
 5  0.53 Ideal     G      1255 inexpensive
 6  2.22 Ideal     G     14637 expensive  
 7  0.3  Ideal     G       878 inexpensive
 8  1.05 Ideal     H      4223 medium     
 9  0.53 Premium   E      1654 medium     
10  1.7  Ideal     H      7068 medium     
11  0.31 Good      E       698 inexpensive
12  0.31 Ideal     F       840 inexpensive
13  1.03 Ideal     H      4900 medium     
14  0.31 Premium   G       698 inexpensive
15  1.56 Premium   G      8858 medium     
16  1.71 Premium   G     11032 expensive  
17  1    Good      E      5345 medium     
18  1.86 Ideal     J     10312 expensive  
19  1.08 Very Good E      3726 medium     
20  0.31 Premium   E       698 inexpensive

Aside: a few R functions (sample_n())

We took a random sample of 20 rows from the diamonds dataset.

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium", 
    TRUE ~ "inexpensive"))
# A tibble: 20 × 5
   carat cut       color price price_cat  
   <dbl> <ord>     <ord> <int> <chr>      
 1  1.23 Very Good F     10276 expensive  
 2  0.35 Premium   H       706 inexpensive
 3  0.7  Good      E      2782 medium     
 4  0.4  Ideal     D      1637 medium     
 5  0.53 Ideal     G      1255 inexpensive
 6  2.22 Ideal     G     14637 expensive  
 7  0.3  Ideal     G       878 inexpensive
 8  1.05 Ideal     H      4223 medium     
 9  0.53 Premium   E      1654 medium     
10  1.7  Ideal     H      7068 medium     
11  0.31 Good      E       698 inexpensive
12  0.31 Ideal     F       840 inexpensive
13  1.03 Ideal     H      4900 medium     
14  0.31 Premium   G       698 inexpensive
15  1.56 Premium   G      8858 medium     
16  1.71 Premium   G     11032 expensive  
17  1    Good      E      5345 medium     
18  1.86 Ideal     J     10312 expensive  
19  1.08 Very Good E      3726 medium     
20  0.31 Premium   E       698 inexpensive

Aside: a few R functions (sample())

sampling, shuffling, and resampling: sample()

set.seed(47)
alph <- letters[1:10]
alph
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
sample(alph, 5, replace = FALSE) # sample without replacement (from a population)
[1] "i" "b" "g" "d" "a"
sample(alph, 5, replace = TRUE) # sample with replacement (from a population)
[1] "j" "g" "f" "i" "f"
sample(alph, 10, replace = FALSE)  # shuffle
 [1] "f" "h" "i" "e" "g" "d" "c" "j" "b" "a"
sample(alph, 10, replace = TRUE)  # resample
 [1] "e" "j" "e" "b" "e" "c" "f" "a" "e" "a"

Aside: a few R functions (set.seed())

What if we want to be able to generate the same random numbers (here on the interval from 0 to 1) over and over?

set.seed(4747)
runif(4, 0, 1) # random uniform numbers
[1] 0.1949071 0.3390270 0.5147919 0.4516470
set.seed(123)
runif(4, 0, 1) # random uniform numbers
[1] 0.2875775 0.7883051 0.4089769 0.8830174
set.seed(4747)
runif(4, 0, 1) # random uniform numbers
[1] 0.1949071 0.3390270 0.5147919 0.4516470

Aside: a few R functions (n() and n_distinct())

n() counts the number of rows. n_distinct() counts the number of distinct values (here, the number of distinct destinations).

library(nycflights13)

flights |> 
  select(origin, dest) |> 
  group_by(origin) |> 
  summarize(num_flt = n(), num_dest =  n_distinct(dest))
# A tibble: 3 × 3
  origin num_flt num_dest
  <chr>    <int>    <int>
1 EWR     120835       86
2 JFK     111279       70
3 LGA     104662       68

Monte Carlo sample from a population to…

  • …simulate to understand complicated models.
  • …simulate as easier than calculus for estimating probabilities.
  • …simulate to consider statistical inference with fewer conditions.

Small simulation example to calculate a probability

Sally & Joan

Consider a situation where Sally and Joan plan to meet to study in their college campus center (Mosteller 1987; Baumer, Kaplan, and Horton 2021). They are both impatient people who will wait only 10 minutes for the other before leaving.

But their planning was incomplete. Sally said, “Meet me between 7 and 8 tonight at the student center.” When should Joan plan to arrive at the campus center? And what is the probability that they actually meet?

Simulate their meeting

Assume that Sally and Joan are both equally likely to arrive at the campus center anywhere between 7pm and 8pm.

library(tictoc)
n <- 10000

tic()
meet_vect <- tibble(
  sally = runif(n, min = 0, max = 60),
  joan = runif(n, min = 0, max = 60),
  result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )
)
toc()
0.002 sec elapsed
meet_func <- function(nada){  # the argument is ignored; map() supplies the iteration index
  tibble(
    sally = runif(1, min = 0, max = 60),
    joan = runif(1, min = 0, max = 60),
    result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )
  )
}
tic()
meet_map <- 1:n |> 
  map(meet_func) |> 
  list_rbind()
toc()
2.208 sec elapsed
tic()
meet_for <- data.frame()
for(i in 1:n){
  meet_for <- meet_for |> rbind(
    tibble(
      sally = runif(1, min = 0, max = 60),
      joan = runif(1, min = 0, max = 60),
      result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )))
}
toc()
3.749 sec elapsed

Results

The three approaches give equivalent results; the values differ only because of the randomness in each simulation. The vectorized version is fastest because it generates all of the draws at once, while the for loop is slowest because it grows the data frame one row at a time.

meet_vect |> 
  group_by(result) |> 
  summarize(number = n()) |> 
  mutate(proportion = number / sum(number))
# A tibble: 2 × 3
  result      number proportion
  <chr>        <int>      <dbl>
1 They do not   6966      0.697
2 They meet     3034      0.303
meet_map |> 
  group_by(result) |> 
  summarize(number = n())|> 
  mutate(proportion = number / sum(number))
# A tibble: 2 × 3
  result      number proportion
  <chr>        <int>      <dbl>
1 They do not   7087      0.709
2 They meet     2913      0.291
meet_for |> 
  group_by(result) |> 
  summarize(number = n())|> 
  mutate(proportion = number / sum(number))
# A tibble: 2 × 3
  result      number proportion
  <chr>        <int>      <dbl>
1 They do not   6942      0.694
2 They meet     3058      0.306

Visualizing the meet up

meet_map |> 
  ggplot(aes(x = joan, y = sally, color = result)) +
  geom_point(alpha = 0.3) + 
  geom_abline(intercept = 10, slope = 1) + 
  geom_abline(intercept = -10, slope = 1) + 
  scale_color_brewer(palette = "Paired")
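
For comparison, the exact probability comes from the area between the two lines in the plot: inside the \(60 \times 60\) square, the region where \(|\mbox{sally} - \mbox{joan}| \le 10\) has area \(60^2 - 50^2 = 1100\), so the probability is \(1100/3600 = 11/36 \approx 0.306\), in line with the simulated proportions above.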

Simulation best practices

  • no magic numbers
  • comment your code
  • use informative names
  • set a seed for reproducibility
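
As a small illustration (a sketch, not from the original slides), here is the meeting simulation rewritten with named constants instead of magic numbers, a seed, and informative names:

set.seed(4747)                    # for reproducibility
n_sims       <- 10000             # number of simulated evenings
window_start <- 0                 # 7pm, measured in minutes after 7
window_end   <- 60                # 8pm
wait_minutes <- 10                # how long each person will wait

meetings <- tibble(
  sally_arrival = runif(n_sims, min = window_start, max = window_end),
  joan_arrival  = runif(n_sims, min = window_start, max = window_end),
  they_meet     = abs(sally_arrival - joan_arrival) <= wait_minutes
)

mean(meetings$they_meet)          # estimated probability that they meet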

Approach to simulating

  • break down the problem into small steps
  • check to see what works
  • simple is better

Examples

  • see class notes for more probability examples
    • pass the pigs
    • blackjack
    • roulette

Agenda 9/17/25

  1. Simulating to consider statistical inference – changes in technical conditions / assumptions.

Sensitivity of inferential conditions

  • Simulation to understand complicated models.
  • Simulation as easier than calculus for estimating probabilities.
  • Simulation to consider statistical inference with fewer conditions.

t-test with exponential data

The t-test says that if the null hypothesis is true, we’d expect to reject it \(\alpha \cdot 100\) percent of the time. That is, if \(\alpha = 0.05\), we’d reject true null hypotheses 5% of the time.

The guarantee of 5% (across many tests) happens only when the data are normal (or reasonably normal).

Non-normal data

data.frame(exp_data = rexp(n = 100000, rate = 20)) |> 
  ggplot(aes(x = exp_data)) +
  geom_histogram()

t-test function

t.test.pval <- function(data1, data2){
  t.test(data1, data2) |> # does the t-test
    broom::tidy() |>      # modifies the output to be a data frame
    select(p.value)       # selects only the p-value from the output
}

t.test.pval(rexp(n = 10, rate = 20), rexp(n = 7, rate = 20))
# A tibble: 1 × 1
  p.value
    <dbl>
1   0.160

t-test 10 times

map(1:10, ~t.test.pval(rexp(n = 10, rate = 20),
                       rexp(n = 7, rate = 20))) |> 
          list_rbind()
# A tibble: 10 × 1
   p.value
     <dbl>
 1  0.145 
 2  0.324 
 3  0.414 
 4  0.584 
 5  0.385 
 6  0.782 
 7  0.127 
 8  0.0762
 9  0.733 
10  0.305 

t-test many times

set.seed(4747)

results <- map(1:10000, 
               ~t.test.pval(rexp(n = 10, rate = 20),
                            rexp(n = 7, rate = 20))) |> 
  list_rbind() 
results |> 
  ggplot(aes(x = p.value)) + 
  geom_histogram(breaks = seq(0, 1, by=0.05)) + 
  labs(x = "p-values")

results |> 
  summarize(type1 = mean(p.value < 0.05))
# A tibble: 1 × 1
   type1
   <dbl>
1 0.0377
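
For a baseline comparison (a sketch, not in the original slides), running the same simulation with normal data should give a rejection rate close to the nominal 5%:

set.seed(4747)
results_normal <- map(1:10000, 
                      ~t.test.pval(rnorm(n = 10, mean = 0, sd = 1),
                                   rnorm(n = 7, mean = 0, sd = 1))) |> 
  list_rbind()

results_normal |> 
  summarize(type1 = mean(p.value < 0.05))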

Takeaway ideas

  • small takeaway: we shouldn’t necessarily use the t-test without checking the technical conditions (note here that there were pretty small sample sizes, which is also important).

  • bigger takeaway: we can use simulation to understand the error rates for hypothesis tests under particular initial conditions.

Sensitivity of CI to tech conditions

Consider the following set up:

set.seed(474)
beta0 <- -1; beta1 <- 0.5; beta2 <- 1.5
n_obs <- 100; reps <- 1000


x1 <- rep(c(0,1), each=n_obs/2)
x2 <- runif(n_obs, min=-1, max=1)
y <- beta0 + beta1*x1 + beta2*x2 + rnorm(n_obs, mean=0, sd = 1)

Plot of data

Capture the slope parameter?

  • We know the truth (!!!!!)
  • We can create a single CI for the parameter given a single data set
  • Is the parameter in the interval?
  • Should it always be in the interval? How often should it be?
  • Repeat the process to see whether the confidence interval captures the parameter at the claimed rate.

See: Rossman Chance regression applet

Running a linear model

library(broom)

CI <- lm(y ~ x1 + x2) |> tidy(conf.int = TRUE)
CI
# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   -0.929     0.142     -6.52 3.10e- 9  -1.21      -0.647
2 x1             0.330     0.202      1.64 1.05e- 1  -0.0698     0.731
3 x2             1.44      0.192      7.53 2.66e-11   1.06       1.83 
CI |>
  filter(term == "x2") |>
  select(term, estimate, conf.low, conf.high) |>
  mutate(inside = between(beta2, conf.low, conf.high))
# A tibble: 1 × 5
  term  estimate conf.low conf.high inside
  <chr>    <dbl>    <dbl>     <dbl> <lgl> 
1 x2        1.44     1.06      1.83 TRUE  

What happens to the CI of coefficients in repeated samples? (eq var)

eqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c(0,1), each = n()/2),
    x2 = runif(n(), min = -1, max = 1),
    y = beta0 + beta1*x1 + beta2*x2 + rnorm(n(), mean = 0, sd = 1)
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

eqvar_data
# A tibble: 1,000 × 2
# Groups:   sim_id [1,000]
   sim_id data              
    <int> <list>            
 1      1 <tibble [100 × 4]>
 2      2 <tibble [100 × 4]>
 3      3 <tibble [100 × 4]>
 4      4 <tibble [100 × 4]>
 5      5 <tibble [100 × 4]>
 6      6 <tibble [100 × 4]>
 7      7 <tibble [100 × 4]>
 8      8 <tibble [100 × 4]>
 9      9 <tibble [100 × 4]>
10     10 <tibble [100 × 4]>
# ℹ 990 more rows
beta_coef <- function(df){
  lm(y ~ x1 + x2, data = df) |>
    tidy(conf.int = TRUE) |>
    filter(term == "x2") |>
    select(estimate, conf.low, conf.high, p.value) 
}
eqvar_data |> 
  mutate(b2_vals = map(data, beta_coef)) |>
  select(b2_vals) |> 
  unnest(b2_vals) |>
  summarize(capture = between(beta2, conf.low, conf.high)) |>
  summarize(capture_rate = sum(capture)/reps)
# A tibble: 1 × 1
  capture_rate
         <dbl>
1        0.953
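
With 1000 replications, the Monte Carlo standard error of an estimated 95% capture rate is about \(\sqrt{0.95 \times 0.05 / 1000} \approx 0.007\), so the simulated 0.953 is consistent with the claimed 95% coverage.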

Sensitivity of CI to tech conditions

Consider the following set up (note the difference in variability):

set.seed(470)
beta0 <- -1; beta1 <- 0.5; beta2 <- 1.5
n_obs <- 100; reps <- 1000

  x1 <- rep(c(0,1), each=n_obs/2)
  x2 <- runif(n_obs, min=-1, max=1)
  y <- beta0 + beta1*x1 + beta2*x2 + 
             rnorm(n_obs, mean=0, sd = 1 + x1 + 10*abs(x2))

Plot of data

What happens to the CI of coefficients in repeated samples? (uneq var)

uneqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(sim_id = rep(1:reps, n_obs), 
         x1 = rep(c(0,1), each = n()/2),
         x2 = runif(n(), min = -1, max = 1),
         y = beta0 + beta1*x1 + beta2*x2 + 
              rnorm(n(), mean = 0, sd = 1 + x1 + 10*abs(x2))  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |> nest() 
uneqvar_data |> 
  mutate(b2_vals = map(data, beta_coef)) |>
  select(b2_vals) |> unnest(b2_vals) |>
  summarize(capture = between(beta2, conf.low, conf.high)) |>
  summarize(capture_rate = sum(capture)/reps)
# A tibble: 1 × 1
  capture_rate
         <dbl>
1        0.877

Equal variance: type I error?

Here is another way to think about type I error rates. With equal variance, the simulated type I error rate is close to what we’d expect: 5%.

t_test_pval <- function(df){
  t.test(y ~ x1, data = df, var.equal = TRUE) |>
    tidy() |>
    select(estimate, p.value) 
}
set.seed(470)
reps <- 1000
n_obs <- 20
null_data_equal <- 
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10, 
               sd = rep(c(1,1), each = n()/2))
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()
null_data_equal |> 
  filter(sim_id == 47) |> 
  unnest(cols = c(data)) |> 
  ggplot(aes(x = x1, y = y)) + 
  geom_boxplot() + 
  geom_jitter()

null_data_equal |> 
  mutate(t_vals = map(data,t_test_pval)) |>
  select(t_vals) |> 
  unnest(t_vals) |>
  ungroup(sim_id) |>
  summarize(type1error_rate = sum(p.value < 0.05)/reps)
# A tibble: 1 × 1
  type1error_rate
            <dbl>
1           0.045

Takeaway ideas

  • small takeaway: violating the equal-variance condition of the linear model changes the actual capture rate of a confidence interval.

  • bigger takeaway: we can use simulations to understand whether the confidence interval procedure is capturing the parameter at the rate we want.

Unequal variance: type I error?

With unequal variance, the type I error rate is much higher than we set. We set the type I error rate to be 5%, but the simulated rate was 6.8%.

set.seed(47)
reps <- 1000
n_obs <- 20
null_data_unequal <- 
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10, 
               sd = rep(c(1,100), each = n()/2))
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()
null_data_unequal |> 
  filter(sim_id == 47) |> 
  unnest(cols = c(data)) |> 
  ggplot(aes(x = x1, y = y)) + 
  geom_boxplot() + 
  geom_jitter()

null_data_unequal |> 
  mutate(t_vals = map(data,t_test_pval)) |>
  select(t_vals) |> 
  unnest(t_vals) |>
  ungroup(sim_id) |>
  summarize(type1error_rate = sum(p.value < 0.05)/reps)
# A tibble: 1 × 1
  type1error_rate
            <dbl>
1           0.068

Takeaway ideas

  • small takeaway: we shouldn’t use the t-test without checking the technical conditions, here the problem was unequal variance (again, small sample sizes were part of the problem).

  • bigger takeaway: we can use simulation to understand the error rates for hypothesis tests under particular initial conditions.

References

Baumer, Ben, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science with R. CRC Press. https://mdsr-book.github.io/mdsr2e/.
Mosteller, Frederick. 1987. Fifty Challenging Problems in Probability with Solutions. Mineola, NY: Dover Publications.