September 8 - 17, 2025
Variance refers to the amount by which the model ($\hat{f}$) would change if we estimated it using a different training data set.
Bias refers to the error that is introduced by approximating the “truth” by a model which is too simple.
We can measure how well a model does by looking at how close the fit of the model ($\hat{f}(x_0)$) is to the true response ($y_0$) at a new observation $x_0$, using the expected squared error $E[(y_0 - \hat{f}(x_0))^2]$.

Because $y_0 = f(x_0) + \epsilon$, the expected squared error splits into three terms:

$$E[(y_0 - \hat{f}(x_0))^2] = \underbrace{E[(f(x_0) - \hat{f}(x_0))^2]}_{\text{Term 1}} + \underbrace{2\,E[(f(x_0) - \hat{f}(x_0))\,\epsilon]}_{\text{Term 2}} + \underbrace{E[\epsilon^2]}_{\text{Term 3}}$$

Term 3: The third term is $\mbox{Var}(\epsilon)$, the irreducible error.

Term 2: Turns out to be zero because the errors are assumed to be independent of the explanatory variable, so the expected value can filter through.

Term 1: Takes a little bit of work to expand (something you’d do in, say, Math 152), and becomes $\mbox{Var}(\hat{f}(x_0)) + [\mbox{Bias}(\hat{f}(x_0))]^2$.

Putting the three terms together gives the bias-variance decomposition:

$$E[(y_0 - \hat{f}(x_0))^2] = \mbox{Var}(\hat{f}(x_0)) + [\mbox{Bias}(\hat{f}(x_0))]^2 + \mbox{Var}(\epsilon)$$

Where: the variance and bias of $\hat{f}(x_0)$ are as defined above, and $\mbox{Var}(\epsilon)$ is the irreducible error.
Let’s consider a model: $y = x^2 + \epsilon$, with $\epsilon \sim N(0, 5^2)$.
library(tidyverse)   # tibble, dplyr, tidyr, ggplot2, purrr

# simulate data from the model above and fit three models of increasing flexibility
num_x <- length(seq(0, 10, by = 0.5))
set.seed(4774)
bias_var_data <- tibble(ex = seq(0, 10, by = 0.5),
                        eps = rnorm(num_x, mean = 0, sd = 5),
                        why = ex^2 + eps) |>
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) +
                        I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10))$fitted) |>
  pivot_longer(cols = underfit:overfit,
               names_to = "fit",
               values_to = "prediction") |>
  mutate(fit = factor(fit,
                      levels = c("underfit", "good fit", "overfit")))

bias_var_data |>
  ggplot(aes(x = ex, y = why)) +
  geom_point() +
  geom_line(aes(y = prediction)) +
  facet_grid(~ fit)

Which is worse? Underfitting or overfitting? How can we measure the error?
WAIT! We don’t want the error on the data that built the model; we want the error on the population.
# A tibble: 3 × 2
fit MSE_newdata
<fct> <dbl>
1 underfit 98.6
2 good fit 27.3
3 overfit 35.8
# refit each of the three models so they can be used to predict new observations
lm1 <- bias_var_data |>
  filter(fit == "underfit") |>
  lm(why ~ ex, data = _)

lm2 <- bias_var_data |>
  filter(fit == "good fit") |>
  lm(why ~ ex + I(ex^2), data = _)

lm3 <- bias_var_data |>
  filter(fit == "overfit") |>
  lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) +
       I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10), data = _)

# new_data is a fresh draw from the same model; its definition isn't shown here, but it is
# presumably something like tibble(ex = seq(0, 10, by = 0.5), why = ex^2 + rnorm(num_x, mean = 0, sd = 5))
new_data |>
  mutate(underfit = predict(lm1, newdata = new_data),
         `good fit` = predict(lm2, newdata = new_data),
         overfit = predict(lm3, newdata = new_data)) |>
  pivot_longer(cols = underfit:overfit,
               names_to = "fit",
               values_to = "prediction") |>
  mutate(fit = factor(fit,
                      levels = c("underfit", "good fit", "overfit"))) |>
  group_by(fit) |>
  summarize(MSE_newdata = mean((why - prediction)^2))

set.seed(47)
# simulate nine datasets from the same model and fit all three models to each one
datasets <- 1:9 |>
  map_dfr(\(i) {
    tibble(
      ex = seq(0, 10, by = 0.5),
      eps = rnorm(num_x, mean = 0, sd = 5),
      why = ex^2 + eps,
      dataset = i
    )
  }) |>
  group_by(dataset) |>
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) +
                        I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10))$fitted) |>
  pivot_longer(cols = underfit:overfit,
               names_to = "fit",
               values_to = "prediction") |>
  mutate(fit = factor(fit,
                      levels = c("underfit", "good fit", "overfit")))
datasets

The underfit model looks the same for every dataset! That is, the model has low variance.
The underfit model doesn’t fit the data very well. The model has high bias.
The overfit model is very flexible, and looks like it fits the data very well. It has low bias.
The fitted models look quite different across the different datasets. The overfit model has high variance.
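One way to quantify these claims (a sketch we add here, using the datasets object created above; not part of the original code): for each model, compute how much the fitted values vary across the nine datasets at each value of ex, then average.

datasets |>
  group_by(fit, ex) |>
  summarize(pred_var = var(prediction), .groups = "drop") |>   # variance of the fit across the 9 datasets
  group_by(fit) |>
  summarize(avg_pred_var = mean(pred_var))   # underfit should be smallest, overfit largest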
Three simulation methods are used for different purposes:
Monte Carlo methods - use repeated sampling from a population with known characteristics.
Randomization / Permutation methods - use shuffling (sampling without replacement from a sample) to test hypotheses of “no effect”.
Bootstrap methods - use resampling (sampling with replacement from a sample) to establish confidence intervals.
The goal of simulating a complicated model is not only to create a program which will provide the desired results; we also hope to write code that is understandable and reusable.
ifelse() and case_when()

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium",
    TRUE ~ "inexpensive"))

# A tibble: 20 × 5
carat cut color price price_cat
<dbl> <ord> <ord> <int> <chr>
1 1.23 Very Good F 10276 expensive
2 0.35 Premium H 706 inexpensive
3 0.7 Good E 2782 medium
4 0.4 Ideal D 1637 medium
5 0.53 Ideal G 1255 inexpensive
6 2.22 Ideal G 14637 expensive
7 0.3 Ideal G 878 inexpensive
8 1.05 Ideal H 4223 medium
9 0.53 Premium E 1654 medium
10 1.7 Ideal H 7068 medium
11 0.31 Good E 698 inexpensive
12 0.31 Ideal F 840 inexpensive
13 1.03 Ideal H 4900 medium
14 0.31 Premium G 698 inexpensive
15 1.56 Premium G 8858 medium
16 1.71 Premium G 11032 expensive
17 1 Good E 5345 medium
18 1.86 Ideal J 10312 expensive
19 1.08 Very Good E 3726 medium
20 0.31 Premium E 698 inexpensive
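For comparison, the same three-level variable could be built with nested ifelse() calls (a sketch we add here; case_when() scales more gracefully as the number of categories grows):

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = ifelse(price > 10000, "expensive",
                            ifelse(price > 1500, "medium", "inexpensive")))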
sample_n()

We took a random sample of 20 rows from the diamonds dataset.
sampling, shuffling, and resampling: sample()
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
[1] "i" "b" "g" "d" "a"
[1] "j" "g" "f" "i" "f"
[1] "f" "h" "i" "e" "g" "d" "c" "j" "b" "a"
[1] "e" "j" "e" "b" "e" "c" "f" "a" "e" "a"
set.seed()

What if we want to be able to generate the same random numbers (here on the interval from 0 to 1) over and over?
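A sketch of the idea (the seed value 47 is an arbitrary choice): resetting the seed before each call makes the draws reproducible.

set.seed(47)
runif(4, min = 0, max = 1)
set.seed(47)
runif(4, min = 0, max = 1)   # identical to the draw above because the seed was reset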
n() and n_distinct()

n() counts the number of rows. n_distinct() counts the number of distinct values of a variable.
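For example (a sketch using the diamonds data from above):

diamonds |>
  group_by(cut) |>
  summarize(n_rows = n(),                   # number of rows (diamonds) in each cut
            n_colors = n_distinct(color))   # number of distinct color values in each cut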
Consider a situation where Sally and Joan plan to meet to study in their college campus center (Mosteller 1987; Baumer, Kaplan, and Horton 2021). They are both impatient people who will wait only 10 minutes for the other before leaving.
But their planning was incomplete. Sally said, “Meet me between 7 and 8 tonight at the student center.” When should Joan plan to arrive at the campus center? And what is the probability that they actually meet?
Assume that Sally and Joan are both equally likely to arrive at the campus center anywhere between 7pm and 8pm.
The results themselves are equivalent; the particular values differ only because of randomness in the simulation.
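One possible implementation (a sketch; the object names and the number of simulations are our own choices): simulate both arrival times as uniform over the hour and check whether they land within 10 minutes of each other.

set.seed(47)
n_sim <- 10000
meet_sim <- tibble(
  sally = runif(n_sim, min = 0, max = 60),   # arrival time, in minutes after 7pm
  joan  = runif(n_sim, min = 0, max = 60),
  meet  = abs(sally - joan) <= 10            # they meet if the arrivals are within 10 minutes
)
meet_sim |>
  summarize(prob_meet = mean(meet))          # should be close to the true probability, 11/36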
The t-test says that if the null hypothesis is true, we’d expect to reject a true null hypothesis 5% of the time (using α = 0.05).
The guarantee of 5% (across many tests) happens only when the data are normal (or reasonably normal).
t.test.pval <- function(data1, data2){
  t.test(data1, data2) |>   # does the t-test
    broom::tidy() |>        # modifies the output to be a data frame
    select(p.value)         # selects only the p-value from the output
}

t.test.pval(rexp(n = 10, rate = 20), rexp(n = 7, rate = 20))

# A tibble: 1 × 1
p.value
<dbl>
1 0.160
small takeaway: we shouldn’t necessarily use the t-test without checking the technical conditions (note here that there were pretty small sample sizes, which is also important).
bigger takeaway: we can use simulation to understand the error rates for hypothesis tests under particular initial conditions.
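For example, a sketch of such a simulation (the number of repetitions and the seed are arbitrary choices): repeat the exponential-data t-test many times and estimate how often a true null hypothesis is rejected at the 5% level.

set.seed(4747)
sim_pvals <- map_dbl(1:1000, \(i) {
  t.test.pval(rexp(n = 10, rate = 20),
              rexp(n = 7, rate = 20))$p.value
})
mean(sim_pvals < 0.05)   # estimated type I error rate when the data are exponential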
Consider the following set up:
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.929 0.142 -6.52 3.10e- 9 -1.21 -0.647
2 x1 0.330 0.202 1.64 1.05e- 1 -0.0698 0.731
3 x2 1.44 0.192 7.53 2.66e-11 1.06 1.83
# CI is (presumably) the tidy() output with confidence intervals shown above, and
# beta2 is the true coefficient on x2 that was used to generate the data
CI |>
  filter(term == "x2") |>
  select(term, estimate, conf.low, conf.high) |>
  mutate(inside = between(beta2, conf.low, conf.high))

# A tibble: 1 × 5
term estimate conf.low conf.high inside
<chr> <dbl> <dbl> <dbl> <lgl>
1 x2 1.44 1.06 1.83 TRUE
# assumes reps, n_obs, beta0, beta1, and beta2 were defined earlier
# (from the output below, reps = 1000 and n_obs = 100)
eqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c(0, 1), each = n()/2),
    x2 = runif(n(), min = -1, max = 1),
    y = beta0 + beta1*x1 + beta2*x2 + rnorm(n(), mean = 0, sd = 1)
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

eqvar_data

# A tibble: 1,000 × 2
# Groups: sim_id [1,000]
sim_id data
<int> <list>
1 1 <tibble [100 × 4]>
2 2 <tibble [100 × 4]>
3 3 <tibble [100 × 4]>
4 4 <tibble [100 × 4]>
5 5 <tibble [100 × 4]>
6 6 <tibble [100 × 4]>
7 7 <tibble [100 × 4]>
8 8 <tibble [100 × 4]>
9 9 <tibble [100 × 4]>
10 10 <tibble [100 × 4]>
# ℹ 990 more rows
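To estimate the capture rate across all 1,000 simulated datasets, one approach (a sketch, not necessarily the original code; it assumes beta2 holds the true coefficient used to generate the data) is to fit the model to each nested dataset, pull the confidence interval for x2, and count how often it contains beta2.

eqvar_data |>
  mutate(CI_x2 = map(data, \(d) {
    lm(y ~ x1 + x2, data = d) |>
      broom::tidy(conf.int = TRUE) |>
      filter(term == "x2")
  })) |>
  select(sim_id, CI_x2) |>
  unnest(CI_x2) |>
  ungroup() |>
  summarize(capture_rate = mean(conf.low < beta2 & beta2 < conf.high))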
Consider the following set up (note the difference in variability):
uneqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(sim_id = rep(1:reps, n_obs),
         x1 = rep(c(0, 1), each = n()/2),
         x2 = runif(n(), min = -1, max = 1),
         y = beta0 + beta1*x1 + beta2*x2 +
           rnorm(n(), mean = 0, sd = 1 + x1 + 10*abs(x2))) |>   # the error SD now depends on x1 and x2
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

Another way to think about type I error rates: with equal variance, the simulated type I error rate is close to what we’d expect, 5%.
set.seed(470)
reps <- 1000
n_obs <- 20
null_data_equal <-
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10,
              sd = rep(c(1, 1), each = n()/2))   # both groups have SD = 1
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

small takeaway: the linear model condition of equal variance will impact what the actual capture rate is for a confidence interval.
bigger takeaway: we can use simulations to understand whether the confidence interval procedure is capturing the parameter at the rate we want.
With unequal variance, the type I error rate is much higher than we set. We set the type I error rate to be 5%, but the simulated rate was 6.8%.
set.seed(47)
reps <- 1000
n_obs <- 20
null_data_unequal <-
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10,
              sd = rep(c(1, 100), each = n()/2))   # the two groups have very different SDs
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

small takeaway: we shouldn’t use the t-test without checking the technical conditions; here the problem was unequal variance (again, small sample sizes were part of the problem).
bigger takeaway: we can use simulation to understand the error rates for hypothesis tests under particular initial conditions.
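A sketch of how the simulated type I error rates could be computed from these nested datasets (our own code, not necessarily what produced the 6.8% above): fit the equal-variance two-group comparison to each simulated dataset and record how often the group difference is (wrongly) declared significant at the 5% level.

type1_rate <- function(null_data) {
  null_data |>
    mutate(p_val = map_dbl(data, \(d) {
      lm(y ~ x1, data = d) |>           # two-group comparison assuming equal variance
        broom::tidy() |>
        filter(term == "x1group2") |>
        pull(p.value)
    })) |>
    ungroup() |>
    summarize(type1_error = mean(p_val < 0.05))
}

type1_rate(null_data_equal)     # should be close to the nominal 5%
type1_rate(null_data_unequal)   # the notes report 6.8% in a setting like this one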