September 8 - 17, 2025
Variance refers to the amount by which the model ($\hat{f}$) would change if we estimated it using a different training data set.
Bias refers to the error that is introduced by approximating the “truth” by a model which is too simple.
We can measure how well a model does by looking at how close the fit of the model ($\hat{f}(x_0)$) is to the true response ($y_0$) at a new observation $x_0$, using the expected squared error $E[(y_0 - \hat{f}(x_0))^2]$.

Because $y_0 = f(x_0) + \epsilon$, the expected squared error splits into three terms:

$$E[(y_0 - \hat{f}(x_0))^2] = \underbrace{E[(f(x_0) - \hat{f}(x_0))^2]}_{\text{Term 1}} + \underbrace{2\,E[(f(x_0) - \hat{f}(x_0))\,\epsilon]}_{\text{Term 2}} + \underbrace{E[\epsilon^2]}_{\text{Term 3}}$$

Term 3: The third term is $\mbox{Var}(\epsilon)$, the irreducible error.

Term 2: Turns out to be zero because the errors are assumed to be independent of the explanatory variable, so the expected value can filter through.

Term 1: Takes a little bit of work to expand (something you’d do in, say, Math 152), and becomes $\mbox{Var}(\hat{f}(x_0)) + [\mbox{Bias}(\hat{f}(x_0))]^2$.

Putting the three terms together gives the bias-variance decomposition:

$$E[(y_0 - \hat{f}(x_0))^2] = \mbox{Var}(\hat{f}(x_0)) + [\mbox{Bias}(\hat{f}(x_0))]^2 + \mbox{Var}(\epsilon)$$

Where: the variance and bias of $\hat{f}(x_0)$ are as defined above, and $\mbox{Var}(\epsilon)$ is the irreducible error.
Let’s consider a model: $y = x^2 + \epsilon$, with $\epsilon \sim N(0, 5^2)$.
library(tidyverse)   # tibble, dplyr, tidyr, ggplot2, purrr

# simulate data from the model above and fit three models of increasing flexibility
num_x <- length(seq(0, 10, by = 0.5))
set.seed(4774)
bias_var_data <- tibble(ex = seq(0, 10, by = 0.5),
                        eps = rnorm(num_x, mean = 0, sd = 5),
                        why = ex^2 + eps) |>
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) +
                        I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10))$fitted) |>
  pivot_longer(cols = underfit:overfit,
               names_to = "fit",
               values_to = "prediction") |>
  mutate(fit = factor(fit,
                      levels = c("underfit", "good fit", "overfit")))

bias_var_data |>
  ggplot(aes(x = ex, y = why)) +
  geom_point() +
  geom_line(aes(y = prediction)) +
  facet_grid(~ fit)

Which is worse? Underfitting or overfitting? How can we measure the error?
WAIT! We don’t want the error on the data that built the model; we want the error on the population.
# A tibble: 3 × 2
fit MSE_newdata
<fct> <dbl>
1 underfit 98.6
2 good fit 27.3
3 overfit 35.8
# refit each of the three models so they can be used to predict new observations
lm1 <- bias_var_data |>
  filter(fit == "underfit") |>
  lm(why ~ ex, data = _)

lm2 <- bias_var_data |>
  filter(fit == "good fit") |>
  lm(why ~ ex + I(ex^2), data = _)

lm3 <- bias_var_data |>
  filter(fit == "overfit") |>
  lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) +
       I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10), data = _)

# new_data is a fresh draw from the same model; its definition isn't shown here, but it is
# presumably something like tibble(ex = seq(0, 10, by = 0.5), why = ex^2 + rnorm(num_x, mean = 0, sd = 5))
new_data |>
  mutate(underfit = predict(lm1, newdata = new_data),
         `good fit` = predict(lm2, newdata = new_data),
         overfit = predict(lm3, newdata = new_data)) |>
  pivot_longer(cols = underfit:overfit,
               names_to = "fit",
               values_to = "prediction") |>
  mutate(fit = factor(fit,
                      levels = c("underfit", "good fit", "overfit"))) |>
  group_by(fit) |>
  summarize(MSE_newdata = mean((why - prediction)^2))

set.seed(47)
# simulate nine datasets from the same model and fit all three models to each one
datasets <- 1:9 |>
  map_dfr(\(i) {
    tibble(
      ex = seq(0, 10, by = 0.5),
      eps = rnorm(num_x, mean = 0, sd = 5),
      why = ex^2 + eps,
      dataset = i
    )
  }) |>
  group_by(dataset) |>
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) +
                        I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10))$fitted) |>
  pivot_longer(cols = underfit:overfit,
               names_to = "fit",
               values_to = "prediction") |>
  mutate(fit = factor(fit,
                      levels = c("underfit", "good fit", "overfit")))
datasets

The underfit model looks the same for every dataset! That is, the model has low variance.
The underfit model doesn’t fit the data very well. The model has high bias.
The overfit model is very flexible, and looks like it fits the data very well. It has low bias.
The fitted models look quite different across the different datasets. The overfit model has high variance.
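One way to quantify these claims (a sketch we add here, using the datasets object created above; not part of the original code): for each model, compute how much the fitted values vary across the nine datasets at each value of ex, then average.

datasets |>
  group_by(fit, ex) |>
  summarize(pred_var = var(prediction), .groups = "drop") |>   # variance of the fit across the 9 datasets
  group_by(fit) |>
  summarize(avg_pred_var = mean(pred_var))   # underfit should be smallest, overfit largest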
Three simulation methods are used for different purposes:
Monte Carlo methods - use repeated sampling from a population with known characteristics.
Randomization / Permutation methods - use shuffling (sampling without replacement from a sample) to test hypotheses of “no effect”.
Bootstrap methods - use resampling (sampling with replacement from a sample) to establish confidence intervals.
The goal of simulating a complicated model is not only to create a program which will provide the desired results; we also hope to write code that is understandable and reusable.
ifelse() and case_when()

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium",
    TRUE ~ "inexpensive"))

# A tibble: 20 × 5
carat cut color price price_cat
<dbl> <ord> <ord> <int> <chr>
1 1.23 Very Good F 10276 expensive
2 0.35 Premium H 706 inexpensive
3 0.7 Good E 2782 medium
4 0.4 Ideal D 1637 medium
5 0.53 Ideal G 1255 inexpensive
6 2.22 Ideal G 14637 expensive
7 0.3 Ideal G 878 inexpensive
8 1.05 Ideal H 4223 medium
9 0.53 Premium E 1654 medium
10 1.7 Ideal H 7068 medium
11 0.31 Good E 698 inexpensive
12 0.31 Ideal F 840 inexpensive
13 1.03 Ideal H 4900 medium
14 0.31 Premium G 698 inexpensive
15 1.56 Premium G 8858 medium
16 1.71 Premium G 11032 expensive
17 1 Good E 5345 medium
18 1.86 Ideal J 10312 expensive
19 1.08 Very Good E 3726 medium
20 0.31 Premium E 698 inexpensive
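For comparison, the same three-level variable could be built with nested ifelse() calls (a sketch we add here; case_when() scales more gracefully as the number of categories grows):

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = ifelse(price > 10000, "expensive",
                            ifelse(price > 1500, "medium", "inexpensive")))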
sample_n()

We took a random sample of 20 rows from the diamonds dataset.
sampling, shuffling, and resampling: sample()
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
[1] "i" "b" "g" "d" "a"
[1] "j" "g" "f" "i" "f"
[1] "f" "h" "i" "e" "g" "d" "c" "j" "b" "a"
[1] "e" "j" "e" "b" "e" "c" "f" "a" "e" "a"
set.seed()

What if we want to be able to generate the same random numbers (here on the interval from 0 to 1) over and over?
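A sketch of the idea (the seed value 47 is an arbitrary choice): resetting the seed before each call makes the draws reproducible.

set.seed(47)
runif(4, min = 0, max = 1)
set.seed(47)
runif(4, min = 0, max = 1)   # identical to the draw above because the seed was reset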
n() and n_distinct()

n() counts the number of rows. n_distinct() counts the number of distinct values of a variable.
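For example (a sketch using the diamonds data from above):

diamonds |>
  group_by(cut) |>
  summarize(n_rows = n(),                   # number of rows (diamonds) in each cut
            n_colors = n_distinct(color))   # number of distinct color values in each cut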
Consider a situation where Sally and Joan plan to meet to study in their college campus center (Mosteller 1987; Baumer, Kaplan, and Horton 2021). They are both impatient people who will wait only 10 minutes for the other before leaving.
But their planning was incomplete. Sally said, “Meet me between 7 and 8 tonight at the student center.” When should Joan plan to arrive at the campus center? And what is the probability that they actually meet?
Assume that Sally and Joan are both equally likely to arrive at the campus center anywhere between 7pm and 8pm.
The results themselves are equivalent; the particular values differ only because of randomness in the simulation.
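One possible implementation (a sketch; the object names and the number of simulations are our own choices): simulate both arrival times as uniform over the hour and check whether they land within 10 minutes of each other.

set.seed(47)
n_sim <- 10000
meet_sim <- tibble(
  sally = runif(n_sim, min = 0, max = 60),   # arrival time, in minutes after 7pm
  joan  = runif(n_sim, min = 0, max = 60),
  meet  = abs(sally - joan) <= 10            # they meet if the arrivals are within 10 minutes
)
meet_sim |>
  summarize(prob_meet = mean(meet))          # should be close to the true probability, 11/36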
The t-test says that if the null hypothesis is true, we’d expect to reject a true null hypothesis 5% of the time (using α = 0.05).
The guarantee of 5% (across many tests) happens only when the data are normal (or reasonably normal).
t.test.pval <- function(data1, data2){
  t.test(data1, data2) |>   # does the t-test
    broom::tidy() |>        # modifies the output to be a data frame
    select(p.value)         # selects only the p-value from the output
}

t.test.pval(rexp(n = 10, rate = 20), rexp(n = 7, rate = 20))

# A tibble: 1 × 1
p.value
<dbl>
1 0.160
small takeaway: we shouldn’t necessarily use the t-test without checking the technical conditions (note here that there were pretty small sample sizes, which is also important).
bigger takeaway: we can use simulation to understand the error rates for hypothesis tests under particular initial conditions.
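For example, a sketch of such a simulation (the number of repetitions and the seed are arbitrary choices): repeat the exponential-data t-test many times and estimate how often a true null hypothesis is rejected at the 5% level.

set.seed(4747)
sim_pvals <- map_dbl(1:1000, \(i) {
  t.test.pval(rexp(n = 10, rate = 20),
              rexp(n = 7, rate = 20))$p.value
})
mean(sim_pvals < 0.05)   # estimated type I error rate when the data are exponential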
Consider the following set up:
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.929 0.142 -6.52 3.10e- 9 -1.21 -0.647
2 x1 0.330 0.202 1.64 1.05e- 1 -0.0698 0.731
3 x2 1.44 0.192 7.53 2.66e-11 1.06 1.83
# CI is (presumably) the tidy() output with confidence intervals shown above, and
# beta2 is the true coefficient on x2 that was used to generate the data
CI |>
  filter(term == "x2") |>
  select(term, estimate, conf.low, conf.high) |>
  mutate(inside = between(beta2, conf.low, conf.high))

# A tibble: 1 × 5
term estimate conf.low conf.high inside
<chr> <dbl> <dbl> <dbl> <lgl>
1 x2 1.44 1.06 1.83 TRUE
# assumes reps, n_obs, beta0, beta1, and beta2 were defined earlier
# (from the output below, reps = 1000 and n_obs = 100)
eqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c(0, 1), each = n()/2),
    x2 = runif(n(), min = -1, max = 1),
    y = beta0 + beta1*x1 + beta2*x2 + rnorm(n(), mean = 0, sd = 1)
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

eqvar_data

# A tibble: 1,000 × 2
# Groups: sim_id [1,000]
sim_id data
<int> <list>
1 1 <tibble [100 × 4]>
2 2 <tibble [100 × 4]>
3 3 <tibble [100 × 4]>
4 4 <tibble [100 × 4]>
5 5 <tibble [100 × 4]>
6 6 <tibble [100 × 4]>
7 7 <tibble [100 × 4]>
8 8 <tibble [100 × 4]>
9 9 <tibble [100 × 4]>
10 10 <tibble [100 × 4]>
# ℹ 990 more rows
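To estimate the capture rate across all 1,000 simulated datasets, one approach (a sketch, not necessarily the original code; it assumes beta2 holds the true coefficient used to generate the data) is to fit the model to each nested dataset, pull the confidence interval for x2, and count how often it contains beta2.

eqvar_data |>
  mutate(CI_x2 = map(data, \(d) {
    lm(y ~ x1 + x2, data = d) |>
      broom::tidy(conf.int = TRUE) |>
      filter(term == "x2")
  })) |>
  select(sim_id, CI_x2) |>
  unnest(CI_x2) |>
  ungroup() |>
  summarize(capture_rate = mean(conf.low < beta2 & beta2 < conf.high))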
Consider the following set up (note the difference in variability):
uneqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(sim_id = rep(1:reps, n_obs),
         x1 = rep(c(0, 1), each = n()/2),
         x2 = runif(n(), min = -1, max = 1),
         y = beta0 + beta1*x1 + beta2*x2 +
           rnorm(n(), mean = 0, sd = 1 + x1 + 10*abs(x2))) |>   # the error SD now depends on x1 and x2
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

Another way to think about type I error rates: with equal variance, the simulated type I error rate is close to what we’d expect, 5%.
set.seed(470)
reps <- 1000
n_obs <- 20
null_data_equal <-
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10,
              sd = rep(c(1, 1), each = n()/2))   # both groups have SD = 1
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

small takeaway: the linear model condition of equal variance will impact what the actual capture rate is for a confidence interval.
bigger takeaway: we can use simulations to understand whether the confidence interval procedure is capturing the parameter at the rate we want.
With unequal variance, the type I error rate is much higher than we set. We set the type I error rate to be 5%, but the simulated rate was 6.8%.
set.seed(47)
reps <- 1000
n_obs <- 20
null_data_unequal <-
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10,
              sd = rep(c(1, 100), each = n()/2))   # the two groups have very different SDs
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

small takeaway: we shouldn’t use the t-test without checking the technical conditions; here the problem was unequal variance (again, small sample sizes were part of the problem).
bigger takeaway: we can use simulation to understand the error rates for hypothesis tests under particular initial conditions.
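A sketch of how the simulated type I error rates could be computed from these nested datasets (our own code, not necessarily what produced the 6.8% above): fit the equal-variance two-group comparison to each simulated dataset and record how often the group difference is (wrongly) declared significant at the 5% level.

type1_rate <- function(null_data) {
  null_data |>
    mutate(p_val = map_dbl(data, \(d) {
      lm(y ~ x1, data = d) |>           # two-group comparison assuming equal variance
        broom::tidy() |>
        filter(term == "x1group2") |>
        pull(p.value)
    })) |>
    ungroup() |>
    summarize(type1_error = mean(p_val < 0.05))
}

type1_rate(null_data_equal)     # should be close to the nominal 5%
type1_rate(null_data_unequal)   # the notes report 6.8% in a setting like this one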