Permutation Tests

September 22 + 24 + 29, 2025

Jo Hardin

Agenda 9/22/25

  1. Logic of hypothesis testing
  2. Logic of permutation tests
  3. Examples - testing \(k\) means

Statistics Without the Agonizing Pain

John Rauser of Pinterest (now Amazon), delivering a keynote address on permutation tests at Strata + Hadoop 2014. https://blog.revolutionanalytics.com/2014/10/statistics-doesnt-have-to-be-that-hard.html

Logic of hypothesis tests

  1. Choose a statistic that measures the effect

  2. Construct the sampling distribution under \(H_0\) (almost always done using technical mathematics)

  3. Locate the observed statistic in the null sampling distribution

  4. p-value is the probability of the observed data or more extreme if the null hypothesis is true

Logic of permutation tests

  1. Choose a test statistic

  2. Shuffle the data (force the null hypothesis to be true)

  3. Create a null sampling distribution of the test statistic (under \(H_0\)) (done using the computer, not calculus)

  4. Find the observed test statistic on the null sampling distribution and compute the p-value (observed data or more extreme). The p-value can be one- or two-sided.
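
As a preview, the four steps can be sketched in R (in the tidyverse style used throughout these notes). This is only a sketch: the data frame dat and its columns outcome (numeric) and group (two levels) are hypothetical placeholders, and the test statistic is a difference in means.

library(dplyr)
library(purrr)

# 1. test statistic: difference in group means on the observed data
obs_stat <- dat |>
  group_by(group) |>
  summarize(m = mean(outcome)) |>
  summarize(stat = diff(m)) |>
  pull()

# 2 & 3. shuffle the response (forcing H0) and recompute the statistic many times
null_stats <- map_dbl(1:1000, \(i) {
  dat |>
    mutate(outcome_perm = sample(outcome, replace = FALSE)) |>
    group_by(group) |>
    summarize(m = mean(outcome_perm)) |>
    summarize(stat = diff(m)) |>
    pull()
})

# 4. p-value: proportion of permuted statistics at least as extreme (two-sided)
mean(abs(null_stats) >= abs(obs_stat))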

Consider the NHANES dataset.

  • Income
    • (HHIncomeMid - Numerical version of HHIncome derived from the middle income in each category)
  • Health
    • (HealthGen - Self-reported rating of the participant’s health in general. Reported for participants aged 12 years or older. One of Excellent, Vgood, Good, Fair, or Poor.)

Summary of the variables of interest

NHANES  |> 
  group_by(HealthGen)  |> 
  summarize(skimr::skim_without_charts(HHIncomeMid)) |> 
  select(HealthGen, n_missing, numeric.mean, numeric.sd, numeric.p0, numeric.p50, numeric.p100) |> 
  gt::gt()
HealthGen   n_missing   numeric.mean   numeric.sd   numeric.p0   numeric.p50   numeric.p100
Excellent          61       69354.35     32131.03         2500         87500         100000
Vgood             166       65010.67     32071.03         2500         70000         100000
Good              212       55662.35     31320.81         2500         50000         100000
Fair              111       44193.55     30688.83         2500         30000         100000
Poor               23       37027.44     29396.50         2500         30000         100000
NA                238       53175.89     33978.71         2500         50000         100000

Mean Income broken down by Health

NH.means <- NHANES  |> 
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())
NH.means
# A tibble: 5 × 3
  HealthGen IncMean count
  <fct>       <dbl> <int>
1 Excellent  69354.   817
2 Vgood      65011.  2342
3 Good       55662.  2744
4 Fair       44194.   899
5 Poor       37027.   164

Are the differences in means simply due to random chance?

Income and Health

NHANES  |> filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |> 
ggplot(aes(x=HealthGen, y=HHIncomeMid)) + geom_boxplot()

Differences in Income ($)

           Excellent      Vgood      Good      Fair      Poor
Excellent      0.000   4343.671  13691.99 25160.797 32326.906
Vgood      -4343.671      0.000   9348.32 20817.126 27983.236
Good      -13691.991  -9348.320      0.00 11468.806 18634.915
Fair      -25160.797 -20817.126 -11468.81     0.000  7166.109
Poor      -32326.906 -27983.236 -18634.92 -7166.109     0.000

Overall difference

We can measure the overall differences as the amount of variability between each of the group means and the overall mean:

\[F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_i n_i(\overline{X}_{i\cdot} - \overline{X})^2/(K-1)}{\sum_{ij} (X_{ij}-\overline{X}_{i\cdot})^2/(N-K)}\]

Because permuting the data does not change the total variability, the between-group sum of squares orders the permutations in the same way as \(F\), so we use it as the test statistic:

\[SumSqBtwn = \sum_i n_i(\overline{X}_{i\cdot} - \overline{X})^2\]

Creating a test statistic

NHANES  |> select(HHIncomeMid, HealthGen)  |> 
  filter(!is.na(HealthGen)& !is.na(HHIncomeMid))
# A tibble: 6,966 × 2
   HHIncomeMid HealthGen
         <int> <fct>    
 1       30000 Good     
 2       30000 Good     
 3       30000 Good     
 4       40000 Good     
 5       87500 Vgood    
 6       87500 Vgood    
 7       87500 Vgood    
 8       30000 Vgood    
 9      100000 Vgood    
10       70000 Fair     
# ℹ 6,956 more rows

Creating a test statistic

The pull() function creates a vector (or single value) instead of a data frame.

GM <- NHANES  |> summarize(mean(HHIncomeMid, na.rm=TRUE))  |> pull()
GM
[1] 57206.17
NH.means
# A tibble: 5 × 3
  HealthGen IncMean count
  <fct>       <dbl> <int>
1 Excellent  69354.   817
2 Vgood      65011.  2342
3 Good       55662.  2744
4 Fair       44194.   899
5 Poor       37027.   164

Creating a test statistic

I’ve tried to break down the process of creating the test statistic, but the syntax is slightly different from what we would do in the actual tidy pipeline. It might be easier to follow the full calculation of the observed test statistic two slides ahead.

NH.means  |> select(IncMean)  |> pull() - GM
[1]  12148.175   7804.504  -1543.816 -13012.622 -20178.731
(NH.means  |> select(IncMean)  |> pull() - GM)^2
[1] 147578150  60910286   2383368 169328332 407181201
NH.means  |> select(count)  |> pull()
[1]  817 2342 2744  899  164
NH.means  |> select(count)  |> pull() * 
  (NH.means  |> select(IncMean)  |> pull() - GM)^2
[1] 120571348234 142651889943   6539963000 152226170649  66777716928
sum(NH.means  |> select(count)  |> pull() * 
  (NH.means  |> select(IncMean)  |> pull() - GM)^2)
[1] 488767088754

Creating a test statistic

\[SumSqBtwn = \sum_i n_i(\overline{X}_{i\cdot} - \overline{X})^2\]

sum(NH.means  |> select(count)  |> pull() * 
      (NH.means  |> select(IncMean)  |> pull() - GM)^2)
[1] 488767088754

The observed test statistic

GM <- NHANES  |> summarize(mean(HHIncomeMid, na.rm=TRUE))  |> pull()
GM
[1] 57206.17
SSB_obs <- NHANES  |>
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())  |>
  summarize(obs_teststat = sum(count*(IncMean - GM)^2)) 

SSB_obs
# A tibble: 1 × 1
   obs_teststat
          <dbl>
1 488767088754.

Permuting the data

NHANES  |> 
  filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |>
  mutate(IncomePerm = sample(HHIncomeMid, replace=FALSE))  |>
  select(HealthGen, HHIncomeMid, IncomePerm) 
# A tibble: 6,966 × 3
   HealthGen HHIncomeMid IncomePerm
   <fct>           <int>      <int>
 1 Good            30000      50000
 2 Good            30000      12500
 3 Good            30000     100000
 4 Good            40000      30000
 5 Vgood           87500      50000
 6 Vgood           87500      70000
 7 Vgood           87500       7500
 8 Vgood           30000     100000
 9 Vgood          100000      40000
10 Fair            70000     100000
# ℹ 6,956 more rows

Permuting the data & a new test statistic

NHANES  |> 
  filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |>
  mutate(IncomePerm = sample(HHIncomeMid, replace=FALSE))  |>
  group_by(HealthGen)  |> 
  summarize(IncMeanP = mean(IncomePerm), count=n())  |>
  summarize(teststat = sum(count*(IncMeanP - GM)^2))
# A tibble: 1 × 1
      teststat
         <dbl>
1 21622040975.

Lots of times…

reps <- 1000

SSB_perm_func <- function(.x){
  NHANES  |> 
        filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |>
        mutate(IncomePerm = sample(HHIncomeMid, replace=FALSE))  |>
        group_by(HealthGen)  |> 
        summarize(IncMeanP = mean(IncomePerm), count=n())  |>
        summarize(teststat = sum(count*(IncMeanP - GM)^2)) 
}

SSB_perm_val <- map(1:reps, SSB_perm_func) |> 
  list_rbind()

SSB_perm_val
# A tibble: 1,000 × 1
       teststat
          <dbl>
 1 15485792584.
 2 13404483026.
 3 20777315121.
 4 15049867857.
 5 12546504102.
 6 15459609593.
 7 20169837162.
 8 15372407832.
 9 21775634127.
10 14464305913.
# ℹ 990 more rows

Compared to the real data

SSB_obs <- NHANES  |>
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())  |>
  summarize(obs_teststat = sum(count*(IncMean - GM)^2)) 

SSB_obs 
# A tibble: 1 × 1
   obs_teststat
          <dbl>
1 488767088754.
sum(SSB_perm_val  |> pull() > SSB_obs  |> pull() ) / reps
[1] 0

None of the 1000 permuted statistics was as large as the observed statistic, so the estimated p-value is 0 (that is, p < 1/1000).

Compared to the observed test statistic

Agenda 9/24/25

  1. Conditions, exchangeability, random structure
  2. Different structures and statistics

Exchangeability

If the null hypothesis is true, the labels assigning groups are interchangeable with respect to the probability distribution.

Typically (in the two-group setting),

\[H_0: F_1(x) = F_2(x)\]

(there are no distributional or parametric conditions)

Exchangeability

More generally, we might use the following exchangeability definition

Data are exchangeable under the null hypothesis if the joint distribution from which the data came is the same before permutation as after permutation when the null hypothesis is true.

Probability as measured by what?

  • Random Sample The concept of a p-value usually comes from the idea of taking a sample from a population and comparing it to a sampling distribution (built from many, many random samples).

  • Randomized Experiment The p-value compares the observed data to what would have happened if the treatment variable had been allocated to the groups “by chance.”

Permuting independent observations

Consider a “family” structure where some individuals are exposed and others are not (control).

Permuting homogeneous clusters

Consider a “family” structure where individuals in a cluster always have the same treatment.

Permuting heterogeneous clusters

Consider a “family” structure where individuals in a cluster always have the opposite treatment.
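
The three structures imply three different ways to shuffle. Here is a minimal sketch; everything in it is hypothetical (made-up data frames with a family id and a treatment indicator trt).

library(dplyr)

# hypothetical example: four families ("clusters") with two people each
fam_homog <- tibble(
  family = rep(1:4, each = 2),
  trt    = rep(c(1, 0, 1, 0), each = 2)   # same treatment within each family
)
fam_heterog <- tibble(
  family = rep(1:4, each = 2),
  trt    = rep(c(1, 0), times = 4)        # opposite treatments within each family
)

# independent observations: shuffle treatment across all individuals
fam_homog |>
  mutate(trt_perm = sample(trt, replace = FALSE))

# homogeneous clusters: permute the treatment label across whole families
fam_homog |>
  select(family) |>
  left_join(
    fam_homog |>
      distinct(family, trt) |>
      mutate(trt_perm = sample(trt, replace = FALSE)),
    by = "family"
  )

# heterogeneous clusters: permute treatment within each family
# (each pair's labels are either kept or swapped)
fam_heterog |>
  group_by(family) |>
  mutate(trt_perm = sample(trt, replace = FALSE)) |>
  ungroup()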

Agenda 9/29/25

  • Exchangeability
  • Nested permutations – MacNell evaluations
  • Multiple linear regression

Gender bias in teaching evaluations

The Economist, Sep 21, 2017

Gender bias in teaching evaluations

Gender bias was seen in a study of 20,000 students. The students knew the gender of their professor but did not choose their class. https://doi.org/10.1093/jeea/jvx057

Gender bias in teaching evaluations

86 students were randomly assigned to TA sections. Innovative Higher Education, 40, pages 291–303 (2015). https://doi.org/10.58188/1941-8043.1509

Gender bias in teaching evaluations

Gender bias: MacNell data

temp <- macnell |> 
  group_by(tagender) |> 
  mutate(perm = sample(taidgender, replace = FALSE)) |> 
  select(overall,tagender, taidgender, perm) |> 
  arrange(tagender)

Analysis goal

We want to know whether the population average score differs across perceived gender.

\[H_0: \mu_{ID.Female} = \mu_{ID.Male}\]

Note that for the permutation test, under the null hypothesis not only are the means of the population distributions the same, but so are the variances and all other aspects of the distributions across perceived gender.

MacNell Data without permutation

taidgender indicates perceived gender.

macnell  |>
  select(overall, tagender, taidgender) 
# A tibble: 47 × 3
   overall tagender taidgender
     <dbl>    <dbl>      <dbl>
 1       4        0          1
 2       4        0          1
 3       5        0          1
 4       5        0          1
 5       5        0          1
 6       4        0          1
 7       4        0          1
 8       5        0          1
 9       4        0          1
10       3        0          1
# ℹ 37 more rows

Permuting MacNell data

Conceptually, there are two levels of randomization:

  1. \(N_m\) students are randomly assigned to the male instructor and \(N_f\) are assigned to the female instructor.

  2. Of the \(N_j\) assigned to instructor \(j\), \(N_{jm}\) are told that the instructor is male, and \(N_{jf}\) are told that the instructor is female for \(j=m,f\).

taidgender indicates perceived gender.

macnell  |>
  group_by(tagender, taidgender)  |>
  summarize(n())
# A tibble: 4 × 3
# Groups:   tagender [2]
  tagender taidgender `n()`
     <dbl>      <dbl> <int>
1        0          0    11
2        0          1    12
3        1          0    13
4        1          1    11

Stratified two-sample test:

  • For each instructor, permute perceived gender assignments.
  • Use difference in mean ratings for female-identified vs male-identified instructors.

MacNell Data with permutation

taidgender indicates perceived gender.

macnell  |> 
  group_by(tagender)  |>
  mutate(permTAID = sample(taidgender, replace=FALSE))  |>
  select(overall, tagender, taidgender, permTAID) 
# A tibble: 47 × 4
# Groups:   tagender [2]
   overall tagender taidgender permTAID
     <dbl>    <dbl>      <dbl>    <dbl>
 1       4        0          1        0
 2       4        0          1        1
 3       5        0          1        0
 4       5        0          1        1
 5       5        0          1        1
 6       4        0          1        1
 7       4        0          1        0
 8       5        0          1        0
 9       4        0          1        0
10       3        0          1        0
# ℹ 37 more rows

MacNell Data with permutation

taidgender indicates perceived gender.

macnell  |> 
  group_by(tagender)  |>
  mutate(permTAID = sample(taidgender, replace=FALSE))  |>
  ungroup(tagender)  |>
  group_by(permTAID)  |>
  summarize(pmeans = mean(overall, na.rm=TRUE))  |>
  summarize(diff(pmeans))
# A tibble: 1 × 1
  `diff(pmeans)`
           <dbl>
1         -0.188

MacNell Data with permutation

taidgender indicates perceived gender.

diff_means_func <- function(.x){
  macnell  |> group_by(tagender)  |>
  mutate(permTAID = sample(taidgender, replace=FALSE))  |>
  ungroup(tagender)  |>
  group_by(permTAID)  |>
  summarize(pmeans = mean(overall, na.rm=TRUE))  |>
  summarize(diff_mean = diff(pmeans))
  }

map(1:5, diff_means_func) |> 
  list_rbind()
# A tibble: 5 × 1
  diff_mean
      <dbl>
1    0.277 
2    0.188 
3   -0.649 
4    0.0909
5    0.184 

Observed vs. Permuted statistic

# observed
macnell  |> 
  group_by(taidgender)  |>
  summarize(pmeans = mean(overall, na.rm=TRUE))  |>
  summarize(diff_mean = diff(pmeans))
# A tibble: 1 × 1
  diff_mean
      <dbl>
1     0.474
# permuted
set.seed(47)
reps <- 1000
perm_diff_means <- map(1:reps, diff_means_func) |> 
  list_rbind()

MacNell Data with permutation

permutation null sampling distribution:

# permutation p-value
perm_diff_means  |>
  summarize(p_val = 
      sum(diff_mean > 0.474) / 
      reps)
# A tibble: 1 × 1
  p_val
  <dbl>
1 0.048
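
The p-value above is one-sided. As noted earlier, the p-value can also be two-sided; using the same permuted differences and the observed difference of 0.474, a two-sided version could be computed as follows.

# two-sided permutation p-value: permuted differences at least as far
# from zero as the observed difference of 0.474
perm_diff_means  |>
  summarize(p_val_two_sided = sum(abs(diff_mean) >= 0.474) / reps)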

MacNell results

Analysis was done using t-tests. https://doi.org/10.58188/1941-8043.1509

Other Test Statistics

Data                          Hypothesis / Question          Statistic
2 categorical variables       diff in prop                   \(\hat{p}_1 - \hat{p}_2\) or \(\chi^2\)
                              ratio of prop                  \(\hat{p}_1 / \hat{p}_2\)
1 numeric & 1 binary          diff in means                  \(\overline{X}_1 - \overline{X}_2\)
                              ratio of means                 \(\overline{X}_1 / \overline{X}_2\)
                              diff in medians                \(\mbox{median}_1 - \mbox{median}_2\)
                              ratio of medians               \(\mbox{median}_1 / \mbox{median}_2\)
                              diff in SD                     \(s_1 - s_2\)
                              diff in var                    \(s^2_1 - s^2_2\)
                              ratio of SD or var             \(s_1 / s_2\)
1 numeric & k groups          diff in means                  \(\sum n_i (\overline{X}_i - \overline{X})^2\) or F stat
paired or repeated measures   (permute within pair)          \(\overline{X}_1 - \overline{X}_2\)
regression                    correlation                    least squares slope (\(b_1\))
time series                   no serial correlation          lag 1 autocorrelation

Multiple Linear Regression

Consider the following model, where interest is in \(\beta_{1 \cdot 2}\):

\[E[Y] = \beta_{0\cdot 1, 2} + \beta_{1 \cdot 2} X_1 + \beta_{2 \cdot 1} X_2\]

Hypothesis testing

The research question of interest is to understand the relationship between \(E[Y]\) and \(X_1\).

\[H_0: \beta_{1 \cdot 2} = 0\] \[H_A: \beta_{1 \cdot 2} \ne 0\]

For each dataset, we run a least squares regression to get the sample values (parameter estimates):

\[\hat{Y} = b_{0 \cdot 1, 2} + b_{1 \cdot 2} X_1 + b_{2 \cdot 1} X_2\]

Permuting \(Y\) - algorithm

  1. Fit the original model to the data and get the coefficient estimate: \(b_{1\cdot 2}\)

  2. Permute \(Y\) to obtain \(Y^*\).

  3. Fit a model on the permuted \(Y^*\) values to obtain a permuted coefficient estimate: \(b^*_{1\cdot 2}\)

\[\widehat{Y}^* = b^*_{0\cdot1,2} + b^*_{1\cdot2}X_1 + b^*_{2\cdot1}X_2\]

  4. Repeat steps 2 and 3 \(P\) times. For example, \(P\) = 1000.

  5. Create a null sampling distribution of the \(P\) copies of \(b^*_{1\cdot 2}\).

  6. Compare the observed test statistic to the permuted null sampling distribution from step 5.
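
A minimal sketch of this algorithm in R, assuming a hypothetical data frame dat with columns Y, X1, and X2 (none of these names come from the earlier examples):

library(dplyr)
library(purrr)

# step 1: observed coefficient on X1 from the full model
b1_obs <- coef(lm(Y ~ X1 + X2, data = dat))["X1"]

# steps 2-4: permute Y, refit the full model, collect P permuted coefficients
P <- 1000
b1_perm <- map_dbl(1:P, \(i) {
  dat_perm <- dat |> mutate(Y_star = sample(Y, replace = FALSE))
  coef(lm(Y_star ~ X1 + X2, data = dat_perm))["X1"]
})

# steps 5-6: locate the observed coefficient in the permutation null distribution
mean(abs(b1_perm) >= abs(b1_obs))   # two-sided p-value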

Permuting \(Y\) - consequences

  • Indeed, permuting \(Y\) will break the relationship between \(Y\) and \(X_1\), which will force the null hypothesis to be true (which is what we want for testing).

  • However, permuting \(Y\) will also simultaneously break the relationship between \(Y\) and \(X_2,\) which may not be acceptable if we need to preserve the relationship to mirror the original data structure.

Permuting \(X_1\) - algorithm

  1. Fit the original model to the data and get the coefficient estimate: \(b_{1\cdot 2}\)

  2. Permute \(X_1\) to obtain \(X^*_1\).

  3. Fit a model on the permuted \(X_1^*\) values to obtain a permuted coefficient estimate:

\[\widehat{Y} = b^*_{0\cdot1,2} + b^*_{1\cdot2}X^*_1 + b^*_{2\cdot1}X_2\]

  4. Repeat steps 2 and 3 \(P\) times. For example, \(P\) = 1000.

  5. Create a null sampling distribution of the \(P\) copies of \(b^*_{1\cdot 2}\).

  6. Compare the observed test statistic to the permuted null sampling distribution from step 5.
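
The analogous sketch for permuting \(X_1\), continuing with the hypothetical dat, P, and b1_obs from the previous sketch; only the shuffling step changes:

# steps 2-4: permute X1 instead of Y, then refit the full model
b1_perm_X1 <- map_dbl(1:P, \(i) {
  dat_perm <- dat |> mutate(X1_star = sample(X1, replace = FALSE))
  coef(lm(Y ~ X1_star + X2, data = dat_perm))["X1_star"]
})

# steps 5-6: two-sided p-value
mean(abs(b1_perm_X1) >= abs(b1_obs))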

Permuting \(X_1\) - consequences

  • The permutation distribution created from permuting \(X_1\) will force the null hypothesis to be true.

  • However, permuting \(X_1\) has the side effect that the relationship between \(X_1\) and \(X_2\) will be broken in the permuted data.

  • If the data come from, for example, a randomized clinical trial (where \(X_1\) is the treatment variable), then \(X_1\) and \(X_2\) will be independent in the original dataset, and permuting of \(X_1\) will not violate the exchangeability condition.

  • If \(X_1\) and \(X_2\) are correlated in the original dataset, then permuting \(X_1\) violates the exchangeability condition.

Permuting reduced model residuals - algorithm

  1. Fit the reduced model (regress \(Y\) on \(X_2\) only) and obtain the coefficient estimate: \(b_2\)

\[\widehat{Y} = b_{0\cdot2} + b_{2}X_2\]

  2. Let the residuals \(R_{Y\cdot2} = Y - b_{0\cdot2} - b_{2}X_2\), and permute \(R_{Y\cdot2}\) to obtain \(R^*_{Y\cdot2}.\) Define the permuted outcome variable as \(Y^* = b_{0\cdot2} + b_{2}X_2 + R^*_{Y\cdot2}.\)

  3. Fit a model on the permuted \(Y^*\) values (regress \(Y^*\) on both \(X_1\) and \(X_2\)) to obtain a permuted coefficient estimate: \(b^*_{1\cdot2}\) \[\widehat{Y}^* = b^*_{0\cdot1,2} + b^*_{1\cdot2}X_1 + b^*_{2\cdot1}X_2\]

  4. Repeat steps 2 and 3 \(P\) times.

  5. Create a null sampling distribution of the \(P\) copies of \(b^*_{1\cdot 2}\).

  6. Compare the observed test statistic to the permuted null sampling distribution from step 5.
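
A minimal sketch of the residual-permutation algorithm, again with the hypothetical dat, P, and b1_obs from the earlier sketches (and assuming no missing values, so the fitted and residual vectors line up with the rows of dat):

# step 1: reduced model, regressing Y on X2 only
reduced <- lm(Y ~ X2, data = dat)

# steps 2-4: permute the reduced-model residuals, rebuild Y*, refit the full model
b1_perm_resid <- map_dbl(1:P, \(i) {
  dat_perm <- dat |>
    mutate(Y_star = fitted(reduced) + sample(resid(reduced), replace = FALSE))
  coef(lm(Y_star ~ X1 + X2, data = dat_perm))["X1"]
})

# steps 5-6: two-sided p-value
mean(abs(b1_perm_resid) >= abs(b1_obs))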

Permuting reduced model residuals - consequences

  • The permutation preserves the relationship between \(X_1\) & \(X_2\) as well as the relationship between \(X_2\) & \(Y\).

  • However, in order for the relationship between \(X_1\) & \(Y\) to be broken (i.e., to obtain a null sampling distribution for the test of \(H_0: \beta_{1\cdot2} = 0),\) \(X_1\) and \(X_2\) must not be associated.

Takeaways

  • small takeaway – it turns out that even when both the normality conditions and exchangeability are modestly violated, all three of the permutation test methods described above do reasonably well at controlling Type I error rates (Anderson and Legendre 1999; Winkler et al. 2014).

  • big takeaway – we can use the ideas of permutation tests in MLR to understand what exchangeability means. Exchangeability is the technical condition for permutation tests.

  • once you fully understand exchangeability, you’ll be able to perform permutation tests in much more complicated experimental designs.

Relationships

Permutation                        Broken Relationships                                       Preserved Relationships
Permute \(Y\)                      \(X_1\) & \(Y\); \(X_2\) & \(Y\)                           \(X_1\) & \(X_2\)
Permute \(X_1\)                    \(X_1\) & \(X_2\); \(X_1\) & \(Y\)                         \(X_2\) & \(Y\)
Permute reduced model residuals    \(X_1\) & \(Y\) (if \(X_1\) & \(X_2\) are uncorrelated)    \(X_1\) & \(X_2\); \(X_2\) & \(Y\)

References

Anderson, Marti J., and Pierre Legendre. 1999. “An Empirical Comparison of Permutation Methods for Tests of Partial Regression Coefficients in a Linear Model.” Journal of Statistical Computation and Simulation 62 (3): 271–303. https://doi.org/10.1080/00949659908811936.
Winkler, A. M., G. R. Ridgway, M. A. Webster, S. M. Smith, and T. E. Nichols. 2014. “Permutation Inference for the General Linear Model.” NeuroImage 92: 381–97.