Permutation Tests

September 22 + 24 + 29, 2025

Jo Hardin

Agenda 9/22/25

  1. Logic of hypothesis testing
  2. Logic of permutation tests
  3. Examples - testing \(k\) means

Statistics Without the Agonizing Pain

John Rauser of Pinterest (now Amazon), delivering a keynote address on permutation tests at Strata + Hadoop 2014. https://blog.revolutionanalytics.com/2014/10/statistics-doesnt-have-to-be-that-hard.html

Logic of hypothesis tests

  1. Choose a statistic that measures the effect

  2. Construct the sampling distribution under \(H_0\) (almost always done using technical mathematics)

  3. Locate the observed statistic in the null sampling distribution

  4. p-value is the probability of the observed data or more extreme if the null hypothesis is true

Logic of permutation tests

  1. Choose a test statistic

  2. Shuffle the data (force the null hypothesis to be true)

  3. Create a null sampling distribution of the test statistic (under \(H_0\)) (done using the computer, not calculus)

  4. Find the observed test statistic on the null sampling distribution and compute the p-value (observed data or more extreme). The p-value can be one- or two-sided.
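
As a preview, the four steps can be sketched in R (in the tidyverse style used throughout these notes). This is only a sketch: the data frame dat and its columns outcome (numeric) and group (two levels) are hypothetical placeholders, and the test statistic is a difference in means.

library(dplyr)
library(purrr)

# 1. test statistic: difference in group means on the observed data
obs_stat <- dat |>
  group_by(group) |>
  summarize(m = mean(outcome)) |>
  summarize(stat = diff(m)) |>
  pull()

# 2 & 3. shuffle the response (forcing H0) and recompute the statistic many times
null_stats <- map_dbl(1:1000, \(i) {
  dat |>
    mutate(outcome_perm = sample(outcome, replace = FALSE)) |>
    group_by(group) |>
    summarize(m = mean(outcome_perm)) |>
    summarize(stat = diff(m)) |>
    pull()
})

# 4. p-value: proportion of permuted statistics at least as extreme (two-sided)
mean(abs(null_stats) >= abs(obs_stat))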

Consider the NHANES dataset.

  • Income
    • (HHIncomeMid - Numerical version of HHIncome derived from the middle income in each category)
  • Health
    • (HealthGen - Self-reported rating of the participant’s health in general. Reported for participants aged 12 years or older. One of Excellent, Vgood, Good, Fair, or Poor.)

Summary of the variables of interest

NHANES  |> 
  group_by(HealthGen)  |> 
  summarize(skimr::skim_without_charts(HHIncomeMid)) |> 
  select(HealthGen, n_missing, numeric.mean, numeric.sd, numeric.p0, numeric.p50, numeric.p100) |> 
  gt::gt()
HealthGen   n_missing   numeric.mean   numeric.sd   numeric.p0   numeric.p50   numeric.p100
Excellent          61       69354.35     32131.03         2500         87500         100000
Vgood             166       65010.67     32071.03         2500         70000         100000
Good              212       55662.35     31320.81         2500         50000         100000
Fair              111       44193.55     30688.83         2500         30000         100000
Poor               23       37027.44     29396.50         2500         30000         100000
NA                238       53175.89     33978.71         2500         50000         100000

Mean Income broken down by Health

NH.means <- NHANES  |> 
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())
NH.means
# A tibble: 5 × 3
  HealthGen IncMean count
  <fct>       <dbl> <int>
1 Excellent  69354.   817
2 Vgood      65011.  2342
3 Good       55662.  2744
4 Fair       44194.   899
5 Poor       37027.   164

Are the differences in means simply due to random chance?

Income and Health

NHANES  |> filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |> 
ggplot(aes(x=HealthGen, y=HHIncomeMid)) + geom_boxplot()

Differences in Income ($)

           Excellent      Vgood      Good      Fair      Poor
Excellent      0.000   4343.671  13691.99 25160.797 32326.906
Vgood      -4343.671      0.000   9348.32 20817.126 27983.236
Good      -13691.991  -9348.320      0.00 11468.806 18634.915
Fair      -25160.797 -20817.126 -11468.81     0.000  7166.109
Poor      -32326.906 -27983.236 -18634.92 -7166.109     0.000

Overall difference

We can measure the overall differences as the amount of variability between each of the group means and the overall mean:

\[F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_i n_i(\overline{X}_{i\cdot} - \overline{X})^2/(K-1)}{\sum_{ij} (X_{ij}-\overline{X}_{i\cdot})^2/(N-K)}\]

Because permuting the data does not change the total variability, the between-group sum of squares orders the permutations in the same way as \(F\), so we use it as the test statistic:

\[SumSqBtwn = \sum_i n_i(\overline{X}_{i\cdot} - \overline{X})^2\]

Creating a test statistic

NHANES  |> select(HHIncomeMid, HealthGen)  |> 
  filter(!is.na(HealthGen)& !is.na(HHIncomeMid))
# A tibble: 6,966 × 2
   HHIncomeMid HealthGen
         <int> <fct>    
 1       30000 Good     
 2       30000 Good     
 3       30000 Good     
 4       40000 Good     
 5       87500 Vgood    
 6       87500 Vgood    
 7       87500 Vgood    
 8       30000 Vgood    
 9      100000 Vgood    
10       70000 Fair     
# ℹ 6,956 more rows

Creating a test statistic

The pull() function creates a vector (or single value) instead of a data frame.

GM <- NHANES  |> summarize(mean(HHIncomeMid, na.rm=TRUE))  |> pull()
GM
[1] 57206.17
NH.means
# A tibble: 5 × 3
  HealthGen IncMean count
  <fct>       <dbl> <int>
1 Excellent  69354.   817
2 Vgood      65011.  2342
3 Good       55662.  2744
4 Fair       44194.   899
5 Poor       37027.   164

Creating a test statistic

I’ve tried to break down the process of creating the test statistic, but the syntax is slightly different from what we would do in the actual tidy pipeline. It might be easier to follow the full calculation of the observed test statistic two slides ahead.

NH.means  |> select(IncMean)  |> pull() - GM
[1]  12148.175   7804.504  -1543.816 -13012.622 -20178.731
(NH.means  |> select(IncMean)  |> pull() - GM)^2
[1] 147578150  60910286   2383368 169328332 407181201
NH.means  |> select(count)  |> pull()
[1]  817 2342 2744  899  164
NH.means  |> select(count)  |> pull() * 
  (NH.means  |> select(IncMean)  |> pull() - GM)^2
[1] 120571348234 142651889943   6539963000 152226170649  66777716928
sum(NH.means  |> select(count)  |> pull() * 
  (NH.means  |> select(IncMean)  |> pull() - GM)^2)
[1] 488767088754

Creating a test statistic

\[SumSqBtwn = \sum_i n_i(\overline{X}_{i\cdot} - \overline{X})^2\]

sum(NH.means  |> select(count)  |> pull() * 
      (NH.means  |> select(IncMean)  |> pull() - GM)^2)
[1] 488767088754

The observed test statistic

GM <- NHANES  |> summarize(mean(HHIncomeMid, na.rm=TRUE))  |> pull()
GM
[1] 57206.17
SSB_obs <- NHANES  |>
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())  |>
  summarize(obs_teststat = sum(count*(IncMean - GM)^2)) 

SSB_obs
# A tibble: 1 × 1
   obs_teststat
          <dbl>
1 488767088754.

Permuting the data

NHANES  |> 
  filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |>
  mutate(IncomePerm = sample(HHIncomeMid, replace=FALSE))  |>
  select(HealthGen, HHIncomeMid, IncomePerm) 
# A tibble: 6,966 × 3
   HealthGen HHIncomeMid IncomePerm
   <fct>           <int>      <int>
 1 Good            30000      50000
 2 Good            30000      12500
 3 Good            30000     100000
 4 Good            40000      30000
 5 Vgood           87500      50000
 6 Vgood           87500      70000
 7 Vgood           87500       7500
 8 Vgood           30000     100000
 9 Vgood          100000      40000
10 Fair            70000     100000
# ℹ 6,956 more rows

Permuting the data & a new test statistic

NHANES  |> 
  filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |>
  mutate(IncomePerm = sample(HHIncomeMid, replace=FALSE))  |>
  group_by(HealthGen)  |> 
  summarize(IncMeanP = mean(IncomePerm), count=n())  |>
  summarize(teststat = sum(count*(IncMeanP - GM)^2))
# A tibble: 1 × 1
      teststat
         <dbl>
1 21622040975.

Lots of times…

reps <- 1000

SSB_perm_func <- function(.x){
  NHANES  |> 
        filter(!is.na(HealthGen)& !is.na(HHIncomeMid))  |>
        mutate(IncomePerm = sample(HHIncomeMid, replace=FALSE))  |>
        group_by(HealthGen)  |> 
        summarize(IncMeanP = mean(IncomePerm), count=n())  |>
        summarize(teststat = sum(count*(IncMeanP - GM)^2)) 
}

SSB_perm_val <- map(1:reps, SSB_perm_func) |> 
  list_rbind()

SSB_perm_val
# A tibble: 1,000 × 1
       teststat
          <dbl>
 1 15485792584.
 2 13404483026.
 3 20777315121.
 4 15049867857.
 5 12546504102.
 6 15459609593.
 7 20169837162.
 8 15372407832.
 9 21775634127.
10 14464305913.
# ℹ 990 more rows

Compared to the real data

SSB_obs <- NHANES  |>
  filter(!is.na(HealthGen) & !is.na(HHIncomeMid))  |> 
  group_by(HealthGen)  |> 
  summarize(IncMean = mean(HHIncomeMid), count=n())  |>
  summarize(obs_teststat = sum(count*(IncMean - GM)^2)) 

SSB_obs 
# A tibble: 1 × 1
   obs_teststat
          <dbl>
1 488767088754.
sum(SSB_perm_val  |> pull() > SSB_obs  |> pull() ) / reps
[1] 0

None of the 1000 permuted statistics was as large as the observed statistic, so the estimated p-value is 0 (that is, p < 1/1000).

Compared to the observed test statistic

Agenda 9/24/25

  1. Conditions, exchangeability, random structure
  2. Different structures and statistics

Exchangeability

If the null hypothesis is true, the labels assigning groups are interchangeable with respect to the probability distribution.

Typically (in the two-group setting),

\[H_0: F_1(x) = F_2(x)\]

(there are no distributional or parametric conditions)

Exchangeability

More generally, we might use the following exchangeability definition

Data are exchangeable under the null hypothesis if the joint distribution from which the data came is the same before permutation as after permutation when the null hypothesis is true.

Probability as measured by what?

  • Random Sample The concept of a p-value usually comes from the idea of taking a sample from a population and comparing it to a sampling distribution (built from many, many random samples).

  • Randomized Experiment The p-value compares the observed data to what would have happened if the treatment variable had been allocated to the groups “by chance.”

Permuting independent observations

Consider a “family” structure where some individuals are exposed and others are not (control).

Permuting homogeneous clusters

Consider a “family” structure where individuals in a cluster always have the same treatment.

Permuting heterogeneous clusters

Consider a “family” structure where individuals in a cluster always have the opposite treatment.
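
The three structures imply three different ways to shuffle. Here is a minimal sketch; everything in it is hypothetical (made-up data frames with a family id and a treatment indicator trt).

library(dplyr)

# hypothetical example: four families ("clusters") with two people each
fam_homog <- tibble(
  family = rep(1:4, each = 2),
  trt    = rep(c(1, 0, 1, 0), each = 2)   # same treatment within each family
)
fam_heterog <- tibble(
  family = rep(1:4, each = 2),
  trt    = rep(c(1, 0), times = 4)        # opposite treatments within each family
)

# independent observations: shuffle treatment across all individuals
fam_homog |>
  mutate(trt_perm = sample(trt, replace = FALSE))

# homogeneous clusters: permute the treatment label across whole families
fam_homog |>
  select(family) |>
  left_join(
    fam_homog |>
      distinct(family, trt) |>
      mutate(trt_perm = sample(trt, replace = FALSE)),
    by = "family"
  )

# heterogeneous clusters: permute treatment within each family
# (each pair's labels are either kept or swapped)
fam_heterog |>
  group_by(family) |>
  mutate(trt_perm = sample(trt, replace = FALSE)) |>
  ungroup()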

Agenda 9/29/25

  • Exchangeability
  • Nested permutations – MacNell evaluations
  • Multiple linear regression

Gender bias in teaching evaluations

The Economist, Sep 21, 2017

Gender bias in teaching evaluations

Gender bias was seen in a study of 20,000 students. The students knew the gender of their professor but did not choose their class. https://doi.org/10.1093/jeea/jvx057

Gender bias in teaching evaluations

86 students were randomly assigned to TA sections. Innovative Higher Education, 40, pages 291–303 (2015). https://doi.org/10.58188/1941-8043.1509

Gender bias in teaching evaluations

Gender bias: MacNell data

temp <- macnell |> 
  group_by(tagender) |> 
  mutate(perm = sample(taidgender, replace = FALSE)) |> 
  select(overall,tagender, taidgender, perm) |> 
  arrange(tagender)

Analysis goal

We want to know whether the population average score differs across perceived gender.

\[H_0: \mu_{ID.Female} = \mu_{ID.Male}\]

Note that for the permutation test, under the null hypothesis not only are the means of the population distributions the same, but so are the variances and all other aspects of the distributions across perceived gender.

MacNell Data without permutation

taidgender indicates perceived gender.

macnell  |>
  select(overall, tagender, taidgender) 
# A tibble: 47 × 3
   overall tagender taidgender
     <dbl>    <dbl>      <dbl>
 1       4        0          1
 2       4        0          1
 3       5        0          1
 4       5        0          1
 5       5        0          1
 6       4        0          1
 7       4        0          1
 8       5        0          1
 9       4        0          1
10       3        0          1
# ℹ 37 more rows

Permuting MacNell data

Conceptually, there are two levels of randomization:

  1. \(N_m\) students are randomly assigned to the male instructor and \(N_f\) are assigned to the female instructor.

  2. Of the \(N_j\) assigned to instructor \(j\), \(N_{jm}\) are told that the instructor is male, and \(N_{jf}\) are told that the instructor is female for \(j=m,f\).

taidgender indicates perceived gender.

macnell  |>
  group_by(tagender, taidgender)  |>
  summarize(n())
# A tibble: 4 × 3
# Groups:   tagender [2]
  tagender taidgender `n()`
     <dbl>      <dbl> <int>
1        0          0    11
2        0          1    12
3        1          0    13
4        1          1    11

Stratified two-sample test:

  • For each instructor, permute perceived gender assignments.
  • Use difference in mean ratings for female-identified vs male-identified instructors.

MacNell Data with permutation

taidgender indicates perceived gender.

macnell  |> 
  group_by(tagender)  |>
  mutate(permTAID = sample(taidgender, replace=FALSE))  |>
  select(overall, tagender, taidgender, permTAID) 
# A tibble: 47 × 4
# Groups:   tagender [2]
   overall tagender taidgender permTAID
     <dbl>    <dbl>      <dbl>    <dbl>
 1       4        0          1        0
 2       4        0          1        1
 3       5        0          1        0
 4       5        0          1        1
 5       5        0          1        1
 6       4        0          1        1
 7       4        0          1        0
 8       5        0          1        0
 9       4        0          1        0
10       3        0          1        0
# ℹ 37 more rows

MacNell Data with permutation

taidgender indicates perceived gender.

macnell  |> 
  group_by(tagender)  |>
  mutate(permTAID = sample(taidgender, replace=FALSE))  |>
  ungroup(tagender)  |>
  group_by(permTAID)  |>
  summarize(pmeans = mean(overall, na.rm=TRUE))  |>
  summarize(diff(pmeans))
# A tibble: 1 × 1
  `diff(pmeans)`
           <dbl>
1         -0.188

MacNell Data with permutation

taidgender indicates perceived gender.

diff_means_func <- function(.x){
  macnell  |> group_by(tagender)  |>
  mutate(permTAID = sample(taidgender, replace=FALSE))  |>
  ungroup(tagender)  |>
  group_by(permTAID)  |>
  summarize(pmeans = mean(overall, na.rm=TRUE))  |>
  summarize(diff_mean = diff(pmeans))
  }

map(1:5, diff_means_func) |> 
  list_rbind()
# A tibble: 5 × 1
  diff_mean
      <dbl>
1    0.277 
2    0.188 
3   -0.649 
4    0.0909
5    0.184 

Observed vs. Permuted statistic

# observed
macnell  |> 
  group_by(taidgender)  |>
  summarize(pmeans = mean(overall, na.rm=TRUE))  |>
  summarize(diff_mean = diff(pmeans))
# A tibble: 1 × 1
  diff_mean
      <dbl>
1     0.474
# permuted
set.seed(47)
reps <- 1000
perm_diff_means <- map(1:reps, diff_means_func) |> 
  list_rbind()

MacNell Data with permutation

permutation null sampling distribution:

# permutation p-value
perm_diff_means  |>
  summarize(p_val = 
      sum(diff_mean > 0.474) / 
      reps)
# A tibble: 1 × 1
  p_val
  <dbl>
1 0.048
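
The p-value above is one-sided. As noted earlier, the p-value can also be two-sided; using the same permuted differences and the observed difference of 0.474, a two-sided version could be computed as follows.

# two-sided permutation p-value: permuted differences at least as far
# from zero as the observed difference of 0.474
perm_diff_means  |>
  summarize(p_val_two_sided = sum(abs(diff_mean) >= 0.474) / reps)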

MacNell results

Analysis was done using t-tests. https://doi.org/10.58188/1941-8043.1509

Other Test Statistics

Data                          Hypothesis / Question          Statistic
2 categorical variables       diff in prop                   \(\hat{p}_1 - \hat{p}_2\) or \(\chi^2\)
                              ratio of prop                  \(\hat{p}_1 / \hat{p}_2\)
1 numeric & 1 binary          diff in means                  \(\overline{X}_1 - \overline{X}_2\)
                              ratio of means                 \(\overline{X}_1 / \overline{X}_2\)
                              diff in medians                \(\mbox{median}_1 - \mbox{median}_2\)
                              ratio of medians               \(\mbox{median}_1 / \mbox{median}_2\)
                              diff in SD                     \(s_1 - s_2\)
                              diff in var                    \(s^2_1 - s^2_2\)
                              ratio of SD or var             \(s_1 / s_2\)
1 numeric & k groups          diff in means                  \(\sum n_i (\overline{X}_i - \overline{X})^2\) or F stat
paired or repeated measures   (permute within pair)          \(\overline{X}_1 - \overline{X}_2\)
regression                    correlation                    least squares slope (\(b_1\))
time series                   no serial correlation          lag 1 autocorrelation

Multiple Linear Regression

Consider the following model, where interest is in \(\beta_{1 \cdot 2}\):

\[E[Y] = \beta_{0\cdot 1, 2} + \beta_{1 \cdot 2} X_1 + \beta_{2 \cdot 1} X_2\]

Hypothesis testing

The research question of interest is to understand the relationship between \(E[Y]\) and \(X_1\).

\[H_0: \beta_{1 \cdot 2} = 0\] \[H_A: \beta_{1 \cdot 2} \ne 0\]

For each dataset, we run a least squares regression to get the sample values (parameter estimates):

\[\hat{Y} = b_{0 \cdot 1, 2} + b_{1 \cdot 2} X_1 + b_{2 \cdot 1} X_2\]

Permuting \(Y\) - algorithm

  1. Fit the original model to the data and get the coefficient estimate: \(b_{1\cdot 2}\)

  2. Permute \(Y\) to obtain \(Y^*\).

  3. Fit a model on the permuted \(Y^*\) values to obtain a permuted coefficient estimate: \(b^*_{1\cdot 2}\)

\[\widehat{Y}^* = b^*_{0\cdot1,2} + b^*_{1\cdot2}X_1 + b^*_{2\cdot1}X_2\]

  4. Repeat steps 2 and 3 \(P\) times. For example, \(P\) = 1000.

  5. Create a null sampling distribution of the \(P\) copies of \(b^*_{1\cdot 2}\).

  6. Compare the observed test statistic to the permuted null sampling distribution from step 5.
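
A minimal sketch of this algorithm in R, assuming a hypothetical data frame dat with columns Y, X1, and X2 (none of these names come from the earlier examples):

library(dplyr)
library(purrr)

# step 1: observed coefficient on X1 from the full model
b1_obs <- coef(lm(Y ~ X1 + X2, data = dat))["X1"]

# steps 2-4: permute Y, refit the full model, collect P permuted coefficients
P <- 1000
b1_perm <- map_dbl(1:P, \(i) {
  dat_perm <- dat |> mutate(Y_star = sample(Y, replace = FALSE))
  coef(lm(Y_star ~ X1 + X2, data = dat_perm))["X1"]
})

# steps 5-6: locate the observed coefficient in the permutation null distribution
mean(abs(b1_perm) >= abs(b1_obs))   # two-sided p-value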

Permuting \(Y\) - consequences

  • Indeed, permuting \(Y\) will break the relationship between \(Y\) and \(X_1\), which will force the null hypothesis to be true (which is what we want for testing).

  • However, permuting \(Y\) will also simultaneously break the relationship between \(Y\) and \(X_2,\) which may not be acceptable if we need to preserve the relationship to mirror the original data structure.

Permuting \(X_1\) - algorithm

  1. Fit the original model to the data and get the coefficient estimate: \(b_{1\cdot 2}\)

  2. Permute \(X_1\) to obtain \(X^*_1\).

  3. Fit a model on the permuted \(X_1^*\) values to obtain a permuted coefficient estimate:

\[\widehat{Y} = b^*_{0\cdot1,2} + b^*_{1\cdot2}X^*_1 + b^*_{2\cdot1}X_2\]

  4. Repeat steps 2 and 3 \(P\) times. For example, \(P\) = 1000.

  5. Create a null sampling distribution of the \(P\) copies of \(b^*_{1\cdot 2}\).

  6. Compare the observed test statistic to the permuted null sampling distribution from step 5.
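
The analogous sketch for permuting \(X_1\), continuing with the hypothetical dat, P, and b1_obs from the previous sketch; only the shuffling step changes:

# steps 2-4: permute X1 instead of Y, then refit the full model
b1_perm_X1 <- map_dbl(1:P, \(i) {
  dat_perm <- dat |> mutate(X1_star = sample(X1, replace = FALSE))
  coef(lm(Y ~ X1_star + X2, data = dat_perm))["X1_star"]
})

# steps 5-6: two-sided p-value
mean(abs(b1_perm_X1) >= abs(b1_obs))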

Permuting \(X_1\) - consequences

  • The permutation distribution created from permuting \(X_1\) will force the null hypothesis to be true.

  • However, permuting \(X_1\) has the side effect that the relationship between \(X_1\) and \(X_2\) will be broken in the permuted data.

  • If the data come from, for example, a randomized clinical trial (where \(X_1\) is the treatment variable), then \(X_1\) and \(X_2\) will be independent in the original dataset, and permuting of \(X_1\) will not violate the exchangeability condition.

  • If \(X_1\) and \(X_2\) are correlated in the original dataset, then permuting \(X_1\) violates the exchangeability condition.

Permuting reduced model residuals - algorithm

  1. Fit the reduced model (regress \(Y\) on \(X_2\) only) and obtain the coefficient estimate: \(b_2\)

\[\widehat{Y} = b_{0\cdot2} + b_{2}X_2\]

  2. Let the residuals \(R_{Y\cdot2} = Y - b_{0\cdot2} - b_{2}X_2\), and permute \(R_{Y\cdot2}\) to obtain \(R^*_{Y\cdot2}.\) Define the permuted outcome variable as \(Y^* = b_{0\cdot2} + b_{2}X_2 + R^*_{Y\cdot2}.\)

  3. Fit a model on the permuted \(Y^*\) values (regress \(Y^*\) on both \(X_1\) and \(X_2\)) to obtain a permuted coefficient estimate: \(b^*_{1\cdot2}\) \[\widehat{Y}^* = b^*_{0\cdot1,2} + b^*_{1\cdot2}X_1 + b^*_{2\cdot1}X_2\]

  4. Repeat steps 2 and 3 \(P\) times.

  5. Create a null sampling distribution of the \(P\) copies of \(b^*_{1\cdot 2}\).

  6. Compare the observed test statistic to the permuted null sampling distribution from step 5.
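
A minimal sketch of the residual-permutation algorithm, again with the hypothetical dat, P, and b1_obs from the earlier sketches (and assuming no missing values, so the fitted and residual vectors line up with the rows of dat):

# step 1: reduced model, regressing Y on X2 only
reduced <- lm(Y ~ X2, data = dat)

# steps 2-4: permute the reduced-model residuals, rebuild Y*, refit the full model
b1_perm_resid <- map_dbl(1:P, \(i) {
  dat_perm <- dat |>
    mutate(Y_star = fitted(reduced) + sample(resid(reduced), replace = FALSE))
  coef(lm(Y_star ~ X1 + X2, data = dat_perm))["X1"]
})

# steps 5-6: two-sided p-value
mean(abs(b1_perm_resid) >= abs(b1_obs))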

Permuting reduced model residuals - consequences

  • The permutation preserves the relationship between \(X_1\) & \(X_2\) as well as the relationship between \(X_2\) & \(Y\).

  • However, in order for the relationship between \(X_1\) & \(Y\) to be broken (i.e., to obtain a null sampling distribution for the test of \(H_0: \beta_{1\cdot2} = 0),\) \(X_1\) and \(X_2\) must not be associated.

Takeaways

  • small takeaway – it turns out that even when both the normality conditions and exchangeability are modestly violated, all three of the permutation test methods described above do reasonably well at controlling Type I error rates (Anderson and Legendre 1999; Winkler et al. 2014).

  • big takeaway – we can use the ideas of permutation tests in MLR to understand what exchangeability means. Exchangeability is the technical condition for permutation tests.

  • once you fully understand exchangeability, you’ll be able to perform permutation tests in much more complicated experimental designs.

Relationships

Permutation                        Broken Relationships                                       Preserved Relationships
Permute \(Y\)                      \(X_1\) & \(Y\); \(X_2\) & \(Y\)                           \(X_1\) & \(X_2\)
Permute \(X_1\)                    \(X_1\) & \(X_2\); \(X_1\) & \(Y\)                         \(X_2\) & \(Y\)
Permute reduced model residuals    \(X_1\) & \(Y\) (if \(X_1\) & \(X_2\) are uncorrelated)    \(X_1\) & \(X_2\); \(X_2\) & \(Y\)

References

Anderson, Marti J., and Pierre Legendre. 1999. “An Empirical Comparison of Permutation Methods for Tests of Partial Regression Coefficients in a Linear Model.” Journal of Statistical Computation and Simulation 62 (3): 271–303. https://doi.org/10.1080/00949659908811936.
Winkler, A. M., G. R. Ridgway, M. A. Webster, S. M. Smith, and T. E. Nichols. 2014. “Permutation Inference for the General Linear Model.” NeuroImage 92: 381–97.