Bootstrapping

class: right, top, my-title, title-slide

# Bootstrapping
### Jo Hardin
### October 3 & 5, 2021

---

# Agenda 10/5/21

0. Finish power in the permutation test context
1. Review: logic of confidence intervals
2. Logic of bootstrapping (resample from the sample *with* replacement)
3. BS SE of a statistic

---

## Why bootstrap?

Motivation:

> to estimate the variability and distribution of a statistic in repeated samples of size `\(n\)` (*not* dependent on `\(H_0\)` being true).

---

## Variability

* standard deviation of the **data**: `\(s = \sqrt{\frac{\sum_{i=1}^n(X_i - \overline{X})^2}{n-1}}\)`

* standard error of the **statistic**: depends...

---
## Intuitive understanding

See in the applets for an intuitive understanding of both confidence intervals and boostrapping:

* <a href = "http://www.lock5stat.com/StatKey/" target = "_blank">StatKey applets</a> which demonstrate bootstrapping.

* <a href = "http://www.rossmanchance.com/applets/ConfSim.html" target = "_blank">Confidence interval logic</a> from the Rossman & Chance applets.

---

## Basic Notation

(n.b., we don't ever do what is on this slide)

Let `\(\theta\)` be the parameter of interest, and let `\(\hat{\theta}\)` be the estimate of `\(\theta\)`.  If we could, we'd take many samples of size `\(n\)` from the population to create a **sampling distribution** for `\(\hat{\theta}\)`.  Consider taking `\(B\)` random samples from the population.

`\begin{align}
\hat{\theta}(\cdot) = \frac{1}{B} \sum_{i=1}^B \hat{\theta}_i
\end{align}`
is the best guess for `\(\theta\)`.  If `\(\hat{\theta}\)` is very different from `\(\theta\)`, we would call it **biased**.
`\begin{align}
SE(\hat{\theta}) &= \bigg[ \frac{1}{B-1} \sum_{i=1}^B(\hat{\theta}_i - \hat{\theta}(\cdot))^2 \bigg]^{1/2}\\
q_1 &= [0.25 B] \ \ \ \ \hat{\theta}^{(q_1)} = \mbox{25}\% \mbox{ cutoff}\\
q_3 &= [0.75 B] \ \ \ \ \hat{\theta}^{(q_3)} = \mbox{75}\% \mbox{ cutoff}\\
\end{align}`

---

## Ideally

(we never do part (a))

<div class="figure" style="text-align: center">
<img src="../images/BSlogic.png" alt="From Hesterberg et al., Chapter 16 of Introduction to the Practice of Statistics by  Moore, McCabe, and Craig" width="60%" />
<p class="caption">From Hesterberg et al., Chapter 16 of Introduction to the Practice of Statistics by  Moore, McCabe, and Craig</p>
</div>

---

## Bootstrap Procedure

1. Resample data **with replacement** from the original sample.
2. Calculate the statistic of interest for each resample.
3. Repeat 1. and 2. `\(B\)` times.
4. Use the bootstrap distribution for inference.

---

## Bootstrap Notation

(n.b., bootstrapping is the process on this slide)

Take many ( `\(B\)` ) resamples of size `\(n\)` from the sample to create a bootstrap distribution for `\(\hat{\theta}^*\)` (instead of the sampling distribution for `\(\hat{\theta}\)`).

Let `\(\hat{\theta}^*(b)\)` be the calculated statistic of interest for the `\(b^{th}\)` bootstrap sample.  The best guess for `\(\theta\)` is:
`\begin{align}
\hat{\theta}^* = \frac{1}{B} \sum_{b=1}^B \hat{\theta}^*(b)
\end{align}`
(if `\(\hat{\theta}^*\)` is very different from `\(\hat{\theta}\)`, we call it biased.)  And the estimated value for the standard error of the estimate is
`\begin{align}
\hat{SE}^* = \bigg[ \frac{1}{B-1} \sum_{b=1}^B ( \hat{\theta}^*(b) - \hat{\theta}^*)^2 \bigg]^{1/2}
\end{align}`

---

## What do we get?

Just like repeatedly taking samples from the population, taking resamples from the sample allows us to characterize the bootstrap distribution which approximates the sampling distribution.

> The bootstrap distribution approximates the **shape, spread, & bias** of the true sampling distribution.

---

<div class="figure" style="text-align: center">
<img src="../images/BShesterberg1.png" alt="From Hesterberg et al., Chapter 16 of Introduction to the Practice of Statistics by  Moore, McCabe, and Craig.  The left image represents the mean with n=50.  The center image represents the mean with n=9.  The right image represents the median with n=15." width="30%" /><img src="../images/BShesterberg2.png" alt="From Hesterberg et al., Chapter 16 of Introduction to the Practice of Statistics by  Moore, McCabe, and Craig.  The left image represents the mean with n=50.  The center image represents the mean with n=9.  The right image represents the median with n=15." width="30%" /><img src="../images/BShesterberg3.png" alt="From Hesterberg et al., Chapter 16 of Introduction to the Practice of Statistics by  Moore, McCabe, and Craig.  The left image represents the mean with n=50.  The center image represents the mean with n=9.  The right image represents the median with n=15." width="30%" />
<p class="caption">From Hesterberg et al., Chapter 16 of Introduction to the Practice of Statistics by  Moore, McCabe, and Craig.  The left image represents the mean with n=50.  The center image represents the mean with n=9.  The right image represents the median with n=15.</p>
</div>

---

## Example
- Hesketh and Everitt (2000) report on a study by Caplehorn and Bell (1991) that investigated the time (days) spent in a clinic for methadone maintenance treatment for people addicted to heroin.

- The data include the amount of time that the subjects stayed in the facility until treatment was terminated.

- For about 37% of the subjects, the study ended while they were still the in clinic (status=0).

- Their survival time has been truncated.  For this reason we might not want to estimate the mean survival time, but rather some other measure of typical survival time.  Below we explore using the median as well as the 25% trimmed mean.   (From ISCAM Chance & Rossman, Investigation 4.5.3)

---

## The data

```
## # A tibble: 238 × 5
##       id clinic status times  dose
##    <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1     1      1      1   428    50
##  2     2      1      1   275    55
##  3     3      1      1   262    55
##  4     4      1      1   183    30
##  5     5      1      1   259    65
##  6     6      1      1   714    55
##  7     7      1      1   438    65
##  8     8      1      0   796    60
##  9     9      1      1   892    50
## 10    10      1      1   393    65
## # … with 228 more rows
```

---

## Observed Test Statistic(s)

```r
heroin %>%
  summarize(obs_med = median(times), 
            obs_tr_mean = mean(times, trim = 0.25))
```

```
## # A tibble: 1 × 2
##   obs_med obs_tr_mean
##     <dbl>       <dbl>
## 1   367.5       378.3
```

---

## Bootstrapped data!

```r
set.seed(4747)

heroin %>% 
  sample_frac(size=1, replace=TRUE) %>%
  summarize(boot_med = median(times), 
            boot_tr_mean = mean(times, trim = 0.25))
```

```
## # A tibble: 1 × 2
##   boot_med boot_tr_mean
##      <dbl>        <dbl>
## 1      368       372.26
```

---

##  Need to bootstrap a lot of times...
Bootstrapping with `map`.

.panelset[

.panel[.panel-name[set variables]

```r
n_rep1 <- 100
n_rep2 <- 20
set.seed(4747)
```
]

.panel[.panel-name[boot stat function]

```r
boot_stat_func <- function(df){ 
	df %>% 
    mutate(obs_med = median(times),
           obs_tr_mean = mean(times, trim = 0.25)) %>%
    sample_frac(size=1, replace=TRUE) %>%
    summarize(boot_med = median(times), 
              boot_tr_mean = mean(times, trim = 0.25),
              obs_med = mean(obs_med),
              obs_tr_mean = mean(obs_tr_mean))}
```
]

.panel[.panel-name[resample function]

```r
boot_1_func <- function(df){
  df %>% 
    sample_frac(size=1, replace=TRUE)
}
```
]

.panel[.panel-name[bootstrapping]

```r
map_df(1:n_rep1, ~boot_stat_func(df = heroin))
```

```
## # A tibble: 100 × 4
##    boot_med boot_tr_mean obs_med obs_tr_mean
##       <dbl>        <dbl>   <dbl>       <dbl>
##  1    368         372.26   367.5       378.3
##  2    358         362.78   367.5       378.3
##  3    431         420.62   367.5       378.3
##  4    331.5       349.62   367.5       378.3
##  5    310.5       330.78   367.5       378.3
##  6    376         382.3    367.5       378.3
##  7    366         365.02   367.5       378.3
##  8    378.5       381.9    367.5       378.3
##  9    394         386.1    367.5       378.3
## 10    391.5       402.38   367.5       378.3
## # … with 90 more rows
```
]

]

---

## What do the **data** distributions look like?

![](2021-10-05-bootstrap_files/figure-html/unnamed-chunk-12-1.png)![](2021-10-05-bootstrap_files/figure-html/unnamed-chunk-12-2.png)

---

## What do the **sampling** distributions look like?

Both the median and the trimmed mean are reasonably symmetric and bell-shaped.

.pull-left[
![](2021-10-05-bootstrap_files/figure-html/unnamed-chunk-13-1.png)
]

.pull-right[
![](2021-10-05-bootstrap_files/figure-html/unnamed-chunk-14-1.png)
]

---

# Agenda 10/7/21

1. Normal CI using BS SE
2. Bootstrap-t (studentized) CIs 
3. Percentile CIs
4. properties / advantages / disadvantages

---

## Technical derivations

See in class notes on [bootstrapping](http://st47s.com/Math154/Notes/boot.html) for the technical details on how to create different bootstrap intervals.

---

## Bootstrapping with `map`

.panelset[

.panel[.panel-name[set variables]

```r
n_rep1 <- 100
set.seed(4747)
```
]

.panel[.panel-name[boot stat function]

.panel[.panel-name[bootstrap!]

```r
boot_stats <- map_df(1:n_rep1, ~boot_stat_func(df = heroin))

boot_stats
```

]

---

## 95% normal CI with BS SE

```r
boot_stats %>%
  summarize(low_med = mean(obs_med) + qnorm(0.025) * sd(boot_med), 
         up_med = mean(obs_med) + qnorm(0.975) * sd(boot_med),
         low_tr_mean = mean(obs_tr_mean) + qnorm(0.025) * sd(boot_tr_mean), 
         up_tr_mean = mean(obs_tr_mean) + qnorm(0.975) * sd(boot_tr_mean))
```

```
## # A tibble: 1 × 4
##   low_med up_med low_tr_mean up_tr_mean
##     <dbl>  <dbl>       <dbl>      <dbl>
## 1  309.99 425.01      336.87     419.73
```

---

## 95% Percentile CI

```r
boot_stats %>%
  summarize(perc_CI_med = quantile(boot_med, c(0.025, 0.975)), 
            perc_CI_tr_mean = quantile(boot_tr_mean, c(0.025, 0.975)), 
            q = c(0.025, 0.975))
```

```
## # A tibble: 2 × 3
##   perc_CI_med perc_CI_tr_mean     q
##         <dbl>           <dbl> <dbl>
## 1      321             334.86 0.025
## 2      434.58          419.77 0.975
```

---

## Double bootstrapping with `map`

.panelset[

.panel[.panel-name[set variables]

```r
n_rep1 <- 100
n_rep2 <- 20
set.seed(4747)
```
]

.panel[.panel-name[boot stat function]

.panel[.panel-name[resample function]

```r
boot_1_func <- function(df){
  df %>% 
    sample_frac(size=1, replace=TRUE)
}
```
]

.panel[.panel-name[re-resample function]

```r
boot_2_func <- function(df, reps){
  resample2 <- 1:reps
  df %>%
    summarize(boot_med = median(times), boot_tr_mean = mean(times, trim = 0.25)) %>%
    cbind(resample2, map_df(resample2, ~df %>% 
             sample_frac(size=1, replace=TRUE) %>%
             summarize(boot_2_med = median(times), 
                       boot_2_tr_mean = mean(times, trim = 0.25)))) %>%
    select(resample2, everything())
}
```
]

.panel[.panel-name[double bootstrap!]

```r
boot_2_stats <- data.frame(resample1 = 1:n_rep1) %>%
  mutate(first_boot = map(1:n_rep1, ~boot_1_func(df = heroin))) %>%
  mutate(second_boot = map(first_boot, boot_2_func, reps = n_rep2)) 
```

]

---

## Summarizing the double bootstrap

.panelset[

.panel[.panel-name[results]

```r
boot_2_stats %>%
  unnest(second_boot) %>%
  unnest(first_boot) 
```

```
## # A tibble: 476,000 × 12
##    resample1    id clinic status times prison  dose resample2 boot_med
##        <int> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>     <int>    <dbl>
##  1         1   257      1      1   204      0    50         1      368
##  2         1   230      1      0    28      0    50         1      368
##  3         1   229      1      1   216      0    50         1      368
##  4         1   186      2      0   683      0   100         1      368
##  5         1   119      2      0   684      0    65         1      368
##  6         1    73      1      0   405      0    80         1      368
##  7         1    41      1      1   550      1    60         1      368
##  8         1    75      1      0   905      0    80         1      368
##  9         1    68      1      0   439      0    80         1      368
## 10         1   224      1      1   546      1    50         1      368
## # … with 475,990 more rows, and 3 more variables: boot_tr_mean <dbl>,
## #   boot_2_med <dbl>, boot_2_tr_mean <dbl>
```
]

.panel[.panel-name[summary for resample 1]

```r
boot_2_stats %>%
  unnest(second_boot) %>%
  unnest(first_boot) %>%
  filter(resample1 == 1) 
```

```
## # A tibble: 4 × 4
##   skim_variable  numeric.mean numeric.sd numeric.p50
##   <chr>                 <dbl>      <dbl>       <dbl>
## 1 boot_med             368         0          368   
## 2 boot_tr_mean         372.26      0          372.26
## 3 boot_2_med           364.75     32.538      367.25
## 4 boot_2_tr_mean       367.94     21.516      366.74
```
]

.panel[.panel-name[summary for all resamples]

```r
boot_t_stats <- boot_2_stats %>%
  unnest(second_boot) %>%
  unnest(first_boot) %>%
  group_by(resample1) %>%
  summarize(boot_sd_med = sd(boot_2_med),
            boot_sd_tr_mean = sd(boot_2_tr_mean),
            boot_med = mean(boot_med),  # doesn't do anything, just copies over
            boot_tr_mean = mean(boot_tr_mean))  %>% # the variables into the output
  mutate(boot_t_med = (boot_med - mean(boot_med)) / boot_sd_med,
            boot_t_tr_mean = (boot_tr_mean - mean(boot_tr_mean)) / boot_sd_tr_mean)

boot_t_stats
```

```
## # A tibble: 100 × 7
##    resample1 boot_sd_med boot_sd_tr_mean boot_med boot_tr_mean boot_t_med
##        <int>       <dbl>           <dbl>    <dbl>        <dbl>      <dbl>
##  1         1      32.538          21.516    368         372.26   -0.15367
##  2         2      24.219          18.789    358         362.78   -0.61934
##  3         3      32.007          21.129    431         420.62    1.8121 
##  4         4      49.141          33.982    331.5       349.62   -0.84452
##  5         5      22.705          13.374    310.5       330.78   -2.7527 
##  6         6      20.264          19.908    376         382.3     0.14804
##  7         7      35.345          22.086    366         365.02   -0.19805
##  8         8      14.993          16.412    378.5       381.9     0.36683
##  9         9      27.584          20.857    394         386.1     0.76131
## 10        10      38.452          19.621    391.5       402.38    0.48111
## # … with 90 more rows, and 1 more variable: boot_t_tr_mean <dbl>
```

]
]

---

## 95% Bootstrap-t CI

Note that the t-value is needed (which requires a different SE for each bootstrap sample).

.panelset[

.panel[.panel-name[t-values]

```r
boot_t_stats %>%
  select(boot_t_med, boot_t_tr_mean)
```

```
## # A tibble: 100 × 2
##    boot_t_med boot_t_tr_mean
##         <dbl>          <dbl>
##  1   -0.15367       -0.24930
##  2   -0.61934       -0.78976
##  3    1.8121         2.0353 
##  4   -0.84452       -0.82389
##  5   -2.7527        -3.5024 
##  6    0.14804        0.23497
##  7   -0.19805       -0.57037
##  8    0.36683        0.26065
##  9    0.76131        0.40646
## 10    0.48111        1.2616 
## # … with 90 more rows
```
]

.panel[.panel-name[multipliers]

```r
boot_q <- boot_t_stats %>%
  select(boot_t_med, boot_t_tr_mean) %>%
  summarize(q_t_med = quantile(boot_t_med, c(0.025, 0.975)), 
            q_t_tr_mean = quantile(boot_t_tr_mean, c(0.025, 0.975)),
            q = c(0.025, 0.975))

boot_q
```

```
## # A tibble: 2 × 3
##   q_t_med q_t_tr_mean     q
##     <dbl>       <dbl> <dbl>
## 1 -2.1709     -2.2041 0.025
## 2  1.7827      2.0599 0.975
```
]

.panel[.panel-name[pull numbers]

```r
boot_q_med <- boot_q %>% select(q_t_med) %>% pull()
boot_q_med
```

```
##    2.5%   97.5% 
## -2.1709  1.7827
```

```r
boot_q_tr_mean <- boot_q %>% select(q_t_tr_mean) %>% pull()
boot_q_tr_mean
```

```
##    2.5%   97.5% 
## -2.2041  2.0599
```
]

.panel[.panel-name[BS-t CI]

```r
boot_t_stats %>%
  summarize(boot_t_CI_med = mean(boot_med) + 
                                  boot_q_med*sd(boot_med),
            boot_t_CI_tr_mean = mean(boot_tr_mean) + 
                                  boot_q_tr_mean * sd(boot_tr_mean),
            q = c(0.025, 0.975))
```

```
## # A tibble: 2 × 3
##   boot_t_CI_med boot_t_CI_tr_mean     q
##           <dbl>             <dbl> <dbl>
## 1        309.30            331.03 0.025
## 2        425.31            421.17 0.975
```
]

]

---

## Comparison of intervals

The first three columns  correspond to the CIs for the true median of the survival times.  The second three columns correspond to the CIs for the true trimmed mean of the survival times.

CI | lower | obs Med | upper | lower | obs Tr Mean | upper
--- | ----- | ----- | ----- | ----- | ------ | -------- | 
Percentile | 321 | 367.50| 434.58 | 334.86 | 378.30 | 429.77
w BS SE | 309.98 | 367.50| 425.01 | 336.90 | 378.30 | 419.77
BS-t | 309.30 | 367.50| 425.31 | 331.02 | 378.30 | 421.17

---

## What makes a confidence interval procedure good?

1. That it captures the true parameter in `\(1-\alpha \cdot\)` 100% of the datasets.

2. That it produces narrow intervals.

---

## What else about intervals?

| CI 	| Symmetric 	| Range Resp 	| Trans Resp 	| Accuracy 	| Normal Samp Dist? 	| Other 	|
|:-----:	|:---------:	|:----------:	|:----------:	|:--------------:	|:-----------------:	|------------------------------------	|
| Boot SE 	| Yes 	| No 	| No 	| `\(1^{st}\)` order 	| Yes 	| param assump `\(F(\hat{\theta})\)` 	|
| Boot-t 	| No 	| No 	| No 	| `\(2^{nd}\)` order 	| Yes/No 	| computer intensive 	|
| perc 	| No 	| Yes 	| Yes 	| `\(1^{st}\)` order 	| No 	| small `\(n \rightarrow\)` low accuracy 	|
| BCa 	| No 	| Yes 	| Yes 	| `\(2^{nd}\)` order 	| No 	| limited param assump 	|