class: right, top, my-title, title-slide # Simulating: scenarios, datasets, random variables ### Jo Hardin ### September 21 & 23, 2021 --- # Agenda 9/21/21 1. Why simulate? 2. What makes a good simulation? 3. Examples --- ## What can simulation tell us? * 2016: 10 of the 16 Republican presidential candidates were allowed into the first debate using an average of the five most recent major national polls. <img src="../images/GOPdebate.jpg" width="100%" style="display: block; margin: auto;" /> Simulating who would be in the [first GOP debate](http://www.nytimes.com/interactive/2015/07/21/upshot/election-2015-the-first-gop-debate-and-the-role-of-chance.html) --- ### Goals of Simulating Complicated Models The goal of simulating a complicated model is not only to create a program which will provide the desired results. We also hope to be able to write code such that: 1. The problem is broken down into small pieces. 2. The problem has checks in it to see what works (run the lines *inside* the functions!). 3. uses simple code (as often as possible). --- ## Aside: a few R functions (`ifelse()`) ```r data.frame(value = c(-2:2)) %>% mutate(abs_value = ifelse(value >=0, value, -value)) # abs val ``` ``` ## value abs_value ## 1 -2 2 ## 2 -1 1 ## 3 0 0 ## 4 1 1 ## 5 2 2 ``` --- ## Aside: a few R functions (`case_when()`) ```r set.seed(4747) diamonds %>% select(carat, cut, color, price) %>% sample_n(20) %>% mutate(price_cat = case_when( price > 10000 ~ "expensive", price > 1500 ~ "medium", TRUE ~ "inexpensive")) ``` ``` ## # A tibble: 20 × 5 ## carat cut color price price_cat ## <dbl> <ord> <ord> <int> <chr> ## 1 1.23 Very Good F 10276 expensive ## 2 0.35 Premium H 706 inexpensive ## 3 0.7 Good E 2782 medium ## 4 0.4 Ideal D 1637 medium ## 5 0.53 Ideal G 1255 inexpensive ## 6 2.22 Ideal G 14637 expensive ## 7 0.3 Ideal G 878 inexpensive ## 8 1.05 Ideal H 4223 medium ## 9 0.53 Premium E 1654 medium ## 10 1.7 Ideal H 7068 medium ## 11 0.31 Good E 698 inexpensive ## 12 0.31 Ideal F 840 inexpensive ## 13 1.03 Ideal H 4900 medium ## 14 0.31 Premium G 698 inexpensive ## 15 1.56 Premium G 8858 medium ## 16 1.71 Premium G 11032 expensive ## 17 1 Good E 5345 medium ## 18 1.86 Ideal J 10312 expensive ## 19 1.08 Very Good E 3726 medium ## 20 0.31 Premium E 698 inexpensive ``` --- ## Aside: a few R functions (`sample()`) *sampling*, *shuffling*, and *resampling*: `sample()` ```r alph <- letters[1:10] sample(alph, 5, replace = FALSE) # sample (from a population) ``` ``` ## [1] "g" "b" "e" "f" "h" ``` ```r sample(alph, 15, replace = TRUE) # sample (from a population) ``` ``` ## [1] "j" "g" "i" "a" "g" "e" "a" "g" "c" "e" "j" "i" "d" "a" "d" ``` ```r sample(alph, 10, replace = FALSE) # shuffle ``` ``` ## [1] "b" "h" "f" "i" "a" "d" "j" "c" "g" "e" ``` ```r sample(alph, 10, replace = TRUE) # resample ``` ``` ## [1] "i" "h" "j" "i" "i" "f" "b" "f" "b" "d" ``` --- ## Why? Three simulating methods are used for different purposes: 1. Monte Carlo methods - use *sampling* repeated sampling from populations with known (either via data or via populations) characteristics. 2. Randomization / Permutation methods - use *shuffling* (sampling without replacement) to test hypotheses of "no effect". 3. Bootstrap methods - use *resampling* (sampling with replacement) to establish confidence intervals. --- ## Aside: a few R functions (`set.seed()`) What if we want to be able to generate the **same** random numbers (here on the interval from 0 to 1) over and over? ```r set.seed(4747) runif(4, 0, 1) # random uniform numbers ``` ``` ## [1] 0.195 0.339 0.515 0.452 ``` ```r set.seed(123) runif(4, 0, 1) # random uniform numbers ``` ``` ## [1] 0.288 0.788 0.409 0.883 ``` ```r set.seed(4747) runif(4, 0, 1) # random uniform numbers ``` ``` ## [1] 0.195 0.339 0.515 0.452 ``` --- ## Aside: a few R functions (`rep()`, `seq()`) Output is a vector; `rep ()`has repeated values, `seq()` for a sequence. ```r rep(c("a", "b", "c"), times = 2) ``` ``` ## [1] "a" "b" "c" "a" "b" "c" ``` ```r rep(c("a", "b", "c"), times = 1:3) ``` ``` ## [1] "a" "b" "b" "c" "c" "c" ``` ```r rep(c("a", "b", "c"), each = 2) ``` ``` ## [1] "a" "a" "b" "b" "c" "c" ``` ```r seq(2, 9, by = pi) ``` ``` ## [1] 2.00 5.14 8.28 ``` ```r seq(47, 40, by = -4) ``` ``` ## [1] 47 43 ``` --- ## Small simulation problem Estimate `\(E(X)\)` where `\(X=\max \{ k: \sum_{i=1}^k U_i < 1 \}\)` and `\(U_i\)` are uniform(0,1). -- ```r sum_unif <- function(.x){ sumU <- 0 k <- 0 while(sumU < 1) { sumU <- sumU + runif(1) k <- k+1 } return(c(k-1, sumU)) } set.seed(4747) sum_unif() ``` ``` ## [1] 2.00 1.05 ``` --- ## **purrr** for functional programming .pull-left[ Recall the `map` functions which are *named* by the **output** the produce. For example: * `map(.x, .f)` is the main mapping function and returns a list * `map_df(.x, .f)` returns a data frame * `map_dbl(.x, .f)` returns a numeric (double) vector * `map_chr(.x, .f)` returns a character vector * `map_lgl(.x, .f)` returns a logical vector ] .pull-right[ <div class="figure" style="text-align: center"> <img src="../images/purrr_map.png" alt="From Advanced R by Wickham. https://adv-r.hadley.nz/functionals.html" width="90%" /> <p class="caption">From Advanced R by Wickham. https://adv-r.hadley.nz/functionals.html</p> </div> ] The first argument is the data object and the second object is the function to iteratively apply. --- ## Small simulation problem Estimate `\(E(X)\)` where `\(X=\max \{ k: \sum_{i=1}^k U_i < 1 \}\)` and `\(U_i\)` are uniform(0,1). To understand the simulation process, the lines of code below should be run one at a time. We will do that in class, but if you are reviewing these slides, please do it on your own at home. You should be able to reproduce exactly the results using the example. .panelset[ .panel[.panel-name[sum function] ```r sum_unif <- function(.x){ sumU <- 0 k <- 0 while(sumU < 1) { sumU <- sumU + runif(1) k <- k+1 } return(c(k-1, sumU)) } ``` ] .panel[.panel-name[mapping] ```r set.seed(4747) reps <- 1000 sim_k_max <- data.frame(row_id = seq(1, reps, 1)) %>% mutate(max_for_EX = map(row_id, sum_unif)) %>% unnest(max_for_EX) %>% mutate(output = rep(c("k", "sum"), reps)) %>% pivot_wider(id_cols = row_id, names_from = output, values_from = max_for_EX) ``` ] .panel[.panel-name[output] ```r sim_k_max ``` ``` ## # A tibble: 1,000 × 3 ## row_id k sum ## <dbl> <dbl> <dbl> ## 1 1 2 1.05 ## 2 2 2 1.68 ## 3 3 1 1.39 ## 4 4 1 1.47 ## 5 5 2 1.03 ## 6 6 1 1.45 ## 7 7 1 1.41 ## 8 8 1 1.83 ## 9 9 1 1.17 ## 10 10 2 1.42 ## # … with 990 more rows ``` ] .panel[.panel-name[expected value] ```r sim_k_max %>% summarize(EX = mean(k)) ``` ``` ## # A tibble: 1 × 1 ## EX ## <dbl> ## 1 1.74 ``` ] ] --- ## Thoughts on simulating * break down the problem into **small** steps * check to see what works * simple is better --- ## Pass the Pigs A game by Winning Moves (previously Hasboro) involving rolling pigs <img src="../images/Pass the Pigs Points.png" width="100%" style="display: block; margin: auto;" /> --- ## Pig Rules - Roll pigs, accumulate points - Stop if you want or if you Pig Out (no Oinker) - First player to 100 points wins What is your best strategy for winning??? --- ## Basic Blackjack * Card game, goal: sum cards as close to 21 without going over * A few nuances to card value (e.g., Ace can be 1 or 11) * Start with 2 cards, build up one card at a time * Lots of different strategies (also based on dealer's cards) <img src="../images/Blackjack_game_1.png" width="100%" style="display: block; margin: auto;" /> --- ## What do we need to simulate poker? > - set-up of cards, dealing, hands > - "score" (both sum of cards and payout) > - strategies > - result of strategies (summary of outcomes) .footnote[Example and code come from **Data Science in R: a case studies approach to computational reasoning and problem solving** by Nolan and Temple Lang, chapter 9 *Simulating Blackjack* by Hadley Wickham. R code online at [http://rdatasciencecases.org/](http://rdatasciencecases.org/).] --- ## Setting up the Game in R ```r deck = rep(c(1:10, 10, 10, 10), 4) shuffle_decks = function(ndecks){sample(rep(deck, ndecks))} head(shuffle_decks(4), 10) ``` ``` ## [1] 3 6 4 4 10 8 4 6 1 3 ``` --- ## Outcome of cards in hand ```r handValue = function(cards) { value = sum(cards) value # Check for an Ace and change value if it doesn't bust if (any(cards == 1) && value <= 11) value = value + 10 # Check bust (set to 0); check Blackjack (set to 21.5) if(value > 21) 0 else if (value == 21 && length(cards) == 2) 21.5 # Blackjack else value } handValue(c(10,4)) ``` ``` ## [1] 14 ``` --- ## $ of cards in hand ```r winnings = function(dealer, players) { if (dealer > 21) { # Dealer=Blackjack, ties players with Blackjack -1 * (players <= 21) } else if (dealer == 0) { # Dealer busts - all non-busted players win 1.5 * (players > 21) + 1 * (players <= 21 & players > 0) + -1 * (players == 0) } else { # Dealer 21 or below, all players > dealer win 1.5 * (players > 21) + 1 * (players <= 21 & players > dealer) + -1 * (players <= 21 & players < dealer) } } winnings(17,c(20, 21.5, 14, 0, 21)) ``` ``` ## [1] 1.0 1.5 -1.0 -1.0 1.0 ``` --- ## Better $ of cards in hand ```r winnings = function(dealer, players){ (players > dealer & players > 21) * 1.5 + # Blackjack (players > dealer & players <= 21) * 1 + # win (players < dealer | players == 0) * -1 # lose } winnings(17,c(20, 21.5, 14, 0, 21)) ``` ``` ## [1] 1.0 1.5 -1.0 -1.0 1.0 ``` ```r winnings(21.5,c(20, 21.5, 14, 0, 21)) ``` ``` ## [1] -1 0 -1 -1 -1 ``` --- ## How well does `handValue` work? ```r test_cards = list( c(10, 1), c(10, 5, 6), c(10, 1, 1), c(7, 6, 1, 5), c(3, 6, 1, 1), c(2, 3, 4, 10), c(5, 1, 9, 1, 1), c(5, 10, 7), c(10, 9, 1, 1, 1)) test_cards_val = c(21.5, 21, 12, 19, 21, 19, 17, 0, 0) sapply(test_cards, handValue) # apply the function handValue to test_cards ``` ``` ## [1] 21.5 21.0 12.0 19.0 21.0 19.0 17.0 0.0 0.0 ``` ```r identical(test_cards_val, sapply(test_cards, handValue)) ``` ``` ## [1] TRUE ``` --- ## Function for getting more cards ```r shoe = function(m = 1) sample(deck, m, replace = TRUE) new_hand = function(shoe, cards = shoe(2), bet = 1) { list(bet = bet, shoe = shoe, cards = cards) } myCards = new_hand(shoe, bet = 7) myCards ``` ``` ## $bet ## [1] 7 ## ## $shoe ## function(m = 1) sample(deck, m, replace = TRUE) ## ## $cards ## [1] 10 4 ``` --- ## First action: hit receive another card ```r hit = function(hand) { hand$cards = c(hand$cards, hand$shoe(1)) hand } hit(myCards)$cards ``` ``` ## [1] 10 4 3 ``` --- ## Second action: stand stay with current cards ```r stand = function(hand) hand stand(myCards)$cards ``` ``` ## [1] 10 4 ``` --- ## Third action: double down double the bet after receiving exactly one more card ```r dd = function(hand) { hand$bet = hand$bet * 2 hand = hit(hand) stand(hand) } dd(myCards)$cards ``` ``` ## [1] 10 4 10 ``` --- ## Fourth action: split a pair create two different hands from initial hand with two cards of the same value ```r splitPair = function(hand) { list( new_hand(hand$shoe, cards = c(hand$cards[1], hand$shoe(1)), bet = hand$bet), new_hand(hand$shoe, cards = c(hand$cards[2], hand$shoe(1)), bet = hand$bet)) } splitHand = splitPair(myCards) ``` --- ## Results of splitting (can we always split?) ```r splitHand[[1]]$cards ``` ``` ## [1] 10 4 ``` ```r splitHand[[2]]$cards ``` ``` ## [1] 4 9 ``` --- ## Let's play! Not yet automated... ```r set.seed(470); dealer = new_hand(shoe); player = new_hand(shoe); dealer$cards[1] ``` ``` ## [1] 2 ``` ```r player$cards; player = hit(player); player$cards ``` ``` ## [1] 5 10 ``` ``` ## [1] 5 10 9 ``` ```r dealer$cards; dealer = hit(dealer); dealer$cards ``` ``` ## [1] 2 3 ``` ``` ## [1] 2 3 3 ``` --- ## Who won? ```r dealer$cards; player$cards ``` ``` ## [1] 2 3 3 ``` ``` ## [1] 5 10 9 ``` ```r handValue(dealer$cards); handValue(player$cards) ``` ``` ## [1] 8 ``` ``` ## [1] 0 ``` ```r winnings(handValue(dealer$cards), handValue(player$cards)) ``` ``` ## [1] -1 ``` --- ## Simply strategy recall the handValue function -- what if player busts? ```r strategy_simple = function(mine, dealerFaceUp) { if (handValue(dealerFaceUp) > 6 && handValue(mine) < 17) "H" else "S" } ``` --- ## Better simple strategy ```r strategy_simple = function(mine, dealerFaceUp) { if (handValue(mine) == 0) return("S") if (handValue(dealerFaceUp) > 6 && handValue(mine) < 17) "H" else "S" } ``` --- ## Dealer The dealer gets cards regardless of what the player does ```r dealer_cards = function(shoe) { cards = shoe(2) while(handValue(cards) < 17 && handValue(cards) > 0) { cards = c(cards, shoe(1)) } cards } dealer_cards(shoe) ``` ``` ## [1] 5 9 1 8 ``` ```r dealer_cards(shoe) ``` ``` ## [1] 4 4 7 10 ``` --- ## Playing a hand ```r play_hand = function(shoe, strategy, hand = new_hand(shoe), dealer = dealer_cards(shoe)) { face_up_card = dealer[1] action = strategy(hand$cards, face_up_card) while(action != "S" && handValue(hand$cards) != 0) { if (action == "H") { hand = hit(hand) action = strategy(hand$cards, face_up_card) } else { stop("Unknown action: should be one of S, H") } } winnings(handValue(dealer), handValue(hand$cards)) * hand$bet } ``` --- ## Play a few hands ```r play_hand(shoe, strategy_simple) ``` ``` ## [1] 0 ``` ```r play_hand(shoe, strategy_simple) ``` ``` ## [1] -1 ``` ```r play_hand(shoe, strategy_simple, new_hand(shoe, bet=7)) ``` ``` ## [1] 0 ``` ```r play_hand(shoe, strategy_simple, new_hand(shoe, bet=7)) ``` ``` ## [1] 7 ``` --- ## Repeated games To repeat the game, we simply repeat the `play_hand` function and keep track of the dollars gained or lost. ```r set.seed(4747) reps <- 10 money <- 20 map_dbl(1:reps, ~ play_hand(shoe, strategy_simple)) %>% accumulate(`+`) + 20 ``` ``` ## [1] 21 20 19 18 18 19 20 21 20 19 ``` --- # Agenda 9/23/21 1. Bias in modeling 2. Simulating statistical inference --- ## Bias in a model ``` talent ~ Normal (100, 15) grades ~ Normal (talent, 15) SAT ~ Normal (talent, 15) ``` College wants to admit students with > `talent > 115` ... but they only have access to `grades` and `SAT` which are noisy estimates of `talent`. The example is taken directly (and mostly verbatim) from a blog by Aaron Roth [Algorithmic Unfairness Without Any Bias Baked In](http://aaronsadventures.blogspot.com/2019/01/discussion-of-unfairness-in-machine.html). --- ## Plan for accepting students * Run a regression on a training dataset (`talent` is known for existing students) * Find a model which predicts `talent` based on `grades` and `SAT` * Choose students for whom predicted `talent` is above 115 --- ## Flaw in the plan ... * there are two populations of students, the Reds and Blues. * Reds are the majority population (99%), and Blues are a small minority population (1%) * the Reds and the Blues are no different when it comes to talent: they both have the same talent distribution, as described above. * there is no bias baked into the grading or the exams: both the Reds and the Blues also have exactly the same grade and exam score distributions -- > But there is one difference: the Blues have more money than the Reds, so they each take the **SAT twice**, and report only the highest of the two scores to the college. This results in a small but noticeable bump in their average SAT scores, compared to the Reds. --- ## Key aspect: > The value of `SAT` means something different for the Reds versus the Blues (They have different feature distributions.) --- ## Let's see what happens ... ![](2021-09-21-simulating_files/figure-html/unnamed-chunk-45-1.png)<!-- --> --- ## Two models: Red model (SAT taken once): ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 33.2 0.152 218. 0 ## 2 SAT 0.334 0.00149 224. 0 ## 3 grades 0.334 0.00149 225. 0 ``` Blue model (SAT is max score of two): ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 25.3 1.60 15.8 2.04e- 50 ## 2 SAT 0.432 0.0161 26.7 3.35e-119 ## 3 grades 0.277 0.0154 18.0 6.83e- 63 ``` --- ## New data * Generate new data and use the **two** models above, separately. * How well do the models predict if a student has `talent` > 115? ![](2021-09-21-simulating_files/figure-html/unnamed-chunk-49-1.png)<!-- --> --- ## New data .pull-left[ ![](2021-09-21-simulating_files/figure-html/unnamed-chunk-50-1.png)<!-- --> ] .pull-right[ ``` ## # A tibble: 2 × 5 ## color tpr fpr fnr error ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 blue 0.482 0.0371 0.518 0.103 ## 2 red 0.503 0.0378 0.497 0.111 ``` ] --- ## **TWO** models doesn't seem right???? What if we fit only one model to the entire dataset? Afterall, there are laws against using protected classes to make decisions (housing, jobs, money loans, college, etc.) ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 33.1 0.151 219. 0 ## 2 SAT 0.334 0.00148 225. 0 ## 3 grades 0.334 0.00148 226. 0 ``` (The coefficients kinda look like the red model...) --- ## How do the error rates change? .pull-left[ ![](2021-09-21-simulating_files/figure-html/unnamed-chunk-53-1.png)<!-- --> ] .pull-right[ One model: ``` ## # A tibble: 2 × 5 ## color tpr fpr fnr error ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 blue 0.584 0.0660 0.416 0.114 ## 2 red 0.502 0.0375 0.498 0.111 ``` Two separate models: ``` ## # A tibble: 2 × 5 ## color tpr fpr fnr error ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 blue 0.482 0.0371 0.518 0.103 ## 2 red 0.503 0.0378 0.497 0.111 ``` ] --- ## What did we learn? > with two populations that have different feature distributions, learning a single classifier (that is prohibited from discriminating based on population) will fit the bigger of the two populations * depending on the nature of the distribution difference, it can be either to the benefit or the detriment of the minority population * no explicit human bias, either on the part of the algorithm designer or the data gathering process * the problem is exacerbated if we artificially force the algorithm to be group blind * well intentioned "fairness" regulations prohibiting decision makers form taking sensitive attributes into account can actually make things less fair and less accurate at the same time --- ## Simulate? * different varying proportions * effect due to variability * effect due to SAT coefficient * different number of times the blues get to take the test * etc. --- ## Sensitivity of CI to tech conditions Consider the following set up (as in the WU): ```r set.seed(474) beta0 <- -1; beta1 <- 0.5; beta2 <- 1.5 n_obs <- 100; reps <- 1000 x1 <- rep(c(0,1), each=n_obs/2) x2 <- runif(n_obs, min=-1, max=1) y <- beta0 + beta1*x1 + beta2*x2 + rnorm(n_obs, mean=0, sd = 1) ``` --- ## Plot of data ![](2021-09-21-simulating_files/figure-html/unnamed-chunk-58-1.png)<!-- --> --- ## Capture the slope parameter? * We know the truth (!!!!!) * We can create a single CI for the parameter given a single data set * Is the parameter in the interval? * Should it always be in the interval? How often should it be? * Repeat the process to see whether the confidence interval captures the parameter at the claimed rate. See: [Rossman Chance regression applet](https://www.rossmanchance.com/applets/2021/regshuffle/regshuffle.htm) --- ## Running a linear model ```r CI <- lm(y~x1+x2) %>% tidy(conf.int=TRUE) %>% data.frame() CI ``` ``` ## term estimate std.error statistic p.value conf.low conf.high ## 1 (Intercept) -0.929 0.142 -6.52 3.10e-09 -1.2118 -0.647 ## 2 x1 0.330 0.202 1.64 1.05e-01 -0.0698 0.731 ## 3 x2 1.445 0.192 7.53 2.66e-11 1.0639 1.826 ``` ```r CI %>% filter(term == "x2") %>% select(term, estimate, conf.low, conf.high) %>% mutate(inside = between(beta2, conf.low, conf.high)) ``` ``` ## term estimate conf.low conf.high inside ## 1 x2 1.44 1.06 1.83 TRUE ``` --- ## What happens to the CI of coefficients in repeated samples? (eq var) ```r eqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) %>% slice(rep(row_id, each = reps)) %>% mutate( sim_id = rep(1:reps, n_obs), x1 = rep(c(0,1), each = n()/2), x2 = runif(n(), min = -1, max = 1), y = beta0 + beta1*x1 + beta2*x2 + rnorm(n(), mean = 0, sd = 1) ) %>% arrange(sim_id, row_id) %>% group_by(sim_id) %>% nest() eqvar_data ``` ``` ## # A tibble: 1,000 × 2 ## # Groups: sim_id [1,000] ## sim_id data ## <int> <list> ## 1 1 <tibble [100 × 4]> ## 2 2 <tibble [100 × 4]> ## 3 3 <tibble [100 × 4]> ## 4 4 <tibble [100 × 4]> ## 5 5 <tibble [100 × 4]> ## 6 6 <tibble [100 × 4]> ## 7 7 <tibble [100 × 4]> ## 8 8 <tibble [100 × 4]> ## 9 9 <tibble [100 × 4]> ## 10 10 <tibble [100 × 4]> ## # … with 990 more rows ``` --- ```r beta_coef <- function(df){ lm(y ~ x1 + x2, data = df) %>% tidy(conf.int = TRUE) %>% filter(term == "x2") %>% select(estimate, conf.low, conf.high, p.value) } eqvar_data %>% mutate(b2_vals = map(data, beta_coef)) %>% select(b2_vals) %>% unnest(b2_vals) %>% summarize(capture = between(beta2, conf.low, conf.high)) %>% summarize(capture_rate = sum(capture)/reps) ``` ``` ## # A tibble: 1 × 1 ## capture_rate ## <dbl> ## 1 0.953 ``` --- ## Sensitivity of CI to tech conditions Consider the following set up (note the difference in variability): <code class ='r hljs remark-code'>set.seed(470)<br>beta0 <- -1<br>beta1 <- 0.5<br>beta2 <- 1.5<br>n_obs <- 100<br>reps <- 1000<br><br> x1 <- rep(c(0,1), each=n_obs/2)<br> x2 <- runif(n_obs, min=-1, max=1)<br> y <- beta0 + beta1*x1 + beta2*x2 + <br> rnorm(n_obs, mean=0, <span style='background-color:#ffff7f'>sd = 1 + x1 + 10*abs(x2)</span>)</code> --- ## Plot of data ![](2021-09-21-simulating_files/figure-html/unnamed-chunk-63-1.png)<!-- --> --- ## What happens to the CI of coefficients in repeated samples? (uneq var) <code class ='r hljs remark-code'>uneqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) %>%<br> slice(rep(row_id, each = reps)) %>%<br> mutate(sim_id = rep(1:reps, n_obs), <br> x1 = rep(c(0,1), each = n()/2),<br> x2 = runif(n(), min = -1, max = 1),<br> y = beta0 + beta1*x1 + beta2*x2 + <br> rnorm(n(), mean = 0, <span style='background-color:#ffff7f'>sd = 1 + x1 + 10*abs(x2)</span>) ) %>%<br> arrange(sim_id, row_id) %>%<br> group_by(sim_id) %>% nest() <br><br>uneqvar_data %>% <br> mutate(b2_vals = map(data, beta_coef)) %>%<br> select(b2_vals) %>% unnest(b2_vals) %>%<br> summarize(capture = between(beta2, conf.low, conf.high)) %>%<br> summarize(capture_rate = sum(capture)/reps)</code> ``` ## # A tibble: 1 × 1 ## capture_rate ## <dbl> ## 1 0.877 ``` --- ## All together .panelset[ .panel[.panel-name[coef function] ```r beta_coef <- function(df){ lm(y ~ x1 + x2, data = df) %>% tidy(conf.int = TRUE) %>% filter(term == "x2") %>% select(estimate, conf.low, conf.high, p.value) } ``` ] .panel[.panel-name[generate data] ```r uneqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) %>% slice(rep(row_id, each = reps)) %>% mutate(sim_id = rep(1:reps, n_obs), x1 = rep(c(0,1), each = n()/2), x2 = runif(n(), min = -1, max = 1), y = beta0 + beta1*x1 + beta2*x2 + rnorm(n(), mean = 0, sd = 1 + x1 + 10*abs(x2)) ) %>% arrange(sim_id, row_id) %>% group_by(sim_id) %>% nest() ``` ] .panel[.panel-name[capture rate] ```r uneqvar_data %>% mutate(b2_vals = map(data, beta_coef)) %>% select(b2_vals) %>% unnest(b2_vals) %>% summarize(capture = between(beta2, conf.low, conf.high)) %>% summarize(capture_rate = sum(capture)/reps) ``` ``` ## # A tibble: 1 × 1 ## capture_rate ## <dbl> ## 1 0.877 ``` ] ] --- ## Equal variance: type I error? We can also consider the type I errors associated with the two data configurations. With equal variance, the type I error rate is close to what we'd expect, 5%. .panelset[ .panel[.panel-name[t-test function] ```r t_test_pval <- function(df){ t.test(y ~ x1, data = df, var.equal = TRUE) %>% tidy() %>% select(estimate, p.value) } ``` ] .panel[.panel-name[generate data] ```r set.seed(470) reps <- 1000 n_obs <- 20 null_data_equal <- data.frame(row_id = seq(1, n_obs, 1)) %>% slice(rep(row_id, each = reps)) %>% mutate( sim_id = rep(1:reps, n_obs), x1 = rep(c("group1", "group2"), each = n()/2), y = rnorm(n(), mean = 10, sd = rep(c(1,1), each = n()/2)) ) %>% arrange(sim_id, row_id) %>% group_by(sim_id) %>% nest() ``` ] .panel[.panel-name[summarize p-values] ```r null_data_equal %>% mutate(t_vals = map(data,t_test_pval)) %>% select(t_vals) %>% unnest(t_vals) %>% ungroup(sim_id) %>% summarize(type1error_rate = sum(p.value < 0.05)/reps) ``` ``` ## # A tibble: 1 × 1 ## type1error_rate ## <dbl> ## 1 0.045 ``` ] ] --- ## Unequal variance: type I error? With unequal variance, the type I error rate is much higher than we set. We set the the type I error rate to be 5%, but the simulated rate was 6.36%. .panelset[ .panel[.panel-name[t-test function] ```r t_test_pval <- function(df){ t.test(y ~ x1, data = df, var.equal = TRUE) %>% tidy() %>% select(estimate, p.value) } ``` ] .panel[.panel-name[generate data] <code class ='r hljs remark-code'>set.seed(470)<br>reps <- 1000<br>n_obs <- 20<br>null_data_unequal <- <br> data.frame(row_id = seq(1, n_obs, 1)) %>%<br> slice(rep(row_id, each = reps)) %>%<br> mutate(<br> sim_id = rep(1:reps, n_obs),<br> x1 = rep(c("group1", "group2"), each = n()/2),<br> y = rnorm(n(), mean = 10, <br> <span style='background-color:#ffff7f'>sd = rep(c(1,100), each = n()/2)</span>)<br> ) %>%<br> arrange(sim_id, row_id) %>%<br> group_by(sim_id) %>%<br> nest()</code> ] .panel[.panel-name[summarize p-values] ```r null_data_unequal %>% mutate(t_vals = map(data,t_test_pval)) %>% select(t_vals) %>% unnest(t_vals) %>% ungroup(sim_id) %>% summarize(type1error_rate = sum(p.value < 0.05)/reps) ``` ``` ## # A tibble: 1 × 1 ## type1error_rate ## <dbl> ## 1 0.057 ``` ] ]