01. Start with R + Git
The importance of reproducibility. Ideas of computational statistics, data science, and machine learning. Some resources for starting with R + RStudio + Git + GitHub.
Agenda
August 26, 2024
- Syllabus & Course Outline
- Stitch Fix Algorithm
- Can Twitter predict election results?
Before Wednesday, listen to the full conversation of Not So Standard Deviations - Compromised Shoe Situation.
August 28, 2024
- Reproducibility & GitHub
- Design Challenge (Not So Standard Deviations)
Before next Thursday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
Readings
- The syllabus
- Modern Data Science with R Prologue
- Class notes: Introduction
- Why Git? + monsters
Reflection questions
What can statistics & data science do? How do they do that?
What can’t statistics & data science do? Why not?
What choices were made to collect the Twitter data?
What choices were made to model the Twitter data?
What are the advantages and disadvantages of high-touch versus low-touch data?
Ethics considerations
Why is it problematic if the analysis isn’t reproducible?
Is every analysis worth doing (e.g., estimating the time to get to work, predicting presidential election results)? Can the act of doing the analysis be ethically questionable?
Slides
In-class slides for both 8/26/24 and 8/28/24.
Additional Resources
Design Challenge (Not So Standard Deviations): listen to the full conversation.
Video (less than 2 min) on the strengths of reproducible research
R vs. Python? (My personal opinion is that neither of the languages is “best”.)
2017 Kaggle user survey and 2019 Stack Overflow Developer Survey
PNAS paper retracted due to problems with figure and reproducibility (April 2016)
Analysis of Trump’s tweets, with evidence that someone else tweets from his account using an iPhone: part 1 and part 2