7. Recipes + k-NN
An old adage says: garbage in, garbage out. Recipes are how we avoid the garbage in. \(k\)-Nearest Neighbors is a classification algorithm based on the premise that points which are close to one another (in some predictor space) are likely to be similar with respect to the outcome variable.
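Here, "close" is usually measured with Euclidean distance in the predictor space: for two observations \(x\) and \(x'\) with \(p\) (numeric, standardized) predictors,

\[
d(x, x') = \sqrt{\sum_{j=1}^{p} \left(x_j - x'_j\right)^2},
\]

though other distance metrics can be substituted.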
Agenda
October 21, 2024
- What needs to be done to the data?
- tidymodels syntax for recipes (see the sketch below)
- Example

October 23, 2024
- \(k\)-Nearest Neighbors
- Cross Validation
- Example
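A minimal sketch of the recipe \(\rightarrow\) model \(\rightarrow\) workflow pipeline from the agenda, using the built-in mtcars data; the variable choices here are illustrative, and the kknn package must be installed for the "kknn" engine:

```r
library(tidymodels)

# split the data once into independent training and test sets
set.seed(47)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipe: declare the outcome and predictors, plus feature engineering steps
car_recipe <- recipe(mpg ~ disp + hp + wt, data = car_train) |>
  step_normalize(all_numeric_predictors())

# model specification: k-NN with k = 5 neighbors
knn_spec <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

# bundle the recipe and model, fit on training data, predict on test data
knn_wflow <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(knn_spec)

knn_fit <- fit(knn_wflow, data = car_train)
predict(knn_fit, new_data = car_test)
```

Because the preprocessing lives inside the workflow, the same steps are re-applied automatically whenever the model is refit or used to predict.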
Readings
Class notes: model building
Class notes: \(k\)-nearest neighbors
Max Kuhn and Julia Silge (2021), Tidy Modeling with R
Reflection questions
What is the process for building a model using tidymodels?
Why is it important to do feature engineering for variables in a model?
How is data separated in order to work with independent information (hint: two ways; see the sketch after these questions)?
What is the “\(k\)” in \(k\)-Nearest Neighbors?
Why do most implementations of \(k\)-NN prefer odd values of \(k\)?
How does \(k\)-NN make predictions on test data?
Can \(k\)-NN be used for both classification and regression or only one of the two tasks?
Can you use categorical / character predictors with \(k\)-NN?
How is \(k\) chosen?
How do the bias and variance change for different values of \(k\) in \(k\)-NN?
What are the advantages of the \(k\)-NN algorithm?
What are the disadvantages of the \(k\)-NN algorithm?
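Two of the questions above (how the data are separated, and how \(k\) is chosen) pair naturally: split off an independent test set, then cross-validate within the training set to tune \(k\). A hedged sketch, again with mtcars as a stand-in dataset:

```r
library(tidymodels)  # the kknn package must also be installed

set.seed(470)
car_split <- initial_split(mtcars, prop = 0.75)   # way 1: independent test set
car_train <- training(car_split)
car_folds <- vfold_cv(car_train, v = 5)           # way 2: cross-validation folds

# mark the number of neighbors as a parameter to be tuned
knn_tune <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

car_recipe <- recipe(mpg ~ disp + hp + wt, data = car_train) |>
  step_normalize(all_numeric_predictors())

knn_wflow <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(knn_tune)

# fit each candidate k on every fold, then compare held-out performance
knn_results <- tune_grid(
  knn_wflow,
  resamples = car_folds,
  grid = tibble(neighbors = c(1, 3, 5, 7, 9, 11))
)
select_best(knn_results, metric = "rmse")
```

The test set is touched only once, at the very end, to report the performance of the chosen model.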
Ethics considerations
There are two ways that anti-discrimination laws are enforced (both equally important):
- disparate treatment \(\rightarrow\) the differential treatment is intentional
- disparate impact \(\rightarrow\) the differential treatment is unintentional or implicit (some examples include advancing mortgage credit, employment selection, and predictive policing)
Anti-discrimination Laws
- Civil Rights Acts of 1964 and 1991
- Americans with Disabilities Act
- Genetic Information Nondiscrimination Act
- Equal Credit Opportunity Act
- Fair Housing Act
Questions to ask yourself in every single data analysis you perform (taken from Data Science for Social Good at the University of Chicago):
- What biases may exist in the data you’ve been given? How can you find out?
- How will your choices with tuning parameters affect different populations represented in the data?
- How do you know you aren’t getting the right answer to the wrong question?
- How would you justify what you’d built to someone whose welfare is made worse off by the implementation of your algorithm?
- See the slides on bias in modeling (9/23/21) for cases where the data have no inherent biases but their structure creates unequal model results.
What type of feature engineering is required for \(k\)-NN? (See the sketch after these questions.)
Why is it recommended that \(k\)-NN not be used on large datasets?
(For a given \(k\)) why does \(k\)-NN use more computation time to test than to train? [n.b., the opposite is true for the majority of classification and regression algorithms.]
If the model produces near perfect predictions on the test data, what are some potential concerns about putting that model into production?
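On the feature engineering question above: \(k\)-NN computes distances, so predictors must be numeric and on comparable scales, or variables with large units will dominate. A minimal sketch of the two usual recipe steps, using the built-in iris data:

```r
library(tidymodels)

# step_dummy() encodes categorical predictors (here, Species) as indicators;
# step_normalize() centers and scales so no variable dominates the distance
knn_recipe <- recipe(Sepal.Length ~ ., data = iris) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the steps from the data; bake() applies them
prep(knn_recipe) |> bake(new_data = NULL) |> head()
```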
Slides
Recipes + feature engineering for 10/21/24.
k-NN for 10/23/24.
Additional Resources
Hilary Mason describing what machine learning is to 5 different people.
Julia Silge’s blog is full of complete tidymodels examples and screencasts.
Alexandria Ocasio-Cortez at the Jan 22, 2019 MLK event with Ta-Nehisi Coates
S. Barocas and A. Selbst, “Big Data’s Disparate Impact”, California Law Review 104, 671 (2016)
Machine Bias in ProPublica, by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, May 23, 2016
Algorithmic Justice League is a collective that aims to:
- Highlight algorithmic bias through media, art, and science
- Provide space for people to voice concerns and experiences with coded bias
- Develop practices for accountability during design, development, and deployment of coded systems
Joy Buolamwini – AI, Ain’t I A Woman?