7. Recipes + k-NN

An old adage says: garbage in, garbage out. Here we avoid garbage in. k-Nearest Neighbors is a classification algorithm based on the premise that points that are close to one another (in some predictor space) are likely to be similar with respect to the outcome variable.

Published

October 21, 2024

Artwork by @allison_horst.

Agenda

October 21, 2024

  1. What needs to be done to the data?
  2. tidymodels syntax for recipes
  3. Example

October 23, 2024

  1. \(k\)-Nearest Neighbors
  2. Cross Validation
  3. Example
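To make the \(k\)-NN item on the agenda concrete, here is a minimal sketch of the premise stated at the top of the page: a new point gets the majority label of its \(k\) closest training points. It is written in plain Python as a language-neutral illustration, not in the tidymodels syntax used in class. The function name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)


def knn_predict(train_X, train_y, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # "Training" is only storing the data; the distance work happens here,
    # at prediction time.
    by_distance = sorted(
        (dist(x, new_point), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # With two classes, an odd k guarantees the vote cannot tie.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two numeric predictors, two classes.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2.0, 2.0), k=3))  # a
print(knn_predict(train_X, train_y, (6.0, 7.0), k=3))  # b
```

Replacing the majority vote with an average of the neighbors' outcome values would turn the same idea into a regression method.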

Readings

Reflection questions

  • What is the process for building a model using tidymodels?

  • Why is it important to do feature engineering for variables in a model?

  • How is data separated in order to work with independent information (hint: two ways)?

  • What is the “\(k\)” in \(k\)-Nearest Neighbors?

  • Why do most implementations of \(k\)-NN prefer odd values of k?

  • How does \(k\)-NN make predictions on test data?

  • Can \(k\)-NN be used for both classification and regression or only one of the two tasks?

  • Can you use categorical / character predictors with \(k\)-NN?

  • How is \(k\) chosen?

  • How do the bias and variance change for different values of \(k\) in \(k\)-NN?

  • What are the advantages of the \(k\)-NN algorithm?

  • What are the disadvantages of the \(k\)-NN algorithm?
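As one way to think about the questions above on how \(k\) is chosen, the sketch below compares candidate values of \(k\) by leave-one-out cross-validation, the simplest special case of cross-validation: each point is predicted from all the others, and the \(k\) with the best held-out accuracy wins. v-fold cross-validation holds out groups of points instead of single points. The code is plain Python rather than tidymodels, and the helper names and the one-predictor toy data (with a few deliberately "noisy" labels) are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, new_point, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip((dist(x, new_point) for x in train_X), train_y))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out CV: predict each point using every other point."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]  # hold out observation i
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


# Hypothetical data: one predictor, mostly "a" on the left and "b" on the right.
X = [(1.0,), (1.7,), (2.6,), (3.2,), (3.9,), (5.1,), (5.8,), (6.6,), (7.5,), (8.1,)]
y = ["a", "a", "a", "b", "a", "b", "a", "b", "b", "b"]

candidates = (1, 3, 5)
for k in candidates:
    print(k, loo_accuracy(X, y, k))  # 1 0.4, then 3 0.6, then 5 0.8

best_k = max(candidates, key=lambda k: loo_accuracy(X, y, k))
print(best_k)  # 5
```

On this toy data the smallest \(k\) chases the noisy labels (high variance), while a larger \(k\) smooths over them; on other data a \(k\) that is too large can oversmooth instead (high bias), which is why the choice is made empirically.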

Ethics considerations

  • Anti-discrimination laws are enforced under two standards (both equally important):

    1. disparate treatment \(\rightarrow\) means that the differential treatment is intentional

    2. disparate impact \(\rightarrow\) means that the differential treatment is unintentional or implicit (some examples include advancing mortgage credit, employment selection, predictive policing)

  • Anti-discrimination Laws

    • Civil Rights Acts of 1964 and 1991
    • Americans with Disabilities Act
    • Genetic Information Nondiscrimination Act
    • Equal Credit Opportunity Act
    • Fair Housing Act
  • Questions to ask yourself in every single data analysis you perform (taken from Data Science for Social Good at the University of Chicago):

    • What biases may exist in the data you’ve been given? How can you find out?
    • How will your choices with tuning parameters affect different populations represented in the data?
    • How do you know you aren’t getting the right answer to the wrong question?
    • How would you justify what you’d built to someone whose welfare is made worse off by the implementation of your algorithm?
    • See the slides on bias in modeling (9/23/21) for cases where there are no inherent biases but the structure of the data creates unequal model results.
  • What type of feature engineering is required for \(k\)-NN?

  • Why is it recommended that \(k\)-NN not be used on large datasets?

  • (For a given \(k\)) why does \(k\)-NN use more computation time to test than to train? [n.b., the opposite is true for the majority of classification and regression algorithms.]

  • If the model produces near perfect predictions on the test data, what are some potential concerns about putting that model into production?
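For the question above about the feature engineering \(k\)-NN needs, the following sketch shows why predictors are usually centered and scaled before distances are computed: without scaling, the predictor measured in the largest units dominates the distance. It is a plain Python illustration, roughly in the spirit of what a normalization step such as `step_normalize()` does in a recipe; the helper names and the age/income toy data are assumptions made for this example.

```python
from math import dist
from statistics import mean, stdev


def standardize(train_X, points):
    """Center and scale each column of `points` using the TRAINING mean and sd."""
    columns = list(zip(*train_X))
    centers = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]
    return [
        tuple((value - c) / s for value, c, s in zip(point, centers, scales))
        for point in points
    ]


def nearest_label(train_X, train_y, new_point):
    """1-NN: return the label of the single closest training point."""
    closest = min(range(len(train_X)), key=lambda i: dist(train_X[i], new_point))
    return train_y[closest]


# Hypothetical data: (age in years, income in dollars) with a yes/no outcome.
train_X = [(25, 50000), (30, 52000), (60, 50500), (65, 52500)]
train_y = ["no", "no", "yes", "yes"]
new_point = (28, 50600)

# Raw units: income differences swamp age differences.
raw_prediction = nearest_label(train_X, train_y, new_point)

# Standardized: both predictors contribute on a comparable scale.
scaled_train = standardize(train_X, train_X)
scaled_new = standardize(train_X, [new_point])[0]
scaled_prediction = nearest_label(scaled_train, train_y, scaled_new)

print(raw_prediction, scaled_prediction)  # yes no
```

Note that the new point is scaled with the training mean and standard deviation, so no information from the data being predicted leaks into the preprocessing.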

Slides

Additional Resources

  • Hilary Mason describing what machine learning is to 5 different people.

  • Julia Silge’s blog is full of complete tidymodels examples and screencasts.

  • Alexandria Ocasio-Cortez, Jan 22, 2019 MLK event with Ta-Nehisi Coates

  • S. Barocas and A. Selbst, “Big Data’s Disparate Impact”, California Law Review, 104:671, 2016.

  • Machine Bias in ProPublica by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, May 23, 2016

  • Algorithmic Justice League is a collective that aims to:

    • Highlight algorithmic bias through media, art, and science

    • Provide space for people to voice concerns and experiences with coded bias

    • Develop practices for accountability during design, development, and deployment of coded systems

    • Joy Buolamwini – AI, Ain’t I A Woman?
