k-Nearest Neighbors is a classification algorithm based on the premise that points which are close to one another (in some predictor space) are likely to be similar with respect to the outcome variable. Decision trees represent a set of methods where prediction is based on the majority vote or the average outcome within each region of a partition of the predictor space.
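As a minimal sketch of the k-NN idea (illustrative only, not the course's R implementation): to classify a new observation, find the \(k\) training points closest to it and take a majority vote of their labels.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Sort the training points by Euclidean distance to the new observation.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda xy: dist(xy[0], new_point))
    # Tally the outcome labels of the k closest points and return the mode.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters in a 2-D predictor space.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (2, 2), k=3))  # a point near the first cluster -> "a"
```

Note that all the work happens at prediction time (sorting distances to every training point), which is why k-NN is cheap to "train" but expensive to test.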
Class notes: k-nearest neighbors
Class notes: decision trees
James, Witten, Hastie, and Tibshirani (2021), k-Nearest Neighbors (Section 3.5), An Introduction to Statistical Learning.
James, Witten, Hastie, and Tibshirani (2021), The Basics of Decision Trees (Section 8.1), An Introduction to Statistical Learning.
Max Kuhn and Julia Silge (2021), Tidy Modeling with R
What is the “\(k\)” in \(k\)-Nearest Neighbors? What does CART stand for?
Why do most implementations of \(k\)-NN prefer odd values of \(k\)?
How does \(k\)-NN / CART make predictions on test data?
Can \(k\)-NN / CART be used for both classification and regression or only one of the two tasks?
Can you use categorical / character predictors with \(k\)-NN / CART?
How is \(k\) / tree depth chosen?
What does it mean for CART to be high variance? How do the bias and variance change for different values of \(k\) in \(k\)-NN?
What are the advantages of the \(k\)-NN / CART algorithm?
What are the disadvantages of the \(k\)-NN / CART algorithm?
What type of feature engineering is required for \(k\)-NN / CART?
Why is it recommended that \(k\)-NN not be used on large datasets?
(For a given \(k\)) why does \(k\)-NN use more computation time to test than to train? [n.b., the opposite is true for the majority of classification and regression algorithms.]
If the model produces near perfect predictions on the test data, what are some potential concerns about putting that model into production?
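Several of the questions above concern how CART chooses its splits. As a minimal sketch (illustrative only, not the course's R code): CART greedily searches every predictor and every candidate threshold for the split that most reduces node impurity, here measured by the Gini index.

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Search every predictor and threshold for the split that minimizes
    the size-weighted Gini impurity of the two child regions."""
    best = None
    for j in range(len(X[0])):               # each predictor
        for t in sorted({x[j] for x in X}):  # each candidate threshold
            left = [lab for x, lab in zip(X, y) if x[j] <= t]
            right = [lab for x, lab in zip(X, y) if x[j] > t]
            if not left or not right:
                continue  # skip splits that leave one region empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # (weighted impurity, predictor index, threshold)

# Same toy data: two pure clusters, so one split separates them perfectly.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["a", "a", "a", "b", "b", "b"]
print(best_split(X, y))  # -> (0.0, 0, 2): split on predictor 0 at 2
```

A full tree repeats this search recursively within each child region until a stopping rule (e.g., maximum depth or minimum node size) is reached; prediction is then the majority class of the region a new point falls into.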
In-class slides: k-NN, 10/26/21.
In-class slides: decision trees, 10/28/21.
With the help of the Rand Corp., the city tried to measure fire response times, identify redundancies in service, and close or re-allocate fire stations accordingly. What resulted, though, was a perfect storm of bad data: The methodology was flawed, the analysis was rife with biases, and the results were interpreted in a way that stacked the deck against poorer neighborhoods. The slower response times allowed smaller fires to rage uncontrolled in the city’s most vulnerable communities.
SF vs. NYC housing – a great example of a classification tree.
Julia Silge's blog <a href="https://juliasilge.com/blog/scooby-doo/" target="_blank">Tuning Decision Trees</a>.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hardin47/m154-comp-stats, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".