Clustering

November 18 + 20 + 25, 2024

Jo Hardin

Agenda 11/18/24

  1. unsupervised learning
  2. distance metrics
  3. hierarchical clustering

Unsupervised learning

Grouping or categorizing observational units (objects) without any pre-assigned labels or scores (no outcome information!)

Distance metric (mathematically)

  1. \(d({\textbf x}, {\textbf y}) \geq 0\)
  2. \(d({\textbf x}, {\textbf y}) = d({\textbf y}, {\textbf x})\)
  3. \(d({\textbf x}, {\textbf y}) = 0\) iff \({\textbf x} = {\textbf y}\)
  4. \(d({\textbf x}, {\textbf y}) \leq d({\textbf x}, {\textbf z}) + d({\textbf z}, {\textbf y})\) for all vectors \({\textbf z}\) (the triangle inequality).

Distance measures (for clustering)

  • Euclidean Distance

\[d_E({\textbf x}, {\textbf y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}\]

  • Pearson Correlation Distance

\[d_P({\textbf x}, {\textbf y}) = 1 - r_P ({\textbf x}, {\textbf y})\] \[\mbox{ or } d_P({\textbf x}, {\textbf y}) = 1 - |r_P ({\textbf x}, {\textbf y})|\] \[\mbox{ or } d_P({\textbf x}, {\textbf y}) = 1 - (r_P ({\textbf x}, {\textbf y}))^2\]

Correlation distance isn’t a distance metric!

x1 <- c(1,2,3)
x2 <- c(1, 4, 10)
x3 <- c(9, 2, 2)

# d(1,2)
1 - cor(x1, x2)
[1] 0.01801949
# d(1,3)
1 - cor(x1, x3)
[1] 1.866025
# d(2,3)
1 - cor(x2, x3)
[1] 1.755929
# d(1,3) > d(1,2) + d(2,3)
1 - cor(x1, x2) + 1 - cor(x2, x3)
[1] 1.773948

Correlation distance isn’t a distance metric!

Using absolute distance doesn’t fix things.

# d(1,2)
1 - abs(cor(x1, x2))
[1] 0.01801949
# d(1,3)
1 - abs(cor(x1, x3))
[1] 0.1339746
# d(2,3)
1 - abs(cor(x2, x3))
[1] 0.2440711
# d(2,3) > d(1,2) + d(1,3)
1 - abs(cor(x1, x2)) + 1 - abs(cor(x1, x3))
[1] 0.1519941

Distance measures (for clustering)

  • Cosine Distance

\[d_C({\textbf x}, {\textbf y}) = 1 - \frac{{\textbf x} \cdot {\textbf y}}{|| {\textbf x} || \, ||{\textbf y}||}\] \[= 1 - \frac{\sum_{i=1}^p x_i y_i}{\sqrt{\sum_{i=1}^p x_i^2 \sum_{i=1}^p y_i^2}}\] \[= 1 - r_P ({\textbf x}, {\textbf y}) \ \ \ \ \mbox{if } \overline{\textbf x} = \overline{\textbf y} = 0\]
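
A quick numerical check of the last equality (a sketch; x and y here are made-up vectors): once each vector is mean-centered, the cosine distance matches \(1 - r_P\).

x <- c(2, -1, 3)
y <- c(0, 1, 4)
# center each vector so its mean is zero
xc <- x - mean(x)
yc <- y - mean(y)
# cosine distance from the definition, on the centered vectors
1 - sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
# Pearson correlation distance (correlation is unchanged by centering)
1 - cor(x, y)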

  • Hamming Distance

\[d_H({\textbf x}, {\textbf y}) = \sum_{i=1}^p I(x_i \ne y_i)\]

The Hamming distance across the two DNA strands pictured in the slides is 7.
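
In R, the Hamming distance between two equal-length sequences can be computed by comparing them position by position (a sketch with made-up strands, not the ones from the figure):

# two made-up DNA strands of the same length, split into individual bases
strand1 <- strsplit("ATGCATGCAT", "")[[1]]
strand2 <- strsplit("ATGGATCCTT", "")[[1]]
# Hamming distance: count the positions where the strands disagree
sum(strand1 != strand2)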

dist function in R

The function dist in R calculates the distances given above.
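
For example, using the vectors x1, x2, and x3 from the correlation example above (a small sketch; dist() works on the rows of a matrix or data frame):

# Euclidean distances (the default) between the three vectors
dist(rbind(x1, x2, x3))
# other metrics are available through the method argument, e.g. Manhattan
dist(rbind(x1, x2, x3), method = "manhattan")
# correlation distance is not built in, but can be constructed by hand
as.dist(1 - cor(t(rbind(x1, x2, x3))))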

String distances

Comparison of string distance metrics from https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html.

Hierarchical Clustering

is a set of nested clusters that are organized as a tree. Note that objects that belong to a child cluster also belong to the parent cluster.

Agglomerative Hierarchical Clustering Algorithm

  1. Begin with \(n\) observations and a measure (such as Euclidean distance) of all the \({n \choose 2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
  2. For \(i = n, n - 1, \ldots , 2\):
    1. Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
    2. Compute the new pairwise inter-cluster dissimilarities among the \(i - 1\) remaining clusters.

Definitions

Agglomerative methods start with each object (e.g., gene, penguin, etc.) in its own group. Groups are merged until all objects are together in one group.

Divisive methods start with all objects in one group and break up the groups sequentially until all objects are individuals.

Single Linkage algorithm defines the distance between groups as that of the closest pair of individuals.

Complete Linkage algorithm defines the distance between groups as that of the farthest pair of individuals.

Average Linkage algorithm defines the distance between groups as the average of the distances between all pairs of individuals across the groups.

Toy example

of Single Linkage Agglomerative Hierarchical Clustering

    A    B    C    D    E
A   0
B   0.2  0
C   0.6  0.5  0
D   1.0  0.9  0.4  0
E   0.9  0.8  0.5  0.3  0

See the class notes to walk through the process; a quick R check is sketched below.
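
The same answer can be checked in R by entering the table above as a matrix (a sketch):

# the toy distance matrix from the table above
toy_mat <- matrix(c(0,   0.2, 0.6, 1,   0.9,
                    0.2, 0,   0.5, 0.9, 0.8,
                    0.6, 0.5, 0,   0.4, 0.5,
                    1,   0.9, 0.4, 0,   0.3,
                    0.9, 0.8, 0.5, 0.3, 0),
                  nrow = 5,
                  dimnames = list(LETTERS[1:5], LETTERS[1:5]))
# single linkage agglomerative clustering on the toy distances
toy_hclust <- hclust(as.dist(toy_mat), method = "single")
plot(toy_hclust)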

Hierarchical clustering in R

penguins_h <- penguins |>
  drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  mutate(across(bill_length_mm:body_mass_g, scale))

penguin_hclust <- penguins_h |>
  dist() |>
  hclust(method = "complete")

penguin_hclust

Call:
hclust(d = dist(penguins_h), method = "complete")

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 342 
penguin_dend <- as.dendrogram(penguin_hclust)

The basic dendrogram

plot(penguin_hclust)

Colored dendrogram
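
One way to color the dendrogram branches (a sketch, assuming the dendextend package is installed):

library(dendextend)
# cut the tree into three clusters and color the branches accordingly
penguin_dend |>
  color_branches(k = 3) |>
  plot()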

Agenda 11/20 + 25/24

  1. k-means clustering
  2. k-medoids clustering

\(k\)-means Clustering

\(k\)-means clustering is an unsupervised partitioning algorithm: it seeks the partition of the observations into \(K\) clusters \(C_1, \ldots, C_K\) that minimizes the within-cluster sum of squares:

\[\text{arg}\,\min\limits_{C_1, \ldots, C_K} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \overline{x}_{kj})^2 \Bigg\}\]

Monsters clustering

Monsters as cluster centers moving around throughout the k-means algorithm.

Artwork by Allison Horst.

A fun applet!!

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

\(k\)-Means Clustering Algorithm

  1. Randomly assign a number, from 1 to \(k\), to each of the observations. These serve as initial cluster assignments for the observations.
  2. Iterate until the cluster assignments stop changing:
    1. For each of the \(k\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of the \(p\) feature means for the observations in the \(k^{th}\) cluster.
    2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
  3. Ties? Do something consistent: for example, leave the observation in its current cluster. (A bare-bones R sketch of this loop follows below.)
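
A bare-bones version of this loop might look like the following (a sketch only; in practice use the built-in kmeans() function, and note that empty clusters and other edge cases are not handled here):

my_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: random initial cluster assignments
  assignments <- sample(1:k, nrow(x), replace = TRUE)
  for (iter in 1:max_iter) {
    # step 2(a): each centroid is the vector of feature means for its cluster
    centers <- sapply(1:k, function(j) colMeans(x[assignments == j, , drop = FALSE]))
    # step 2(b): squared Euclidean distance from every observation to every centroid
    dists <- sapply(1:k, function(j) colSums((t(x) - centers[, j])^2))
    # reassign each observation to its closest centroid (ties broken consistently)
    new_assignments <- max.col(-dists, ties.method = "first")
    if (all(new_assignments == assignments)) break  # assignments stopped changing
    assignments <- new_assignments
  }
  list(cluster = assignments, centers = t(centers))
}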

Convergence? Yes! (local…)

  1. If a point is “closer” to a different center, moving it will lower the objective function.

  2. Averages minimize squared differences, so taking the new average will result in a lower objective function.

  3. If a point is equidistant from two clusters, the point won’t move.

  4. The algorithm must converge in a finite number of steps: the objective function decreases (or stays the same) at every step, and there are only finitely many possible partitions of the points.

Scaling
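
The data set norm_clust is not shown on these slides; one stand-in you could simulate (an assumption: two normal variables, with x2 on a much larger scale than x1) is:

# hypothetical stand-in for norm_clust (not the data used in class):
# x1 separates two groups on a small scale, x2 is pure noise on a large scale
set.seed(47)
norm_clust <- tibble(
  x1 = rnorm(100, mean = rep(c(0, 3), each = 50), sd = 1),
  x2 = rnorm(100, mean = 0, sd = 30)
)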

norm_clust |>
  kmeans(centers = 2) |>
  augment(norm_clust) |>
  ggplot() + 
  geom_point(aes(x = x1, 
                 y = x2, 
                 color = .cluster)) +
  ggtitle("k-means (k=2) on raw data")

norm_clust |>
  mutate(across(everything(), 
                scale)) |>
  kmeans(centers = 2) |>
  augment(norm_clust) |>
  ggplot() + 
  geom_point(aes(x = x1, 
                 y = x2, 
                 color = .cluster)) +
  ggtitle("k-means (k=2) on scaled data")

k-Medoids

Shortcomings of k-means

  • The center is calculated as an average, which ties the method to Euclidean distance

  • Because the center moves at each iteration, the distances to it must be re-calculated

  • As a result, essentially only Euclidean distance makes sense with \(k\)-means

Partitioning Around Medoids (PAM)

Find the observations (data values!) \(m_k\) that solve:

\[\text{arg}\,\min\limits_{C_1, \ldots, C_K} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k}d(x_i, m_k) \Bigg\}\]
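
In R, PAM is implemented in the cluster package; a sketch using the scaled penguin data (penguins_h) from the hierarchical clustering slides:

library(cluster)
penguin_pam <- penguins_h |>
  pam(k = 3)
# the medoids are actual observations from the data set
penguin_pam$medoids
# cluster assignments for each observation
penguin_pam$clustering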

Evaluating clustering (which \(k\)?)

  • Silhouette width (use the \(k\) with the largest average silhouette width)

  • Elbow plot (use the \(k\) at the elbow of the plot of \(k\) vs. within-cluster sum of squares; see the sketch after this list)
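
A sketch of an elbow plot for the scaled penguin data (penguins_h), assuming the tidyverse is loaded as elsewhere in these slides:

tibble(k = 1:10) |>
  mutate(kmeans_fit = map(k, ~ kmeans(penguins_h, centers = .x)),
         wss = map_dbl(kmeans_fit, "tot.withinss")) |>
  ggplot(aes(x = k, y = wss)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  labs(y = "total within-cluster sum of squares",
       title = "elbow plot for the scaled penguin data")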

Silhouette width

Consider observation \(i \in\) cluster \(C_1\). Let

\[d(i, C_k) = \mbox{average dissimilarity of } i \mbox{ to all objects in cluster } C_k\] \[a(i) = \mbox{average dissimilarity of } i \mbox{ to all other objects in } C_1.\] \[b(i) = \min_{C_k \ne C_1} d(i,C_k) = \mbox{distance to the next closest neighbor cluster}\] \[s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}\] \[\mbox{average}_{i \in C_1} s(i) = \mbox{average silhouette width for cluster } C_1\]

Note that if \(a(i) < b(i)\) then \(i\) is well classified, with \(s(i)\) close to its maximum of 1 in the best case. If \(a(i) > b(i)\) then \(i\) is poorly classified, with \(s(i)\) negative and close to its minimum of \(-1\) in the worst case.
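
For example, the average silhouette width of a \(k\)-means solution can be computed with the cluster package (a sketch using the scaled penguin data):

library(cluster)
penguin_km <- kmeans(penguins_h, centers = 3)
# silhouette() takes the cluster labels and the pairwise dissimilarities
penguin_sil <- silhouette(penguin_km$cluster, dist(penguins_h))
# average silhouette width across all observations (larger is better)
summary(penguin_sil)$avg.width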