Clustering

November 19 + 24, 2025

Jo Hardin

Agenda 11/19/25

  1. unsupervised learning
  2. distance metrics / dissimilarity
  3. hierarchical clustering

Unsupervised learning

Grouping or categorizing observational units (objects) without any pre-assigned labels or scores (no outcome information!)


Distance metric (mathematically)

  1. \(d({\textbf x}, {\textbf y}) \geq 0\)
  2. \(d({\textbf x}, {\textbf y}) = d({\textbf y}, {\textbf x})\)
  3. \(d({\textbf x}, {\textbf y}) = 0\) iff \({\textbf x} = {\textbf y}\)
  4. \(d({\textbf x}, {\textbf y}) \leq d({\textbf x}, {\textbf z}) + d({\textbf z}, {\textbf y})\) for all other vectors \({\textbf z}\).

Distance measures (for clustering)

  • Euclidean Distance

\[d_E({\textbf x}, {\textbf y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}\]

  • Pearson Correlation Distance / Dissimilarity

\[d_P({\textbf x}, {\textbf y}) = 1 - r_P ({\textbf x}, {\textbf y})\] \[\mbox{ or } d_P({\textbf x}, {\textbf y}) = 1 - |r_P ({\textbf x}, {\textbf y})|\] \[\mbox{ or } d_P({\textbf x}, {\textbf y}) = 1 - (r_P ({\textbf x}, {\textbf y}))^2\]

Correlation dissimilarity isn’t a distance metric!

x1 <- c(1,2,3)
x2 <- c(1, 4, 10)
x3 <- c(9, 2, 2)

# d(1,2)
1 - cor(x1, x2)
[1] 0.01801949
# d(1,3)
1 - cor(x1, x3)
[1] 1.866025
# d(2,3)
1 - cor(x2, x3)
[1] 1.755929
# triangle inequality fails: d(1,3) = 1.866 > d(1,2) + d(2,3)
1 - cor(x1, x2) + 1 - cor(x2, x3)
[1] 1.773948

Correlation dissimilarity isn’t a distance metric!

Using absolute value doesn’t fix things.

# d(1,2)
1 - abs(cor(x1, x2))
[1] 0.01801949
# d(1,3)
1 - abs(cor(x1, x3))
[1] 0.1339746
# d(2,3)
1 - abs(cor(x2, x3))
[1] 0.2440711
# triangle inequality still fails: d(2,3) = 0.244 > d(1,2) + d(1,3)
1 - abs(cor(x1, x2)) + 1 - abs(cor(x1, x3))
[1] 0.1519941
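For contrast, a quick check (not part of the original example) that Euclidean distance on the same three vectors does satisfy the triangle inequality:

d <- dist(rbind(x1, x2, x3))
d                        # entries are d(1,2), d(1,3), d(2,3)
d[2] <= d[1] + d[3]      # d(1,3) <= d(1,2) + d(2,3) is TRUE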

More dissimilarity measures

  • Cosine Distance

\[d_C({\textbf x}, {\textbf y}) = 1 - \frac{{\textbf x} \cdot {\textbf y}}{|| {\textbf x} || \, ||{\textbf y}||}\] \[= 1 - \frac{\sum_{i=1}^p x_i y_i}{\sqrt{\sum_{i=1}^p x_i^2 \sum_{i=1}^p y_i^2}}\] \[= 1 - r_P ({\textbf x}, {\textbf y}) \ \ \ \ \mbox{if } \overline{\textbf x} = \overline{\textbf y} = 0\]
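A quick numeric check of that last identity (an illustration; cos_dissim is a hypothetical helper, not from the slides):

# hypothetical helper: cosine dissimilarity computed directly from the definition
cos_dissim <- function(x, y) 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

x <- c(1, 2, 3)
y <- c(1, 4, 10)

cos_dissim(x, y)                       # cosine dissimilarity of the raw vectors
cos_dissim(x - mean(x), y - mean(y))   # after centering ...
1 - cor(x, y)                          # ... it matches 1 - Pearson correlation (0.018, as above)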

More dissimilarity measures

  • Hamming Distance

\[\begin{align} d_H({\textbf x}, {\textbf y}) = \sum_{i=1}^p I(x_i \ne y_i) \end{align}\]

In the figure from class (two DNA strands), the Hamming distance between the strands is 7.
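The strands from that figure aren't reproduced here, but as a sketch, this classic textbook pair of strands also differs in 7 positions:

strand1 <- strsplit("GAGCCTACTAACGGGAT", "")[[1]]
strand2 <- strsplit("CATCGTAATGACGGCCT", "")[[1]]
sum(strand1 != strand2)   # Hamming distance: number of positions that differ (= 7)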

dist function in R

The function dist() in R calculates Euclidean (and several other) distances; dissimilarities such as \(1 - r_P\) can be computed by hand and coerced into a dist object with as.dist().
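A short illustration, reusing the three vectors from the earlier correlation example:

x <- rbind(x1 = c(1, 2, 3),
           x2 = c(1, 4, 10),
           x3 = c(9, 2, 2))

dist(x)                        # Euclidean distance (the default)
dist(x, method = "manhattan")  # other built-in methods via the method argument
as.dist(1 - cor(t(x)))         # correlation dissimilarity, coerced to a dist object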

String distances

Comparison of string distance metrics from https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html.

Hierarchical Clustering

is a set of nested clusters that are organized as a tree. Note that objects that belong to a child cluster also belong to the parent cluster.

Agglomerative Hierarchical Clustering Algorithm

  1. Begin with \(n\) observations and a measure (such as Euclidean distance) of all the \({n \choose 2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
  2. For \(i = n, n - 1, \ldots , 2\):
    1. Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
    2. Compute the new pairwise inter-cluster dissimilarities among the \(i - 1\) remaining clusters.

Definitions

Agglomerative methods start with each object (e.g., gene, penguin, etc.) in its own group. Groups are merged until all objects are together in one group.

Divisive methods start with all objects in one group and break up the groups sequentially until all objects are individuals.

Single Linkage algorithm defines the dissimilarity between groups as that of the closest pair of individuals.

Complete Linkage algorithm defines the dissimilarity between groups as that of the farthest pair of individuals.

Average Linkage algorithm defines the dissimilarity between groups as the average of the dissimilarities between all pairs of individuals across the groups.

Toy example

of Single Linkage Agglomerative Hierarchical Clustering

    A    B    C    D    E
A   0
B   0.2  0
C   0.6  0.5  0
D   1.0  0.9  0.4  0
E   0.9  0.8  0.5  0.3  0

See the class notes to walk through the process.
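A base R sketch of the same toy example (the dist object is built from the table above; the heights in the resulting dendrogram are the single linkage merge dissimilarities):

d_toy <- as.dist(matrix(c(0,   0.2, 0.6, 1,   0.9,
                          0.2, 0,   0.5, 0.9, 0.8,
                          0.6, 0.5, 0,   0.4, 0.5,
                          1,   0.9, 0.4, 0,   0.3,
                          0.9, 0.8, 0.5, 0.3, 0),
                        nrow = 5,
                        dimnames = list(LETTERS[1:5], LETTERS[1:5])))

toy_hc <- hclust(d_toy, method = "single")
toy_hc$height   # merges at 0.2 (A,B), 0.3 (D,E), 0.4 (C with DE), 0.5 (AB with CDE)
plot(toy_hc)    # dendrogram of the toy example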

Should we use hierarchical clustering?

strengths

  • Provides a clustering for a range of values of \(k\).
  • Can be used with any distance metric / dissimilarity.

shortcomings

  • Forces a hierarchical structure (and is almost always agglomerative).
  • Linkage choice can change the outcome.

Hierarchical clustering in R

library(tidymodels)
library(tidyclust)   # hier_clust(), k_means(), and the extract_*() helpers
# (penguins here is assumed to come from the palmerpenguins package)

penguins <- penguins |>
  select(bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g,
         island, species) |>
  drop_na()
hc_rec <- recipe(~ ., data = penguins) |>
  update_role(island, new_role = "ID") |> 
  update_role(species, new_role = "ID") |>
  step_rm(all_nominal()) |>
  step_normalize(all_predictors())
hc_model <- hier_clust(
  num_clusters = 3,
  linkage_method = "average") |> 
  set_engine("stats")

hc_model
Hierarchical Clustering Specification (partition)

Main Arguments:
  num_clusters = 3
  linkage_method = average

Computational engine: stats 
hc_work <- workflow() |> 
  add_recipe(hc_rec) |> 
  add_model(hc_model)

hc_work
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: hier_clust()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_rm()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Hierarchical Clustering Specification (partition)

Main Arguments:
  num_clusters = 3
  linkage_method = average

Computational engine: stats 
hc_fit <- hc_work |>
  fit(data = penguins)

hc_fit |>
  summary()
        Length Class      Mode   
pre     3      stage_pre  list   
fit     2      stage_fit  list   
post    1      stage_post list   
trained 1      -none-     logical
hc_fit |> extract_fit_summary() |> str()
List of 7
 $ cluster_names         : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
 $ centroids             : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ bill_length_mm   : num [1:3] -0.369 0.603 2.253
  ..$ bill_depth_mm    : num [1:3] 0.617 -1.123 -0.368
  ..$ flipper_length_mm: num [1:3] -0.65 1.13 2.05
  ..$ body_mass_g      : num [1:3] -0.612 1.06 1.977
 $ n_members             : int [1:3] 219 119 4
 $ sse_within_total_total: num [1:3] 283.19 111.49 2.01
 $ sse_total             : num 656
 $ orig_labels           : NULL
 $ cluster_assignments   : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 1 1 1 1 1 1 1 1 1 ...
hc_fit |> extract_centroids()
# A tibble: 3 × 5
  .cluster  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Cluster_1         -0.369         0.617            -0.650      -0.612
2 Cluster_2          0.603        -1.12              1.13        1.06 
3 Cluster_3          2.25         -0.368             2.05        1.98 

Prediction

To predict the cluster assignment for a new observation, we find the closest cluster. How we measure “closeness” is dependent on the specified type of linkage in the model:

  • single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.

  • complete linkage: The new observation is assigned to the cluster with the smallest maximum dissimilarities between training observations and the new observation.

  • average linkage: The new observation is assigned to the cluster with the smallest average dissimilarities between training observations and the new observation.

  • centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for \(k\)-means.

  • Ward’s method: The new observation is assigned to the cluster with the smallest increase in error sum of squares (ESS) due to the new addition. The ESS is computed as the sum of squared Euclidean distances between the observations in a cluster and the centroid of that cluster.

Prediction

hc_preds <- hc_fit |> predict(penguins)

hc_preds
# A tibble: 342 × 1
   .pred_cluster
   <fct>        
 1 Cluster_1    
 2 Cluster_1    
 3 Cluster_1    
 4 Cluster_1    
 5 Cluster_1    
 6 Cluster_1    
 7 Cluster_1    
 8 Cluster_1    
 9 Cluster_1    
10 Cluster_1    
# ℹ 332 more rows

It’s important to note that there is no guarantee that predict() on the training data will produce the same results as extract_cluster_assignment(). The agglomeration process produces a particular partition of the training data; but when a training observation is treated as new data, it is predicted in the same manner as truly new information.

Prediction can be wonky

Consider a silly example using complete linkage:

\[x_1 = 1, x_2 = 2, x_3 = 5, x_4 = 9, x_5 = 10, x_6 = -3\]

  • Initial clusters (\(k=6\)): each point is in its own cluster.
  • \(k = 4\): \(\{x_1, x_2 \}\) and \(\{ x_3 \}\) and \(\{x_4, x_5 \}\) and \(\{x_6\}\)
  • \(k=3\): \(\{x_1, x_2, x_3 \}\) and \(\{x_4, x_5 \}\) and \(\{x_6 \}\)
  • \(k=2\): \(\{x_1, x_2, x_3, x_6 \}\) and \(\{x_4, x_5 \}\)

Note that, under complete linkage, \(x_3\) is closer to \(\{x_4, x_5 \}\) (distance = 5) than to \(\{x_1, x_2, x_3, x_6 \}\) (distance = 8), so predicting \(x_3\) as a new observation assigns it to the other cluster.
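A quick base R check of this toy example (an illustration only, separate from the tidyclust workflow above):

x <- c(1, 2, 5, 9, 10, -3)
toy_complete <- hclust(dist(x), method = "complete")
cutree(toy_complete, k = 2)   # at k = 2, x3 is labeled with {x1, x2, x6}

# treated as a "new" point, x3 is closer (complete linkage) to the other cluster:
max(abs(5 - c(9, 10)))        # distance 5 to {x4, x5}
max(abs(5 - c(1, 2, -3)))     # distance 8 to {x1, x2, x6}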

Prediction vs. label

bind_cols(hc_preds,
  extract_cluster_assignment(hc_fit) ) |> 
  select(.pred_cluster, .cluster) |> 
  table()
             .cluster
.pred_cluster Cluster_1 Cluster_2 Cluster_3
    Cluster_1       218         0         0
    Cluster_2         0        99         0
    Cluster_3         1        20         4

The basic dendrogram

We have to extract the hclust object from the fit and transform it to make the dendrogram.

library(ggdendro)

penguin_hclust <- hc_fit |> 
  extract_fit_engine()

penguin_dend <- penguin_hclust |> 
  as.dendrogram()

ggdendro::ggdendrogram(penguin_dend)

Zooming in on a cluster

ggdendrogram(penguin_dend[[1]])

ggdendrogram(penguin_dend[[2]])

ggdendrogram(penguin_dend[[1]][[1]])

Colored dendrogram

library(dendextend)
penguin_dend |>
  set("branches_k_color", k = 3) |>  # color of the branches
  set("labels_colors", k = 3) |>     # color of the labels
  set("branches_lwd", 0.3) |>        # width of the branches
  set("labels_cex", 0.3) |>          # size of the label 
  as.ggdend() |> 
  ggplot(horiz = TRUE)

We don’t usually have a way to check (i.e., “unsupervised”)

penguins |> 
  drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  select(island) |> 
  cbind(cluster = cutree(penguin_hclust, k = 3) ) |> 
  table()
           cluster
island        1   2   3
  Biscoe     44 119   4
  Dream     124   0   0
  Torgersen  51   0   0
penguins |> 
  drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  select(species) |> 
  cbind(cluster = cutree(penguin_hclust, k = 3) ) |> 
  table()
           cluster
species       1   2   3
  Adelie    151   0   0
  Chinstrap  68   0   0
  Gentoo      0 119   4

Agenda 11/24/25

  1. \(k\)-means clustering
  2. \(k\)-medoids clustering

\(k\)-means Clustering

\(k\)-means clustering is an unsupervised partitioning algorithm designed to find the partition of the observations that minimizes the following objective function (the total within-cluster sum of squares):

\[\text{arg}\,\min\limits_{C_1, \ldots, C_K} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \overline{x}_{kj})^2 \Bigg\}\]
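The objective can be written directly in R. Below is a hypothetical helper, wss(), that evaluates the sum above for a data matrix and a given cluster assignment (an illustration only):

wss <- function(x, assign) {
  x <- as.matrix(x)
  sum(sapply(split(seq_len(nrow(x)), assign), function(idx) {
    centroid <- colMeans(x[idx, , drop = FALSE])         # per-feature mean of the cluster
    sum(sweep(x[idx, , drop = FALSE], 2, centroid)^2)    # squared deviations from the centroid
  }))
}

For any numeric data matrix x, wss(x, kmeans(x, 3)$cluster) should match the tot.withinss reported by kmeans().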

Monsters clustering

Monsters as cluster centers moving around throughout the k-means algorithm.

Artwork by Allison Horst (@allison_horst).

A fun applet!!

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

\(k\)-Means Clustering Algorithm

  1. Randomly assign a number, from 1 to \(k\), to each of the observations. These serve as initial cluster assignments for the observations.
  2. Iterate until the cluster assignments stop changing:
    1. For each of the \(k\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of the \(p\) feature means for the observations in the \(k^{th}\) cluster.
    2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
  3. Ties? Do something consistent: for example, leave in the current cluster.
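Below is a minimal base R sketch of this iteration on two standardized penguin features (an illustration of the algorithm only, not the engine used later; it assumes no cluster empties out along the way):

set.seed(47)
x <- scale(na.omit(penguins[, c("bill_length_mm", "flipper_length_mm")]))
k <- 3
cl <- sample(1:k, nrow(x), replace = TRUE)       # step 1: random initial assignments

repeat {
  # step 2a: compute each cluster centroid (k x p matrix of feature means)
  centers <- apply(x, 2, function(col) tapply(col, cl, mean))
  # step 2b: squared Euclidean distance from every observation to every centroid
  d2 <- sapply(1:k, function(j) colSums((t(x) - centers[j, ])^2))
  new_cl <- max.col(-d2)                         # closest centroid for each observation
  if (all(new_cl == cl)) break                   # stop when assignments stop changing
  cl <- new_cl
}

table(cl)   # cluster sizes at convergence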

Convergence? Yes! (local…)

  1. If a point is “closer” to a different center, moving it will lower the objective function.

  2. Averages minimize squared differences, so taking the new average will result in a lower objective function.

  3. If a point is equidistant from two clusters, the point won’t move.

  4. The algorithm must converge in a finite number of steps because there are only finitely many possible partitions of the points.

Should we use \(k\)-means?

strengths

  • No hierarchical structure / points can move from one cluster to another.
  • Can run for a range of values of \(k\).

shortcomings

  • \(k\) has to be predefined to run the algorithm.
  • \(k\)-means is based on Euclidean distance (only).

\(k\)-means clustering in R

library(ClusterR)
penguins <- penguins |>
  select(bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g,
         island, species) |>
  drop_na()
kmean_rec <- recipe(~ ., data = penguins) |>
  update_role(island, new_role = "ID") |> 
  update_role(species, new_role = "ID") |>
  step_rm(all_nominal()) |>
  step_normalize(all_predictors())
kmean_model <- k_means(
  num_clusters = 3) |> 
  set_engine("ClusterR", initializer = "random") # need the ClusterR package

kmean_model
K Means Cluster Specification (partition)

Main Arguments:
  num_clusters = 3

Engine-Specific Arguments:
  initializer = random

Computational engine: ClusterR 
kmean_work <- workflow() |> 
  add_recipe(kmean_rec) |> 
  add_model(kmean_model)

kmean_work
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: k_means()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_rm()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
K Means Cluster Specification (partition)

Main Arguments:
  num_clusters = 3

Engine-Specific Arguments:
  initializer = random

Computational engine: ClusterR 
kmean_fit <- kmean_work |>
  fit(data = penguins)

kmean_fit |>
  summary()
        Length Class      Mode   
pre     3      stage_pre  list   
fit     2      stage_fit  list   
post    1      stage_post list   
trained 1      -none-     logical
kmean_summary <- kmean_fit |> extract_fit_summary()

kmean_summary |> str()
List of 7
 $ cluster_names         : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
 $ centroids             : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ bill_length_mm   : num [1:3] -1.047 0.66 0.656
  ..$ bill_depth_mm    : num [1:3] 0.486 0.816 -1.098
  ..$ flipper_length_mm: num [1:3] -0.89 -0.286 1.157
  ..$ body_mass_g      : num [1:3] -0.769 -0.374 1.09
 $ n_members             : int [1:3] 132 87 123
 $ sse_within_total_total: num [1:3] 122 113 143
 $ sse_total             : num 1364
 $ orig_labels           : int [1:342] 1 1 1 1 1 1 1 1 2 1 ...
 $ cluster_assignments   : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 1 1 1 1 1 1 1 2 1 ...
kmean_fit |> extract_centroids()
# A tibble: 3 × 5
  .cluster  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Cluster_1         -1.05          0.486            -0.890      -0.769
2 Cluster_2          0.660         0.816            -0.286      -0.374
3 Cluster_3          0.656        -1.10              1.16        1.09 

Some fun plots

We need the stats engine to get the glance() options.

data(penguins)  # refresh the dataset

k_func <- function(k = 3, mydata = penguins) {
  
  workflow() |>
    add_recipe(recipe(~ ., data = mydata) |>   # recipe
             step_naomit(all_predictors()) |> 
             update_role(island, new_role = "ID") |> 
             update_role(species, new_role = "ID") |>
             update_role(sex, new_role = "ID") |> 
             step_rm(all_nominal()) |>
             step_normalize(all_predictors())) |>
    add_model(k_means(num_clusters = k) |>     # model
              set_engine("stats")) |>
    fit(data = mydata |> drop_na())            # fit
}  

kmax <- 9
penguin_kclusts <- 
  tibble(k = 1:kmax) |>
  mutate(
    penguin_kclust = map(k, k_func, mydata = penguins |> drop_na()),
    tidied = map(penguin_kclust, tidy),
    glanced = map(penguin_kclust, glance),
    augmented = map(penguin_kclust, augment, penguins |> drop_na())
  )

assignments <- 
  penguin_kclusts |> 
  unnest(cols = c(augmented))

clusterings <- 
  penguin_kclusts |>
  unnest(cols = c(glanced))
# within sum of squares
clusterings |>
  ggplot(aes(x = k, y = tot.withinss)) + 
  geom_line() + 
  geom_point() + ylab("") +
  ggtitle("Total Within Sum of Squares")
# via variables
assignments |>
 ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = .pred_cluster), alpha = 0.8) + 
  facet_wrap(~ k)
# via species
assignments |>
  group_by(k) |>
  select(.pred_cluster, species) |>
  table() |>
  as.data.frame() |>
    ggplot() +
    geom_tile(aes(x = .pred_cluster, y = species, fill = Freq)) + 
  facet_wrap( ~ k) + ylab("") + 
  scale_fill_gradient(low = "white", high = "red") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + 
  ggtitle("Species vs cluster prediction across different values of k")
# via islands

assignments |>
  group_by(k) |>
  select(.pred_cluster, island) |>
  table() |>
  as.data.frame() |>
    ggplot() +
    geom_tile(aes(x = .pred_cluster, y = island, fill = Freq)) + 
  facet_wrap( ~ k) + ylab("") +
  scale_fill_gradient(low = "white", high = "red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + 
  ggtitle("Island vs cluster prediction across different values of k")

Partitioning Around Medoids (PAM)

Find the observations (data values!) \(m_k\) that solve (note, we have a new objective function!):

\[\text{arg}\,\min\limits_{C_1, \ldots, C_K} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k}d(x_i, m_k) \Bigg\}\]

Important: \(m_k\) is a data point. It is not an average of points or any other summary statistic.

PAM algorithm

  1. Randomly assign a number, from 1 to \(K\), to each of the observations to serve as initial cluster assignments for the observations.

  2. Iterate until the cluster assignments stop changing:

    1. (Repeat for \(k \in \{1, 2, \ldots, K\}\)) For a given cluster, \(C_k\), find the observation in the cluster minimizing total distance to other points in that cluster: \[i^*_k = \text{arg}\,\min\limits_{i \in C_k} \sum_{i' \in C_k} d(x_i, x_{i'})\] Then \(m_k = x_{i^*_k}, k=1, 2, \ldots, K\) are the current estimates of the cluster centers.
    2. Given a current set of cluster centers \(\{m_1, m_2, \ldots, m_K\}\), minimize the total error by assigning each observation to the closest (current) cluster center: \[C_i = \text{arg}\,\min\limits_{1 \leq k \leq K} d(x_i, m_k)\]
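The pam() function in the cluster package fits this kind of model (it uses a build/swap search rather than the random-start iteration sketched above, but it targets the same objective). A quick sketch on the standardized penguin features, showing that each medoid is an actual observation:

library(cluster)

pens <- scale(na.omit(penguins[, c("bill_length_mm", "bill_depth_mm",
                                   "flipper_length_mm", "body_mass_g")]))

pam_fit <- pam(pens, k = 3)
pam_fit$id.med             # row indices of the medoids
pam_fit$medoids            # the medoids themselves: actual rows of the data
table(pam_fit$clustering)  # cluster sizes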

Should we use PAM?

strengths

  • No hierarchical structure / points can move from one cluster to another.
  • Can run for a range of values of \(k\).
  • It can use any distance / dissimilarity measure.

shortcomings

  • \(k\) has to be predefined to run the algorithm.

Evaluating clustering (which \(k\)?)

  • Silhouette Width (use the \(k\) with the largest average silhouette width) – good for PAM (because it works with any distance / dissimilarity)

  • Elbow plot (use the \(k\) at the elbow of the plot of \(k\) vs. within-cluster sum of squares) – good for \(k\)-means (because it is based on Euclidean distance)

Silhouette width

Consider observation \(i \in\) cluster \(C_1\). Let

\[d(i, C_k) = \mbox{average dissimilarity of } i \mbox{ to all objects in cluster } C_k\] \[a(i) = \mbox{average dissimilarity of } i \mbox{ to all other objects in } C_1.\] \[b(i) = \min_{C_k \ne C_1} d(i,C_k) = \mbox{distance to the next closest neighbor cluster}\] \[s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}\] \[\mbox{average}_{i \in C_1} s(i) = \mbox{average silhouette width for cluster } C_1\]

Note that if \(a(i) < b(i)\), then \(i\) is well classified (the maximum possible value is \(s(i) = 1\)); if \(a(i) > b(i)\), then \(i\) is poorly classified (the minimum possible value is \(s(i) = -1\)).

We are looking for a large silhouette width.
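A sketch of choosing \(k\) by average silhouette width with pam() from the cluster package (assuming the standardized penguin features, as in the PAM sketch above):

library(cluster)

pens <- scale(na.omit(penguins[, c("bill_length_mm", "bill_depth_mm",
                                   "flipper_length_mm", "body_mass_g")]))

avg_sil <- sapply(2:9, function(k) pam(pens, k = k)$silinfo$avg.width)
data.frame(k = 2:9, avg_sil)   # pick the k with the largest average silhouette width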