November 19 + 24, 2025
Grouping or categorizing observational units (objects) without any pre-assigned labels or scores (no outcome information!)
Latent Dirichlet Allocation: Topic Modeling of TSL Articles
Network & Clustering: Characters in ‘Love Actually’
\[d_E({\textbf x}, {\textbf y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}\]
\[d_P({\textbf x}, {\textbf y}) = 1 - r_P ({\textbf x}, {\textbf y})\] \[\mbox{ or } d_P({\textbf x}, {\textbf y}) = 1 - |r_P ({\textbf x}, {\textbf y})|\] \[\mbox{ or } d_P({\textbf x}, {\textbf y}) = 1 - (r_P ({\textbf x}, {\textbf y}))^2\]
Using the absolute value of the correlation doesn't fix things.
\[d_C({\textbf x}, {\textbf y}) = 1 - \frac{{\textbf x} \cdot {\textbf y}}{|| {\textbf x} ||\, ||{\textbf y}||}\] \[= 1 - \frac{\sum_{i=1}^p x_i y_i}{\sqrt{\sum_{i=1}^p x_i^2 \sum_{i=1}^p y_i^2}}\] \[= 1 - r_P ({\textbf x}, {\textbf y}) \ \ \ \ \mbox{if } \overline{\textbf x} = \overline{\textbf y} = 0\]
\[d_H({\textbf x}, {\textbf y}) = \sum_{i=1}^p I(x_i \ne y_i)\]
For example, the Hamming distance between two DNA strands that differ in 7 positions is 7.
dist function in R

The function dist in R calculates the distances given above.
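For example, a small sketch with made-up vectors: dist() computes distances between the rows of a matrix, and a Pearson distance can be built by hand from cor():

x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
mat <- rbind(x, y)

dist(mat, method = "euclidean")   # d_E = sqrt(sum((x - y)^2))
dist(mat, method = "manhattan")   # another built-in option

as.dist(1 - cor(t(mat)))          # d_P = 1 - r_P (0 here, since y = 2x)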
Comparison of string distance metrics from https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html.
A hierarchical clustering is a set of nested clusters that are organized as a tree. Note that objects that belong to a child cluster also belong to the parent cluster.
Agglomerative methods start with each object (e.g., gene, penguin, etc.) in its own group. Groups are merged until all objects are together in one group.
Divisive methods start with all objects in one group and break up the groups sequentially until all objects are individuals.
Single Linkage algorithm defines the dissimilarity between groups as that of the closest pair of individuals.
Complete Linkage algorithm defines the dissimilarity between groups as that of the farthest pair of individuals.
Average Linkage algorithm defines the dissimilarity between groups as the average of the dissimilarities between all pairs of individuals across the groups.
Example of Single Linkage Agglomerative Hierarchical Clustering
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 |  |  |  |  |
| B | 0.2 | 0 |  |  |  |
| C | 0.6 | 0.5 | 0 |  |  |
| D | 1 | 0.9 | 0.4 | 0 |  |
| E | 0.9 | 0.8 | 0.5 | 0.3 | 0 |
See the class notes to walk through the process.
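The same merges can be verified in code by feeding the dissimilarity matrix above to hclust() (a sketch; the object names are made up):

d <- as.dist(matrix(c(0,   0.2, 0.6, 1,   0.9,
                      0.2, 0,   0.5, 0.9, 0.8,
                      0.6, 0.5, 0,   0.4, 0.5,
                      1,   0.9, 0.4, 0,   0.3,
                      0.9, 0.8, 0.5, 0.3, 0),
                    nrow = 5,
                    dimnames = list(LETTERS[1:5], LETTERS[1:5])))

hc_single <- hclust(d, method = "single")
hc_single$merge    # the order in which singletons/groups get merged
plot(hc_single)    # dendrogram of the process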
strengths
shortcomings
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: hier_clust()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_rm()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Hierarchical Clustering Specification (partition)
Main Arguments:
num_clusters = 3
linkage_method = average
Computational engine: stats
Length Class Mode
pre 3 stage_pre list
fit 2 stage_fit list
post 1 stage_post list
trained 1 -none- logical
List of 7
$ cluster_names : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
$ centroids : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
..$ bill_length_mm : num [1:3] -0.369 0.603 2.253
..$ bill_depth_mm : num [1:3] 0.617 -1.123 -0.368
..$ flipper_length_mm: num [1:3] -0.65 1.13 2.05
..$ body_mass_g : num [1:3] -0.612 1.06 1.977
$ n_members : int [1:3] 219 119 4
$ sse_within_total_total: num [1:3] 283.19 111.49 2.01
$ sse_total : num 656
$ orig_labels : NULL
$ cluster_assignments : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 1 1 1 1 1 1 1 1 1 ...
To predict the cluster assignment for a new observation, we find the closest cluster. How we measure “closeness” depends on the type of linkage specified in the model:
single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.
complete linkage: The new observation is assigned to the cluster with the smallest maximum dissimilarity between the new observation and the cluster’s training observations.
average linkage: The new observation is assigned to the cluster with the smallest average dissimilarity between the new observation and the cluster’s training observations.
centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for k_means.
Ward’s method: The new observation is assigned to the cluster with the smallest increase in error sum of squares (ESS) due to the new addition. The ESS is computed as the sum of squared Euclidean distances between observations in a cluster, and the centroid of the cluster.
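For instance, the predictions printed below come from a call like the following (a sketch; penguin_hclust_fit is an assumed name for the fitted workflow above, and penguins_clean is the training data with missing measurements dropped):

penguin_hclust_fit |>
  predict(new_data = penguins_clean)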
# A tibble: 342 × 1
.pred_cluster
<fct>
1 Cluster_1
2 Cluster_1
3 Cluster_1
4 Cluster_1
5 Cluster_1
6 Cluster_1
7 Cluster_1
8 Cluster_1
9 Cluster_1
10 Cluster_1
# ℹ 332 more rows
It’s important to note that there is no guarantee that predict() on the training data will produce the same results as extract_cluster_assignments(). The merges performed during agglomeration result in one particular partition; but if a training observation is treated as new data, it is predicted in the same manner as truly new information.
Consider a silly example using complete linkage:
\[x_1 = 1, x_2 = 2, x_3 = 5, x_4 = 9, x_5 = 10, x_6 = -3\]
Note, using complete linkage, \(x_3\) is closer to \(\{x_4, x_5 \}\) (distance = 5) than to its own training cluster \(\{x_1, x_2, x_3, x_6 \}\) (distance = 8), so predict() would assign \(x_3\) to \(\{x_4, x_5\}\) even though the agglomeration placed it with \(\{x_1, x_2, x_6\}\).
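A quick sketch verifying the arithmetic (complete linkage takes the farthest pair):

x <- c(1, 2, 5, 9, 10, -3)
hc <- hclust(dist(x), method = "complete")
cutree(hc, k = 2)             # the agglomeration puts x3 with {x1, x2, x6}

max(abs(5 - c(1, 2, 5, -3)))  # 8: distance from x3 to its own training cluster
max(abs(5 - c(9, 10)))        # 5: so predict() places x3 with {x4, x5}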
We have to get the hclust “object” from the fit, and transform it to make the dendrogram.
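A sketch, again assuming the fitted workflow is called penguin_hclust_fit:

penguin_hclust <- penguin_hclust_fit |>
  extract_fit_engine()   # the underlying stats::hclust object

plot(penguin_hclust)     # base-R dendrogram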
penguins |>
drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
select(island) |>
cbind(cluster = cutree(penguin_hclust, k = 3) ) |>
table()

           cluster
island        1   2   3
  Biscoe     44 119   4
  Dream     124   0   0
  Torgersen  51   0   0
penguins |>
drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
select(species) |>
cbind(cluster = cutree(penguin_hclust, k = 3) ) |>
table()

           cluster
species       1   2   3
  Adelie    151   0   0
  Chinstrap  68   0   0
  Gentoo      0 119   4
\(k\)-means clustering is an unsupervised partitioning algorithm designed to find a partition of the observations that minimizes the following objective function (i.e., find the smallest within-cluster sum of squares):
\[\text{arg}\,\min\limits_{C_1, \ldots, C_K} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \overline{x}_{kj})^2 \Bigg\}\]
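To make the objective concrete, here is a sketch (made-up object names) that computes the within-cluster sum of squares directly from the definition and checks it against what kmeans() reports as tot.withinss:

library(tidyverse)
library(palmerpenguins)

X <- penguins |>
  drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  scale()

set.seed(47)
km <- kmeans(X, centers = 3, nstart = 25)

wss <- sum(sapply(1:3, function(k) {
  Xk <- X[km$cluster == k, , drop = FALSE]
  sum(scale(Xk, center = TRUE, scale = FALSE)^2)   # sum_j sum_i (x_ij - xbar_kj)^2
}))
all.equal(wss, km$tot.withinss)                    # TRUE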
Artwork by Allison Horst.
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
If a point is “closer” to a different center, moving it will lower the objective function.
Averages minimize squared differences, so taking the new average will result in a lower objective function.
If a point is equidistant from two cluster centers, the point won’t move.
The algorithm must converge in a finite number of steps: there are only finitely many possible partitions of the points, and the objective function strictly decreases whenever anything changes, so no partition can be visited twice.
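To see the two alternating steps explicitly, here is a bare-bones sketch of the algorithm (illustrative only; the names are made up and there is no guard against empty clusters):

lloyd <- function(X, K, max_iter = 100) {
  centers <- X[sample(nrow(X), K), , drop = FALSE]      # random initial centers
  assignments <- rep(0L, nrow(X))
  for (iter in 1:max_iter) {
    # step 1: assign each point to its closest center
    d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, centers[k, ])^2))
    new_assignments <- apply(d2, 1, which.min)
    if (all(new_assignments == assignments)) break      # stable: converged
    assignments <- new_assignments
    # step 2: recompute each center as the mean of its members
    for (k in 1:K) centers[k, ] <- colMeans(X[assignments == k, , drop = FALSE])
  }
  list(cluster = assignments, centers = centers, iterations = iter)
}

Each pass either changes an assignment (strictly lowering the objective) or stops, which is exactly the convergence argument above.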
strengths
shortcomings
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: k_means()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_rm()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
K Means Cluster Specification (partition)
Main Arguments:
num_clusters = 3
Engine-Specific Arguments:
initializer = random
Computational engine: ClusterR
List of 7
$ cluster_names : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
$ centroids : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
..$ bill_length_mm : num [1:3] -1.047 0.66 0.656
..$ bill_depth_mm : num [1:3] 0.486 0.816 -1.098
..$ flipper_length_mm: num [1:3] -0.89 -0.286 1.157
..$ body_mass_g : num [1:3] -0.769 -0.374 1.09
$ n_members : int [1:3] 132 87 123
$ sse_within_total_total: num [1:3] 122 113 143
$ sse_total : num 1364
$ orig_labels : int [1:342] 1 1 1 1 1 1 1 1 2 1 ...
$ cluster_assignments : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 1 1 1 1 1 1 1 2 1 ...
Note: we need the stats engine to get the glance() options.
data(penguins) # refresh the dataset
k_func <- function(k = 3, mydata = penguins) {
workflow() |>
add_recipe(recipe(~ ., data = mydata) |> # recipe
step_naomit(all_predictors()) |>
update_role(island, new_role = "ID") |>
update_role(species, new_role = "ID") |>
update_role(sex, new_role = "ID") |>
step_rm(all_nominal()) |>
step_normalize(all_predictors())) |>
add_model(k_means(num_clusters = k) |> # model
set_engine("stats")) |>
fit(data = mydata |> drop_na()) # fit
}
kmax <- 9
penguin_kclusts <-
tibble(k = 1:kmax) |>
mutate(
penguin_kclust = map(k, k_func, mydata = penguins |> drop_na()),
tidied = map(penguin_kclust, tidy),
glanced = map(penguin_kclust, glance),
augmented = map(penguin_kclust, augment, penguins |> drop_na())
)
assignments <-
penguin_kclusts |>
unnest(cols = c(augmented))
clusterings <-
penguin_kclusts |>
unnest(cols = c(glanced))# via species
assignments |>
group_by(k) |>
select(.pred_cluster, species) |>
table() |>
as.data.frame() |>
ggplot() +
geom_tile(aes(x = .pred_cluster, y = species, fill = Freq)) +
facet_wrap( ~ k) + ylab("") +
scale_fill_gradient(low = "white", high = "red") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ggtitle("Species vs cluster prediction across different values of k")# via islands
assignments |>
group_by(k) |>
select(.pred_cluster, island) |>
table() |>
as.data.frame() |>
ggplot() +
geom_tile(aes(x = .pred_cluster, y = island, fill = Freq)) +
facet_wrap( ~ k) + ylab("") +
scale_fill_gradient(low = "white", high = "red") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ggtitle("Island vs cluster prediction across different values of k")Find the observations (data values!) \(m_k\) that solve (note, we have a new objective function!):
\[\text{arg}\,\min\limits_{C_1, \ldots, C_K} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k}d(x_i, m_k) \Bigg\}\]
Important: \(m_k\) is a data point. It is not an average of points or any other summary statistic.
Randomly assign a number, from 1 to \(K\), to each of the observations to serve as initial cluster assignments.
Iterate until the cluster assignments stop changing:
For each of the \(K\) clusters, find the medoid \(m_k\): the observation in the cluster with the smallest total dissimilarity to the cluster’s other members.
Assign each observation to the cluster whose medoid is closest.
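In practice, \(k\)-medoids is implemented by pam() in the cluster package. A minimal sketch, reusing the scaled matrix X from the k-means sketch above:

library(cluster)

pam_fit <- pam(X, k = 3)
pam_fit$medoids            # each medoid is an actual observation (a row of X)
pam_fit$silinfo$avg.width  # average silhouette width (see below)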
strengths
shortcomings
Silhouette Width (use the \(k\) with the largest average silhouette width) – good for PAM (because it uses any distance / dissimilarity)
Elbow plot (use the \(k\) at the elbow on a plot of \(k\) vs. within-cluster sum of squares) – good for \(k\)-means (because it is based on Euclidean distance); see the sketch below.
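Using the clusterings tibble built above (glance() on the stats engine returns tot.withinss, the total within-cluster sum of squares), the elbow plot is a short sketch:

clusterings |>
  ggplot(aes(x = k, y = tot.withinss)) +
  geom_line() +
  geom_point() +
  labs(y = "total within-cluster sum of squares",
       title = "Elbow plot: choose k near the bend")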
Consider observation \(i \in\) cluster \(C_1\). Let
\[d(i, C_k) = \mbox{average dissimilarity of } i \mbox{ to all objects in cluster } C_k\] \[a(i) = \mbox{average dissimilarity of } i \mbox{ to all other objects in } C_1.\] \[b(i) = \min_{C_k \ne C_1} d(i,C_k) = \mbox{distance to the next closest neighbor cluster}\] \[s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}\] \[\mbox{average}_{i \in C_1} s(i) = \mbox{average silhouette width for cluster } C_1\]
Note that if \(a(i) < b(i)\), then \(i\) is well classified (\(s(i) > 0\), with a maximum of 1). If \(a(i) > b(i)\), then \(i\) is poorly classified (\(s(i) < 0\), with a minimum of \(-1\)).
We are looking for a large silhouette width.
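The silhouette widths can be computed with cluster::silhouette(); a sketch, reusing pam_fit and X from above (any clustering vector and dissimilarity matrix will do):

library(cluster)

sil <- silhouette(pam_fit$clustering, dist(X))
summary(sil)$avg.width   # overall average silhouette width
plot(sil)                # one bar per observation, grouped by cluster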