class: right, top, my-title, title-slide # Clustering ### Jo Hardin ### November 18 & 23, 2021 --- # Agenda 11/18/21 1. Unsupervised learning 2. k-means clustering 3. k-medoids clustering --- ## Unsupervised learning > Grouping or categorizing observational units (objects) without any pre-assigned labels or scores (no outcome information!) --- ## Some examples: * Latent Dirichlet Allocation: <a href="https://ziqixiong.shinyapps.io/TopicModeling/" target = "_blank">Topic Modeling of TSL Articles</a> * Network Analysis: <a href = "https://espresso.economist.com/6412121cbb2dc2cb9e460cfee7046be2" target = "_blank">Political Books</a> * Network & Clustering: <a href = "http://varianceexplained.org/r/love-actually-network/" target = "_blank">Characters in 'Love Actually'</a> --- ## `\(k\)`-means Clustering `\(k\)`-means clustering is an unsupervised partitioning algorithm designed to find a partition of the observations such that the following objective function is minimized (find the smallest within cluster sum of squares): `$$\text{arg}\,\min\limits_{C_1, \ldots, C_k} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \overline{x}_{kj})^2 \Bigg\}$$` --- # Monsters clustering <div class="figure"> <img src="../images/kmeans.gif" alt="Monsters as cluster centers moving around throughout the k-means algorithm." /> <p class="caption">Artwork by @allison_horst.</p> </div> --- ## A fun applet!! https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ --- ****** **Algorithm**: `\(k\)`-Means Clustering ****** 1. Randomly assign a number, from 1 to `\(k\)`, to each of the observations. These serve as initial cluster assignments for the observations. 2. Iterate until the cluster assignments stop changing: (a) For each of the `\(k\)` clusters, compute the cluster centroid. The `\(k^{th}\)` cluster centroid is the vector of the `\(p\)` feature means for the observations in the `\(k^{th}\)` cluster. (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance). 3. Ties? Do something consistent: for example, leave in the current cluster. ****** --- ## Convergence? Yes! (local...) 1. If a point is "closer" to a different center, moving it will lower the objective function. 2. Averages minimize squared differences, so taking the new average will result in a lower objective function. 3. If a point is equidistant from two clusters, the point won't move. 4. The algorithm must converge in finite number of steps because there are finitely many points. --- ## Scaling .panelset[ .panel[.panel-name[raw data] .pull-left[ ```r norm_clust %>% kmeans(centers = 2) %>% augment(norm_clust) %>% ggplot() + geom_point(aes(x = x1, y = x2, color = .cluster)) + ggtitle("k-means (k=2) on raw data") ``` ] .pull-right[ ![](2021-11-18-clustering_files/figure-html/unnamed-chunk-6-1.png)<!-- --> ] ] .panel[.panel-name[scaled data] .pull-left[ ```r norm_clust %>% mutate(across(everything(), scale)) %>% kmeans(centers = 2) %>% augment(norm_clust) %>% ggplot() + geom_point(aes(x = x1, y = x2, color = .cluster)) + ggtitle("k-means (k=2) on raw data") ``` ] .pull-right[ ![](2021-11-18-clustering_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] ] ] --- ## k-Medoids Shortcomings of k-means * Center is calculated as average (establishes Euclidean distance) * Because center changes, distances must be re-calculated * Really, only Euclidean distance makes sense --- ## Partitioning Around Medoids (PAM) Find the observations (data values!) `\(m_k\)` that solve: `$$\text{arg}\,\min\limits_{C_1, \ldots, C_k} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k}d(x_i, m_k) \Bigg\}$$` --- ## Evaluating clustering (which `\(k\)`?) * Silhouette Width (use `\(k\)` with smallest silhouette width) * Elbow plot (use `\(k\)` at elbow on plot of `\(k\)` vs. within cluster sum of squares) --- ## Silhouette width Consider observation `\(i \in\)` cluster `\(C_1\)`. Let `$$d(i, C_k) = \mbox{average dissimilarity of } i \mbox{ to all objects in cluster } C_k$$` `$$a(i) = \mbox{average dissimilarity of } i \mbox{ to all objects in } C_1.$$` `$$b(i) = \min_{C_k \ne C_1} d(i,C_k) = \mbox{distance to the next closest neighbor cluster}$$` `$$s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}$$` `$$\mbox{average}_{i \in C_1} s(i) = \mbox{average silhouette width for cluster } C_1$$` Note that if `\(a(i) < b(i)\)` then `\(i\)` is well classified with a maximum `\(s(i) = 1\)`. If `\(a(i) > b(i)\)` then `\(i\)` is *not* well classified with a maximum `\(s(i) = -1\)`. --- # Agenda 11/23/21 1. distance metrics 2. hierarchical clustering --- ## Distance metric (mathematically) 1. `\(d({\bf x}, {\bf y}) \geq 0\)` 2. `\(d({\bf x}, {\bf y}) = d({\bf y}, {\bf x})\)` 3. `\(d({\bf x}, {\bf y}) = 0\)` iff `\({\bf x} = {\bf y}\)` 4. `\(d({\bf x}, {\bf y}) \leq d({\bf x}, {\bf z}) + d({\bf z}, {\bf y})\)` for all other vectors `\({\bf z}\)`. --- ## Distance measures (for clustering) * Euclidean Distance `$$d_E({\bf x}, {\bf y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}$$` * Pearson Correlation Distance `$$d_P({\bf x}, {\bf y}) = 1 - r_P ({\bf x}, {\bf y})$$` `$$\mbox{ or } d_P({\bf x}, {\bf y}) = 1 - |r_P ({\bf x}, {\bf y})|$$` `$$\mbox{ or } d_P({\bf x}, {\bf y}) = 1 - (r_P ({\bf x}, {\bf y}))^2$$` --- #### Correlation distance isn't a distance metric! ```r x1 <- c(1,2,3) x2 <- c(1, 4, 10) x3 <- c(9, 2, 2) # d(1,2) 1 - cor(x1, x2) ``` ``` ## [1] 0.018019 ``` ```r # d(1,3) 1 - cor(x1, x3) ``` ``` ## [1] 1.866 ``` ```r # d(2,3) 1 - cor(x2, x3) ``` ``` ## [1] 1.7559 ``` ```r # d(1,3) > d(1,2) + d(2,3) 1 - cor(x1, x2) + 1 - cor(x2, x3) ``` ``` ## [1] 1.7739 ``` --- #### Correlation distance isn't a distance metric! Using absolute distance doesn't fix things. ```r # d(1,2) 1 - abs(cor(x1, x2)) ``` ``` ## [1] 0.018019 ``` ```r # d(1,3) 1 - abs(cor(x1, x3)) ``` ``` ## [1] 0.13397 ``` ```r # d(2,3) 1 - abs(cor(x2, x3)) ``` ``` ## [1] 0.24407 ``` ```r # d(2,3) > d(1,2) + d(1,3) 1 - abs(cor(x1, x2)) + 1 - abs(cor(x1, x3)) ``` ``` ## [1] 0.15199 ``` --- ## Distance measures (for clustering) * Cosine Distance `$$d_C({\bf x}, {\bf y}) = \frac{{\bf x} \cdot {\bf y}}{|| {\bf x} || ||{\bf y}||}$$` `$$= \frac{\sum_{i=1}^p x_i y_i}{\sqrt{\sum_{i=1}^p x_i^2 \sum_{i=1}^p y_i^2}}$$` `$$= 1 - r_P ({\bf x}, {\bf y}) \ \ \ \ \mbox{if } \overline{\bf x} = \overline{\bf y} = 0$$` * Hamming Distance `\begin{align} d_H({\bf x}, {\bf y}) = \sum_{i=1}^p I(x_i \ne y_i) \end{align}` <div class="figure" style="text-align: center"> <img src="../images/hamdistGCTA.png" alt="The Hamming distance across the two DNA strands is 7." width="60%" /> <p class="caption">The Hamming distance across the two DNA strands is 7.</p> </div> --- ## `dist` function in R <div class="figure" style="text-align: center"> <img src="../images/distR.png" alt="The function `dist` in `R` calculates the distances given above." width="100%" /> <p class="caption">The function `dist` in `R` calculates the distances given above.</p> </div> --- #### String distances https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html <div class="figure" style="text-align: center"> <img src="../images/text-distance-infographics.png" alt="Comparison of string distance metrics from https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html." width="100%" /> <p class="caption">Comparison of string distance metrics from https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html.</p> </div> --- ## Hierarchical Clustering > is a set of nested clusters that are organized as a tree. Note that objects that belong to a child cluster also belong to the parent cluster. --- ****** **Algorithm**: Agglomerative Hierarchical Clustering Algorithm ****** 1. Begin with `\(n\)` observations and a measure (such as Euclidean distance) of all the `\({n \choose 2} = n(n-1)/2\)` pairwise dissimilarities. Treat each observation as its own cluster. 2. For `\(i = n, n - 1, \ldots , 2\)`: a. Examine all pairwise inter-cluster dissimilarities among the `\(i\)` clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed. b. Compute the new pairwise inter-cluster dissimilarities among the `\(i - 1\)` remaining clusters. ****** --- ## Definitions **Agglomerative** methods start with each object (e.g., gene, penguin, etc.) in its own group. Groups are merged until all objects are together in one group. **Divisive** methods start with all objects in one group and break up the groups sequentially until all objects are individuals. **Single Linkage** algorithm defines the distance between groups as that of the closest pair of individuals. **Complete Linkage** algorithm defines the distance between groups as that of the farthest pair of individuals. **Average Linkage** algorithm defines the distance between groups as the average of the distances between all pairs of individuals across the groups. --- ## Toy example of **Single Linkage Agglomerative Hierarchical Clustering** | | A | B | C | D | E | |---- |----- |----- |----- |---- |---- | | A | 0 | | | | | | B | 0.2 | 0 | | | | | C | 0.6 | 0.5 | 0 | | | | D | 1 | 0.9 | 0.4 | 0 | | | E | 0.9 | 0.8 | 0.5 | 0.3 | 0 | see class notes to walk through the process. --- ## Hierarchical clustering in R ```r penguins_h <- penguins %>% drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>% select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>% mutate(across(bill_length_mm:body_mass_g, scale)) penguin_hclust <- penguins_h %>% dist() %>% hclust(method = "complete") penguin_hclust ``` ``` ## ## Call: ## hclust(d = ., method = "complete") ## ## Cluster method : complete ## Distance : euclidean ## Number of objects: 342 ``` ```r penguin_dend <- as.dendrogram(penguin_hclust) ``` --- ## The basic dendrogram ```r plot(penguin_hclust) ``` ![](2021-11-18-clustering_files/figure-html/unnamed-chunk-16-1.png)<!-- --> --- ## Colored dendrogram ![](2021-11-18-clustering_files/figure-html/unnamed-chunk-17-1.png)<!-- -->