Clustering
Big Picture
The clustering algorithms can be broadly split into two categories depending on whether the number of clusters is given or to be determined by user. Partitional ones pre-set the number of clusters; while hierarchical ones output a dendrogram that illustrates how clusters are built level by level. Users are free to choose the level of clustering in this hierarchical clustering.
Hierarchical algorithms
Bottom-up agglomerative clustering
This approach starts with each object in a separate cluster, and repeatedly
- joins the most similar pair of clusters,
- update the similarity of the new cluster to others until there is only one cluster.
There are five further methods in this approach. The difference among them lies in the way to measure inter-cluster similarity.
- Single-linkage measures the similarity between two clusters as the distance between their closest members.
- Complete-linkage measures the similarity between two clusters as the distance between their furthest members.
- Average-linkage measures the similarity between two clusters as the average of distances of all the cross-cluster pairs.
- Centroid measures the similarity between two clusters as the distance between their centers.
- DBSCAN
Top-down divisive clustering
This approach starts with all the data in a single cluster, and repeatedly split each cluster into two using a partition algorithm until each object is in a separate cluster.
Partitional algorithms
- K-means
- Gaussian mixture model
- Soft K-means
Algorithms
Given dataset \(\mathcal D = \{x^{(i)}, i=1,\dots,M\}\), a clustering \(\mathcal C\) of the \(M\) points into \(K (K \le M)\) clusters is a partition of \(\mathcal D\) into \(K\) disjoint groups \(\{C_1,\dots,C_K\}\). Suppose we have a function \(f\) that evaluates the clustering \(\mathcal C\) and returns lower score with better clustering. The best clustering will be \[ \arg \min_{\mathcal C} f(\mathcal C) \] The number of all possibilities of clustering with \(M\) elements is called the Bell number, denoted as \(B_M\). The calculation of the Bell number is based on dynamic programming. The number of ways to cluster \(M+1\) elements is the sum of number of ways to:
- select \(1\) element and cluster it, with the rest belong to one single cluster
- select \(2\) elements and cluster them, with the rest belong to one single cluster
- …
- select \(M\) elements and cluster them, with the rest belong to one single cluster
Therefore, \[ \begin{gather} B_{M+1} = \sum_{i=1}^M\binom{M}{0}B_i \\ B_0 = 1 \end{gather} \] It grows exponentially with \(M\) and thus the exhaustive method will be computationally intractable. We need either an approximation algorithm or a scoring function with special properties.
K-means
K-means assumes there are \(K\) clusters. This greatly eliminates many possibilities described above. Its objective is \[ \min_{c_1,\dots,c_K}\sum_{i=1}^M||x^{(i)} - c_{z^{(i)}}||^2_2, \text{ where }z^{(i)} = \arg \min_{j \in \{1,\dots,K\}}||x^{(i)}-c_j||^2_2 \] \(z^{(i)}\) is the cluster index to which \(x^{(i)}\) is assigned. K-means’ objective is to assign each point to its closest cluster center and minimize the total within-cluster square errors. For cluster \(j\), let \(C_j = \{x^{(i)}|z^{(i)} = j\}\) be the set of points assigned to it, then the cluster center of cluster \(j\) is \[ c_j = \frac{1}{|C_j|}\sum_{x^{(i)} \in C_j}x^{(i)} \] However, both the cluster center \(\newcommand{\c}{\mathrm{c}} \c\) and the assignment \(\newcommand{\z}{\mathrm{z}} \z\) is initially unknown. K-means solves this by randomly pick up initial cluster centers and enter the assign-data-to-clusters/update-cluster-centers loop, until the cluster centers converge or become satisfactory. Rewrite the objective of K-means as:
\[ \min_{\z,\c}(l(\z,\c) \coloneq \sum_{i=1}^M||x^{(i)}- c_{z^{(i)}}||^2_2) \] K-means is in essence a coordinate descent of the loss \(l(\z,\c)\). The main loop of K-means is to:
- assign data points to its nearest cluster center, i.e. minimizing over the assignment \(\z\).
- update cluster centers according to the points assigned to, i.e. minimizing over the centroids \(\c\).
\(l\) is monotonically decreasing after each step in the above loop. Also, \(l\) is lower-bounded by \(0\). Therefore, \(l\) and thus K-means will converge finally.
K-means clustering produces a Voronoi diagram which consists of linear decision boundaries.
For more discussion and an interesting image compression method with K-means, please refer here.
Gaussian Mixture Model
A cluster can also be modelled by a multi-variate Gaussian with elliptical shape: the elliptical shape is controlled by the covariance matrix; the location is controlled by the mean. Gaussian mixture model is a weighted sum of, say \(K\), Gaussian distributions: \[ p(x) = \sum_{j=1}^K\pi_j\mathcal N(x;\mu_j, \Sigma_j) \] \(\pi_j\) is the prior that a sample is generated from the \(j\)-th Gaussian. \(\mathcal N(x;\mu_j, \Sigma_j)\) is the probability to generate the sample \(x\) from the \(j\)-th Gaussian. Putting it together, \(p(x)\) is the total probability of \(x\) over its latent \(z\), with \(z = j\) representing the prior condition that \(x\) is sampled from \(j\)-th Gaussian. \[ p(x) = \sum_{j=1}^Kp(x, z = j) = \sum_{j=1}^K p(z = j) p(x|z = j) \] GMM is learned by \[ \begin{gather} \max_{\pi,\mu,\Sigma}\Big(L(\pi,\mu,\Sigma) \triangleq \prod_{i=1}^Mp(x^{(i)})\Big) \\ s.t.\quad \sum_{j=1}^K\pi_j = 1 \end{gather} \] The log-likelihood function form will be \[ \begin{gather} \max_{\pi,\mu,\Sigma}\Big(l(\pi,\mu,\Sigma) \triangleq \log L(\pi,\mu,\Sigma)\Big) \\ s.t.\quad \sum_{j=1}^K\pi_j = 1 \end{gather} \]
Attempt with Lagrange Multiplier
\[ \begin{gather} L(\pi,\mu,\Sigma) = \prod_{i=1}^Mp(x^{(i)}) = \prod_{i=1}^M\sum_{j=1}^Kp(z_j)p(x|z = z_j) = \prod_{i=1}^M\sum_{j=1}^K\pi_j\mathcal N(x^{(i)};\mu_j, \Sigma_j) \\ l(\pi,\mu,\Sigma) = \log\big(\prod_{i=1}^M\sum_{j=1}^K\pi_j\mathcal N(x^{(i)};\mu_j, \Sigma_j)\big) = \sum_{i=1}^M\log\sum_{j=1}^K\pi_j\mathcal N(x^{(i)};\mu_j, \Sigma_j) \end{gather} \] The Lagrangian function will be \[ J(\pi,\mu,\Sigma,\lambda) = -\sum_{i=1}^M\log\sum_{j=1}^K\pi_j\mathcal N(x^{(i)};\mu_j, \Sigma_j) + \lambda(\sum_{j=1}^K\pi_j - 1) \] This expression is just too hard to zero the derivatives. For example, writing \(\mathcal N(x^{(i)};\mu_j, \Sigma_j)\) as \(z^{(i)}_j\), then take derivative w.r.t. \(\pi\) and make it zero to give \[ \begin{align} \frac{\partial J}{\partial \pi_j} &= -\sum_{i=1}^M\frac{\mathcal N(x^{(i)};\mu_j, \Sigma_j)}{\sum_{k=1}^K\pi_k\mathcal N(x^{(i)};\mu_j, \Sigma_j)} + \lambda \\ &\Downarrow \\ \lambda &= \sum_{i=1}^M\frac{\mathcal N(x^{(i)};\mu_1, \Sigma_1)}{\sum_{k=1}^K\pi_k\mathcal N(x^{(i)};\mu_1, \Sigma_1)} \\ \lambda &= \sum_{i=1}^M\frac{\mathcal N(x^{(i)};\mu_2, \Sigma_2)}{\sum_{k=1}^K\pi_k\mathcal N(x^{(i)};\mu_2, \Sigma_2)} \\ &\dots \\ \lambda &= \sum_{i=1}^M\frac{\mathcal N(x^{(i)};\mu_K, \Sigma_K)}{\sum_{k=1}^K\pi_k\mathcal N(x^{(i)};\mu_K, \Sigma_K)}\\ \end{align} \]
By far, the whole expression is far too complicated for us to continue…
Attempt with Expectation Maximization
\[ \begin{gather} L(\pi,\mu,\Sigma) = \prod_{i=1}^M p(x^{(i)}) = \prod_{i=1}^M \sum_{j=1}^K p(x^{(i)}, z^{(i)} = j) \\ l(\pi,\mu,\Sigma) = \log\big(\prod_{i=1}^M\sum_{j=1}^Kp(x^{(i)}, z^{(i)} = j)\big) = \sum_{i=1}^M\log\sum_{j=1}^Kp(x^{(i)}, z^{(i)} = j) \end{gather} \] By Jensen’s inequality, if \(\forall i \in \{ 1,\dots,M \}, j \in \{1,\dots,K\}, \alpha_j^{(i)} \ge 0\) and \(\sum_{j=1}^K\alpha_j^{(i)} = 1\), \[ \begin{aligned} l(\pi,\mu,\Sigma) &= \sum_{i=1}^M \log \sum_{j=1}^Kp(x^{(i)}, z^{(i)} = j) = \sum_{i=1}^M \log \sum_{j=1}^K \alpha^{(i)}_j \frac{p(x^{(i)}|z^{(i)} = j)p(z^{(i)} = j)}{\alpha^{(i)}_j} \\ &\ge B(\alpha,\pi,\mu,\Sigma) := \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \frac{\mathcal N(x^{(i)};\mu_j, \Sigma_j) \pi_j}{\alpha^{(i)}_j} \\ \end{aligned} \] \(B\) is the lower bound of \(l\). If we can enlarge \(B\), we can gently guarantee that \(l\) is increasing in this process. As illustrated in expectation maximization, \(\alpha^{(i)}_j\) is chosen to be \(p(z^{(i)}=j | x^{(i)}; \pi^t, \mu^t,\Sigma^t)\), where \(\pi^t, \mu^t,\Sigma^t\) are estimations in current iteration. With \(\alpha\) being fixed, \[ \begin{aligned} B(\alpha,\pi,\mu,\Sigma) = \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \mathcal N(x^{(i)};\mu_j, \Sigma_j) \pi_j - \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \alpha^{(i)}_j \\ \end{aligned} \] Grouping terms in \(B\) that are relevant to \(\pi\), we are to \[ \begin{gather} \max_\pi \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \pi_j \\ s.t.\quad \sum_{i=1}^K \pi_j = 1 \end{gather} \] We do this by Lagrangian multipliers: \[ J(\pi, \lambda) = \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \pi_j + \lambda(1 - \sum_{i=1}^K\pi_i ) \] Take derivative w.r.t. \(\pi_k\) and make it zero to give \[ \frac{\partial J}{\partial \pi_k} =\sum_{i=1}^M \frac{\alpha^{(i)}_k}{\pi_k} - \lambda \\ \Downarrow \\ \pi_k^{t+1} = \frac{\sum_{i=1}^M\alpha^{(i)}_k}{\lambda} \] With the knowledge that \(\sum_{j=1}^K \pi_j = 1, \sum_{i=1}^M\alpha^{(i)}_j = 1\), we have \[ \pi_k^{t+1} = \frac{\sum_{i=1}^M\alpha^{(i)}_k}{M} \]
Take derivative w.r.t. \(\mu_k\) and make it zero to give \[ \begin{aligned} \frac{\partial B}{\partial \mu_k} &= \frac{\partial \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \mathcal N(x^{(i)};\mu_j, \Sigma_j)}{\partial \mu_k} \\ &= \frac{-\frac{1}{2} \partial \sum_{i=1}^M \alpha^{(i)}_k (x^{(i)} - \mu_k)^T \Sigma_k^{-1} (x^{(i)} - \mu_k)}{\partial \mu_k} \\ &= \sum_{i=1}^M \alpha^{(i)}_k \Sigma_k^{-1} (x^{(i)} - \mu_k) \\ \end{aligned} \]
\[ \begin{aligned} \sum_{i=1}^M \alpha^{(i)}_k \Sigma_k^{-1} (x^{(i)} - \mu_k) &= 0 \\ \sum_{i=1}^M \alpha^{(i)}_k (x^{(i)} - \mu_k) &= 0, \text{ by the invertibility of $\Sigma_j^{-1}$} \\ \Downarrow \\ \mu_k^{t+1} &= \frac{\sum_{i=1}^M \alpha^{(i)}_k x^{(i)}}{M} \end{aligned} \]
Take derivative w.r.t. \(\Sigma_k\) and make it zero to give \[ \begin{aligned} \frac{\partial B}{\partial \Sigma_k} &= \frac{\partial \sum_{i=1}^M \sum_{j=1}^K \alpha^{(i)}_j \log \mathcal N(x^{(i)};\mu_j, \Sigma_j)}{\partial \Sigma_k} \\ &= \frac{-\frac{1}{2} \partial \sum_{i=1}^M \alpha^{(i)}_k ((x^{(i)} - \mu_k)^T \Sigma_k^{-1} (x^{(i)} - \mu_k) + \log|\Sigma_k|)}{\partial \Sigma_k} \\ &= -\frac{1}{2} \sum_{i=1}^M \alpha^{(i)}_k (-\Sigma_k^{-1} (x^{(i)} - \mu_k) (x^{(i)} - \mu_k)^T \Sigma_k^{-1} + \Sigma_k^{-1}) \\ \end{aligned} \]
\[ \begin{gather} \begin{aligned} -\frac{1}{2} \sum_{i=1}^M \alpha^{(i)}_k (-\Sigma_k^{-1} (x^{(i)} - \mu_k) (x^{(i)} - \mu_k)^T \Sigma_k^{-1} + \Sigma_k^{-1}) &= 0 \\ \sum_{i=1}^M \alpha^{(i)}_k (-\Sigma_k^{-1} (x^{(i)} - \mu_k) (x^{(i)} - \mu_k)^T + I) &= 0 \\ \end{aligned} \\ \Downarrow \\ \Sigma_k^{t+1} = \frac{\sum_{i=1}^M \alpha^{(i)}_k (x^{(i)} - \mu_k) (x^{(i)} - \mu_k)^T}{\sum_{i=1}^M \alpha^{(i)}_k} \\ \end{gather} \]
Soft K-means
\(\alpha^{(i)}_j\) in GMM, which measures how likely the sample \(i\) is generated from \(j\)-th Gaussian, can be viewed as a contribution of sample \(i\) to the cluster \(j\). That is, a sample can contribute different portions to different clusters and these portions add up to \(1\).
In the case where \(\pi_j = \frac{1}{K}, \Sigma_k = I\), \[ \begin{gather} p(x^{(i)}|z^{(i)} = j) = \mathcal N(x^{(i)};\mu_j, I) = \frac{1}{\sqrt{(2\pi)^N}}\exp(-\frac{1}{2}||x^{(i)}-\mu_j||^2_2) \\ \alpha^{(i)}_j = p(z^{(i)} = j | x^{(i)}) = \frac{p(x^{(i)}|z^{(i)} = j)p(z^{(i)} = j)}{p(x^{(i)})} = \frac{\mathcal N(x^{(i)};\mu_j, I) \frac{1}{K}}{\frac{1}{K} \sum_{k=1}^K \mathcal N(x;\mu_k, I)} = \frac{\exp(-\frac{1}{2}||x^{(i)}-\mu_j||^2_2)}{\sum_{k=1}^K \exp(-\frac{1}{2}||x^{(k)}-\mu_j||^2_2)} \end{gather} \] The “soft K-means” comes from the softmax of the Euclidean distance.
External Material
Gaussian Mixture Models vs K-Means. | by K.Kubara | Towards Data Science Exploring Clustering Algorithms: Explanation and Use Cases (neptune.ai)