33 Unsupervised Learning: Clustering

So far we have seen “Supervised Methods”, where our goal is to analyze a response (or outcome) based on various predictors. In many cases, especially for Exploratory Data Analysis, we want methods that extract patterns from variables without reference to a specific response. Methods for the latter case are called “Unsupervised Methods”. Examples are Principal Component Analysis and Clustering.

Interpretation of these methods is much more subjective than in Supervised Learning. For example, if we want to know whether a given predictor is related to the response, we can perform statistical inference using hypothesis testing. If we want to know which predictors are useful for prediction, we can use cross-validation to perform model selection. And if we want to see how well we can predict a specific response, we can use cross-validation to estimate test error. In unsupervised methods there is no similarly clean evaluation methodology. Nonetheless, they can be very useful for understanding the data at hand.

33.1 Motivating Example

Throughout this unit we will use a time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset contains affordability measurements for 81 counties with data from 1979 to 2017. Here we plot the time series of affordability for all counties.

A natural question to ask about this data is: can we partition counties into groups of counties with similar value trends across time?

33.2 Some Preliminaries

Mathematically, in the previous sections on “Supervised Learning” we were concerned with estimates that minimize some error function relative to the outcome of interest \(Y\):

\[ \mu(x) = \arg \min_{\theta} E_{Y|X} L(Y, \theta) \]

In order to do this, explicitly or not, the methods we were using were concerned with properties of the conditional probability distribution \(p(Y|X)\), without concerning themselves with the probability distribution \(p(X)\) of the covariates themselves.

In unsupervised learning, we are interested in properties of \(p(X)\). In our example, what can we say about the distribution of home value time series? Since the dimensionality of \(p(X)\) can be large, unsupervised learning methods seek to find structured representations of \(p(X)\) that would be possible to estimate.

In clustering we assume that predictor space is partitioned and that \(p(X)\) is defined over those partitions. In dimensionality reduction we assume that \(p(X)\) is really defined over a space of smaller dimension. We will start studying clustering first.

33.3 Cluster Analysis

The high-level goal of cluster analysis is to organize objects (observations) that are similar to each other into groups. We want objects within a group to be more similar to each other than objects in different groups.

Central to this high-level goal is how to measure the degree of similarity between objects. A clustering method then uses the similarity measure provided to it to group objects into clusters.

This plot shows the result of the k-means algorithm partitioning the data into 16 clusters. We observe that series within each cluster have similar trends. The darker series within each cluster shows the average time series within the cluster. We also see that some clusters correspond to very similar series, suggesting that partitioning the data into fewer clusters would be better.

33.4 Dissimilarity-based Clustering

For certain algorithms, instead of similarity we work with dissimilarity, often represented as distances. When we have observations defined over attributes, or predictors, we define dissimilarity based on these attributes.

Given measurements \(x_{ij}\) for observations \(i=1,\ldots,N\) over predictors \(j=1,\ldots,p\), suppose we define a per-predictor dissimilarity \(d_j(x_{ij}, x_{i'j})\). We can then define the dissimilarity between two objects as

\[ d(x_i, x_{i'}) = \sum_{j=1}^p d_j(x_{ij},x_{i'j}) \]

In the k-means algorithm, and many other algorithms, the most common choice is squared distance

\[ d_j(x_{ij},x_{i'j}) = (x_{ij}-x_{i'j})^2 \]

We can use different dissimilarities, for example

\[ d_j(x_{ij}, x_{i'j}) = |x_{ij}-x_{i'j}| \]

which may affect our choice of clustering algorithm later on.
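
As a small illustration, here is a sketch (in Python with NumPy; the function name `pairwise_dissimilarity` is our own) of computing the total dissimilarity \(d(x_i, x_{i'})\) for all pairs of observations by summing a per-predictor dissimilarity, once with squared differences and once with absolute differences.

```python
import numpy as np

def pairwise_dissimilarity(X, d_j):
    """Sum the per-predictor dissimilarity d_j over all p predictors
    for every pair of observations (rows of X)."""
    N = X.shape[0]
    D = np.zeros((N, N))
    for i in range(N):
        for i2 in range(N):
            D[i, i2] = np.sum(d_j(X[i], X[i2]))
    return D

# Example with 5 observations over 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

D_sq = pairwise_dissimilarity(X, lambda a, b: (a - b) ** 2)    # squared differences
D_abs = pairwise_dissimilarity(X, lambda a, b: np.abs(a - b))  # absolute differences
```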

For categorical variables, we could set

\[ d_j(x_{ij},x_{i'j}) = \begin{cases} 0 \; \textrm{if } x_{ij} = x_{i'j} \\ 1 \; \textrm{o.w.} \end{cases} \]

If the values of the categorical variable have an intrinsic similarity structure, we could generalize this using a symmetric matrix \(L\) with elements \(L_{rr'} = L_{r'r}\), \(L_{rr}=0\), and \(L_{rr'} \geq 0\) otherwise. This may of course lead to a dissimilarity that is not a proper distance.
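
For the categorical case, a minimal sketch (the level names and the particular matrix \(L\) below are made up for illustration):

```python
import numpy as np

levels = ["condo", "townhouse", "single_family"]   # hypothetical categorical values

def d_simple(a, b):
    # 0/1 dissimilarity: 0 if the two values match, 1 otherwise
    return 0.0 if a == b else 1.0

# Generalized version: symmetric matrix L with L[r, r] = 0 and L[r, r'] = L[r', r] >= 0
L = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

def d_general(a, b):
    return L[levels.index(a), levels.index(b)]

d_simple("condo", "condo")            # 0.0
d_general("condo", "townhouse")       # 0.5
d_general("condo", "single_family")   # 1.0
```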

33.5 K-means Clustering

A commonly used algorithm to perform clustering is the K-means algorithm. It is appropriate when using squared Euclidean distance as the measure of object dissimilarity.

\[ \begin{aligned} d(x_{i},x_{i'}) & = \sum_{j=1}^p (x_{ij}-x_{i'j})^2 \\ {} & = \|x_i - x_{i'}\|^2 \end{aligned} \]

K-means partitions observations into \(K\) clusters, with \(K\) provided as a parameter. Given some clustering, or partition, \(C\), the cluster assignment of observation \(x_i\) to cluster \(k \in \{1,\ldots,K\}\) is denoted as \(C(i)=k\). K-means seeks to minimize a clustering criterion measuring the dissimilarity of observations assigned to the same cluster.

\[ W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{i: \, C(i)=k} \sum_{i':\, C(i')=k} \|x_i - x_{i'}\|^2 \]

Note however, that this is equivalent to minimizing

\[ W(C) = \sum_{k=1}^K N_k \sum_{i:\,C(i)=k} \|x_i - \bar{x}_k\|^2 \]

where \(\bar{x}_k=(\bar{x}_{k1},\ldots,\bar{x}_{kp})\), \(\bar{x}_{kj}\) is the average of predictor \(j\) over the observations assigned to cluster \(k\), i.e., those with \(C(i)=k\), and \(N_k\) is the number of observations assigned to cluster \(k\). Thus the criterion to minimize is the total squared distance from each observation to the mean (centroid) of the cluster to which it is assigned.
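
A quick numerical check of this equivalence on made-up data (a sketch assuming NumPy; variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))             # 30 observations, 4 predictors
C = rng.integers(0, 3, size=30)          # an arbitrary assignment into K = 3 clusters

# Pairwise form: (1/2) * sum over clusters of all within-cluster squared distances
W_pairs = 0.0
for k in range(3):
    Xk = X[C == k]
    diffs = Xk[:, None, :] - Xk[None, :, :]
    W_pairs += 0.5 * np.sum(diffs ** 2)

# Centroid form: sum over clusters of N_k times squared distances to the cluster mean
W_centroid = 0.0
for k in range(3):
    Xk = X[C == k]
    W_centroid += len(Xk) * np.sum((Xk - Xk.mean(axis=0)) ** 2)

print(np.isclose(W_pairs, W_centroid))   # True
```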

An iterative algorithm is used to minimize this criterion (a minimal code sketch follows the steps below):

  1. Initialize by choosing \(K\) observations as centroids \(m_1,m_2,\ldots,m_K\)
  2. Assign each observation \(i\) to the cluster with the nearest centroid, i.e., set \(C(i)=\arg\min_{1 \leq k \leq K} \|x_i - m_k\|^2\)
  3. Update centroids \(m_k=\bar{x}_k\)
  4. Iterate steps 2 and 3 until convergence (i.e., until cluster assignments no longer change)
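
A bare-bones sketch of these steps (Python/NumPy; `kmeans` and its arguments are our own names, not a reference to any particular library):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K observations as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to the cluster with the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        C = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned observations
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[C == k]
            if len(members) > 0:          # keep the old centroid if a cluster empties out
                new_centroids[k] = members.mean(axis=0)
        # Step 4: repeat steps 2 and 3 until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return C, centroids

# Example: cluster 100 observations over 10 predictors into K = 4 clusters
X = np.random.default_rng(3).normal(size=(100, 10))
assignments, centers = kmeans(X, K=4)
```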

Here we illustrate the k-means algorithm over four iterations on our example data with \(K=4\). Each row in this plot is an iteration of the algorithm, and each column corresponds to a cluster. We observe that, in this case, there is little change in the cluster assignments and centroids beyond the first iteration.

Criterion \(W(C)\) is reduced in each iteration, so the algorithm is guaranteed to converge. However, since this is not a convex criterion, the clustering we obtain may not be globally optimal. To address this in practice, the algorithm is run with multiple initializations (step 1 above) and the best clustering achieved is used. The selection of observations as initial centroids can also be improved using the K-means++ algorithm:

  1. Choose an observation as centroid \(m_1\) uniformly at random
  2. To choose centroid \(m_k\), compute for each observation \(i\) not chosen as a centroid the distance to the nearest centroid \(d_i = \min_{1\leq l < k} \|x_i - m_l\|^2\)
  3. Set centroid \(m_k\) to an observation chosen at random with probability \(\frac{d_i}{\sum_{i'} d_{i'}}\)
  4. Iterate steps 2 and 3 until \(K\) centroids are chosen

This ensures that the initial centroids are well distributed across predictor space, which tends to yield a better clustering.
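
A sketch of this initialization (Python/NumPy; `kmeans_pp_init` is our own name), using the probability weights from step 3:

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    # Steps 2-3: each subsequent centroid is drawn with probability
    # proportional to its squared distance to the nearest centroid chosen so far
    while len(centroids) < K:
        M = np.array(centroids)
        d = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d / d.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# These centroids can then replace the random initialization in step 1 of K-means
```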

33.6 Choosing the number of clusters

The number of clusters \(K\) must be determined before running the K-means algorithm. As opposed to the supervised learning setting, where model selection is performed by minimizing expected prediction error through, for example, cross-validation, there is no clean, direct method for choosing the number of clusters to use in the K-means algorithm.

Furthermore, looking at criterion \(W(C)\) alone is not sufficient, as the criterion becomes smaller as the value of \(K\) is increased. Here we show the behavior of \(W_K(C)\) in our example dataset. However, there are properties of this statistic that may be used to heuristically choose a proper value of \(K\).

The intuition behind the method is that, supposing there is a true underlying number \(K^*\) of clusters in the data, improvement in the \(W_K(C)\) statistic will be fast for values of \(K \leq K^*\) and slower for values of \(K > K^*\). In the first case, at least one cluster will contain observations belonging to two of the true underlying clusters, and will therefore have poor within-cluster similarity. As \(K\) is increased, these observations can be separated into their own clusters, providing a sharp improvement in the \(W_K(C)\) statistic. On the other hand, for values of \(K > K^*\), observations belonging to a single true cluster are split across multiple clusters, all with generally high within-cluster similarity, so splitting these clusters further does not improve the \(W_K(C)\) statistic very sharply. The curve will therefore have an inflection point (an “elbow”) around \(K^*\). We see this behavior in our example plot, where improvement in \(W_K(C)\) slows after \(K=3\).
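
One way to examine this curve in practice is to run K-means for a range of values of \(K\) and record the within-cluster sum of squares. The sketch below uses scikit-learn's `KMeans`, whose `inertia_` attribute is the (unweighted) within-cluster sum of squares; it differs from \(W_K(C)\) by the \(N_k\) weighting but exhibits the same elbow behavior. The data here is a random stand-in, not the affordability dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))        # stand-in for the affordability time series matrix

Ks = range(1, 11)
W = []
for K in Ks:
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    W.append(km.inertia_)            # total within-cluster sum of squared distances

# Plotting Ks against W and looking for the point where the curve flattens
# (the "elbow") gives a heuristic choice of K
```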

The gap statistic is used to identify the inflection point in the curve. It compares the behavior of the \(W_K(C)\) statistic based on the data with the behavior of the \(W_K(C)\) statistic for data generated uniformly at random over the range of the data. Below we plot the gap statistic for our example data. For this dataset, the gap statistic suggests that there is no clear cluster structure and therefore \(K=1\) is an optimal choice.
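
A simplified sketch of the idea behind the gap statistic (comparing \(\log W_K\) on the data to its average over uniform reference datasets; this is our own illustration, not the exact procedure used to produce the plot above):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, K_max=10, B=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for K in range(1, K_max + 1):
        # log W_K on the observed data
        logW = np.log(KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_)
        # average log W_K over B datasets drawn uniformly over the range of each predictor
        logW_ref = [
            np.log(KMeans(n_clusters=K, n_init=10, random_state=0)
                   .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
            for _ in range(B)
        ]
        gaps.append(np.mean(logW_ref) - logW)
    return np.array(gaps)

# Larger gaps indicate stronger evidence for that value of K; if the gap is largest
# at K = 1, the data shows no clear cluster structure
```

The full procedure also uses the standard deviation of the reference values to choose the smallest \(K\) whose gap is within one standard error of the next; the sketch above omits that refinement.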

33.7 Summary

Clustering methods are intuitive methods for understanding structure within unlabeled observations.

The K-means algorithm is a frequently used, easy-to-implement, and easy-to-understand algorithm for clustering based on Euclidean distance between data entities.