34 Unsupervised Learning: Dimensionality Reduction

Recall that in unsupervised data we are interested in characterizing patterns in predictor space where observation measurements are represented. Mathematically, we stated as an interest in characterizing \(p(X)\) over \(p\)-dimensional predictor space. Clustering methods assume that this space \(p(X)\) can be partitioned into subspaces containing “similar” observations. In dimensionality reduction, we assume that observations can be represented in a space with dimension much lower than \(p\). We will see two general strategies for dimensionality reduction: data transformations into spaces of smaller dimension that capture global properties of a data set \(X\), and data embeddings into lower dimensional spaces that retain local properties of a data set \(X\).

34.1 Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction method. The goal is to embed data in high dimensional space (e.g., observations with a large number of variables), onto a small number of dimensions. Note that its most frequent use is in EDA and visualization, but it can also be helpful in regression (linear or logistic) where we can transform input variables into a smaller number of predictors for modeling.

Mathematically, the PCA problem is:

Given: - Data set \(\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}\), where \(\mathbf{x}_i\) is the vector of \(p\) variable values for the \(i\)-th observation.

Return: - Matrix \(\left[ \phi_1, \phi_2, \ldots, \phi_p \right]\) of linear transformations that retain maximal variance.

You can think of the first vector \(\phi_1\) as a linear transformation that embeds observations into 1 dimension:

\[ Z_1 = \phi_{11}X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p \]

where \(\phi_1\) is selected so that the resulting dataset \(\{ z_1, \ldots, z_n\}\) has maximum variance. In order for this to make sense mathematically data has to be centered, i.e., each \(X_j\) has mean equal to zero and transformation vector \(\phi_1\) has to be normalized, i.e., \(\sum_{j=1}^p \phi_{j1}^2=1\). We can find \(\phi_1\) by solving an optimization problem:

\[ \max_{\phi{11},\phi_{21},\ldots,\phi_{p1}} \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \\ \mathrm{s.t.} \sum_{j=1}^p \phi_{j1}^2 = 1 \]

Conceptually this optimization problem says maximize variance but subject to normalization constraint. The second transformation \(\phi_2\) is obtained next solving a similar problem with the added constraint that \(\phi_2\) is orthogonal to \(\phi_1\). Taken together \(\left[ \phi_1, \phi_2 \right]\) define a pair of linear transformations of the data into 2 dimensional space.

\[ Z_{n\times 2} = X_{n \times p} \left[ \phi_1, \phi_2 \right]_{p \times 2} \]

Each of the columns of the \(Z\) matrix are called Principal Components. The units of the PCs are meaningless. In particular, comparing numbers across PCs doesn’t make mathematical sense.

In practice, one may also use a scaling transformation on the variables \(X_j\) to have unit variance. In general, if variables \(X_j\) are measured in different units (e.g, miles vs. liters vs. dollars), variables should be scaled to have unit variance. Conversely, if they are all measured in the same units, they should be scaled.

library(tidyverse)
library(readr)
library(lubridate)

datadir <- "data"
url <- "http://files.zillowstatic.com/research/public/Affordability_Wide_2017Q4_Public.csv"
filename <- basename(url)
datafile <- file.path(datadir, filename)

if (!file.exists(datafile)) {
  download.file(url, file.path(datadir, filename))
}

afford_data <- read_csv(datafile)

tidy_afford <- afford_data %>%
  filter(Index == "Mortgage Affordability") %>% 
  drop_na() %>%
  filter(RegionID != 0) %>%
  dplyr::select(RegionID, matches("^[1|2]")) %>%
  gather(time, affordability, matches("^[1|2]")) %>%
  type_convert(col_types=cols(time=col_date(format="%Y-%m")))

wide_afford_df <- tidy_afford %>%
  dplyr::select(RegionID, time, affordability) %>%
  spread(time, affordability)

value_mat <-  wide_afford_df %>%
  dplyr::select(-RegionID) %>%
  as.matrix()

pca_res <- prcomp(value_mat, scale=FALSE)
pca_au <- broom::augment(pca_res, wide_afford_df)
pca_d <- broom::tidy(pca_res, matrix="d")
pc_loading <- broom::tidy(pca_res, matrix="variables") %>%
  type_convert(col_types=cols(column=col_date("%Y-%m-%d"))) %>%
  mutate(PC=as.character(PC))
                            
pc_mean <- pca_res$center
pc_mean <- data_frame(column=names(pc_mean), PC="mean", value=pc_mean) %>%
  type_convert(col_types=cols(column=col_date("%Y-%m-%d")))

## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

pc_loading <- pc_mean %>%
  bind_rows(pc_loading)

Here we plot the mortgage affordability data embedded into the first two principal components. There are some time in these two components that may be treated as outliers. Also, a clustering analysis in this reduced space seems like a reasonable next step.

ggplot(pca_au, aes(.fittedPC1, .fittedPC2)) +
    geom_point(size=2) +
    labs(x="PC1", y="PC2")

A natural question that arises: How many PCs should we consider in post-hoc analysis? One result of PCA is a measure of the variance corresponding to each PC relative to the total variance of the dataset. From that we can calculate the percentage of variance explained for the \(m\)-th PC:

\[ PVE_m=\frac{\sum_{i=1}^n z_{im}^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2} \]

We can use this measure to choose number of PCs in an ad-hoc manner. In our case, using more than 10 or so PCs does not add information.

pca_d <- broom::tidy(pca_res, matrix="d")
pca_d %>%
  filter(PC <= 30) %>%
  ggplot(aes(PC, 100 * cumulative)) + 
    geom_line(size=1.32) +
    labs(x="PC", y="Pct. Variance Explained")

A useful rule of thumb: - If no apparent patterns in first couple of PCs, stop! - Otherwise, look at other PCs using PVE as guide.

There are bootstrap based methods to perform a statistically guided selection of the number of PCs. However, there is no commonly agreed upon method for choosing number of PCs used in practice, and methods are somewhat ad-hoc.

34.1.1 Solving the PCA

Algorithmically, the Principle Components solutions \(\phi\) are obtained from the singular value decomposition of observation matrix \(X_{n\times p}=UDV^T\), where matrices \(U\) and \(V\) are orthogonal matrices (\(U^TU=I\) and \(V^TV=I\)) called the left and right singular vectors respectively. \(D\) is a diagonal matrix with \(d_1 \geq d_2 \geq \ldots d_p \geq 0\). These are referred to as the singular values.

Using our previous notation \(V\) is the transformation matrix \(V=\left[\phi_1,\phi_2,\cdots,\phi_p \right]\). Principal components \(Z\) are given by the columns of \(UD\). Since \(U\) is orthogonal, \(d_j^2\) equals the variance of the \(j\)th PC.

From this observation we also see that we can write original observations \(x_i\) in terms of PCs \(z\) and transformations \(\phi\). Specifically \(x_i = z_{i1}\phi_1 + z_{i2}\phi_2 + \cdots + z_{ip} \phi_p\). We can think of the \(\phi_j\) vectors as a basis over which we can represent original observations \(i\).

For this reason, another useful post-hoc analysis is to plot the transformation vectors \(\phi_1, \phi_2, \ldots\). Here we plot the mean time series (since we center observations \(X\) before performing the embedding) along with the first three \(\phi_j\) vectors.

pc_loading %>%
  mutate(PC=forcats::fct_shift(factor(PC),-1)) %>%
  filter(PC %in% c("mean",1:3)) %>%
  ggplot(aes(column, value)) + 
    geom_line() +
    facet_wrap(~PC)

34.2 Multidimensional Scaling

Multidimensional scaling is a similar approach to PCA but looks at the task in a little different manner. Given observations \(x_1,\ldots,x_N\) in \(p\) dimensions, let \(d_{ij}\) be the distance between observations \(i\) and \(j\). We may also use this algorithm given distances initially instead of \(p\) dimensional observations. Multidimensional Scaling (MDS) seeks to find embeddings \(z_1, \ldots, z_N\) of \(k\) dimensions for which Euclidean distance (in \(k\) dimensional space) is close to the input distances \(d_{ij}\).

In least squares MDS, we can do this by minimizing

\[ S_M(z_1,\ldots,z_N) = \sum_{i\neq j} (d_{ij}- \|z_i - z_j\|)^2 \]

A gradient descent algorithm is used to minimize this function. A related method that tends to better capture small distances is given by the Sammon mapping:

\[ S_{S_m}(z_1,\ldots,z_N) = \sum_{i\neq j} \frac{(d_{ij}- \|z_i - z_j\|)^2}{d_{ij}} \]

Since MDS can use distances as input, it is suitable to use when distances between observations are all we have.

34.3 Summary

Principal Component Analysis is a conceptually simple but powerful EDA tool. It is very useful at many stages of analyses.

PCA interpretation can be very ad-hoc, however. It is part of large set of unsupervised methods based on matrix decompositions, including Kernel PCA, Non-negative Matrix Factorization and others.

Embedding methods seek to capture local properties of observations. A popular recent method is the t-SNE method.