27 Data Analysis with Geometry

A common situation in data analysis is that one has an outcome attribute (variable) \(Y\) and one or more independent covariate or predictor attributes \(X_1,\ldots,X_p\). One usually observes these variables for multiple “instances” (or entities).

(Note) As before, we use upper case to denote a random variable. To denote actual numbers we use lower case. One way to think about it: \(Y\) has not happened yet, and when it does, we see \(Y=y\).

One may be interested in various things:

What effects do the covariates have on the outcome?
How well can we describe these effects?
Can we predict the outcome using the covariates?, etc…

27.1 Motivating Example: Credit Analysis

default	student	balance	income
No	No	729.5265	44361.625
No	Yes	817.1804	12106.135
No	No	1073.5492	31767.139
No	No	529.2506	35704.494
No	No	785.6559	38463.496
No	Yes	919.5885	7491.559

Task: predict account default What is the outcome \(Y\)? What are the predictors \(X_j\)?

We will sometimes call attributes \(Y\) and \(X\) the outcome/predictors, sometimes observed/covariates, and even input/output. We may call each entity an observation or example. We will denote predictors with \(X\) and outcomes with \(Y\) (quantitative) and \(G\) (qualitative). Notice \(G\) are not numbers, so we cannot add or multiply them. We will use \(G\) to denote the set of possible values. For gender it would be \(G=\{Male,Female\}\).

27.2 From data to feature vectors

The vast majority of ML algorithms we see in class treat instances as “feature vectors”. We can represent each instance as a vector in Euclidean space \(\langle x_1,\ldots,x_p,y \rangle\). This means:

every measurement is represented as a continuous value
in particular, categorical variables become numeric (e.g., one-hot encoding)

Here is the same credit data represented as a matrix of feature vectors

default	student	balance	income
-1	1	1817.17118	24601.04
1	0	1975.65303	38221.84
-1	1	296.46276	20138.25
1	1	1543.09958	18304.65
1	0	2085.58698	35657.23
-1	0	59.96583	24889.25

27.3 Technical notation

Observed values will be denoted in lower case. So \(x_i\) means the \(i\)th observation of the random variable \(X\).
Matrices are represented with bold face upper case. For example \(\mathbf{X}\) will represent all observed predictors.
\(N\) (or \(n\)) will usually mean the number of observations, or length of \(Y\). \(i\) will be used to denote which observation and \(j\) to denote which covariate or predictor.
Vectors will not be bold, for example \(x_i\) may mean all predictors for subject \(i\), unless it is the vector of a particular predictor \(\mathbf{x}_j\).
All vectors are assumed to be column vectors, so the \(i\)-th row of \(\mathbf{X}\) will be \(x_i'\), i.e., the transpose of \(x_i\).

27.4 Geometry and Distances

Now that we think of instances as vectors we can do some interesting operations.

Let’s try a first one: define a distance between two instances using Euclidean distance

\[d(x_1,x_2) = \sqrt{\sum_{j=1}^p(x_{1j}-x_{2j})^2}\]

27.4.1 K-nearest neighbor classification

Now that we have a distance between instances we can create a classifier. Suppose we want to predict the class for an instance \(x\).

K-nearest neighbors uses the closest points in predictor space predict \(Y\).

\[ \hat{Y} = \frac{1}{k} \sum_{x_k \in N_k(x)} y_k. \]

\(N_k(x)\) represents the \(k\)-nearest points to \(x\). How would you use \(\hat{Y}\) to make a prediction?

An important notion in ML and prediction is inductive bias. What assumptions we make about our data that allow us to make predictions. In KNN, our inductive bias is that points that are nearby will be of the same class. Parameter \(K\) is a hyper-parameter, it’s value may affect prediction accuracy significantly.

Question: which situation may lead to overfitting, high or low values of \(K\)? Why?

27.4.2 The importance of transformations

Feature scaling is an important issue in distance-based methods. In the example below, which of these two features will affect distance the most?

## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

27.5 Quick vector algebra review

A (real-valued) vector is just an array of real values, for instance \(x = \langle 1, 2.5, −6 \rangle\) is a three-dimensional vector.
Vector sums are computed pointwise, and are only defined when dimensions match, so

\[\langle 1, 2.5, −6 \rangle + \langle 2, −2.5, 3 \rangle = \langle 3, 0, −3 \rangle\].

In general, if \(c = a + b\) then \(cd = ad + bd\) for all vectors \(d\).

Vector addition can be viewed geometrically as taking a vector \(a\), then tacking on \(b\) to the end of it; the new end point is exactly \(c\).

Scalar Multiplication: vectors can be scaled by real values;

\[2\langle 1, 2.5, −6 \rangle = \langle 2, 5, −12\rangle\]

In general, \(ax = \langle ax_1, ax_2, \ldots, ax_p\rangle\)

The norm of a vector \(x\), written \(\|x\|\) is its length.

Unless otherwise specified, this is its Euclidean length, namely:

\[\|x\| = \sqrt{\sum_{j=1}^p x_j^2}\]

27.5.1 Quiz

Write Euclidean distance of vectors \(u\) and \(v\) as a vector norm

The dot product, or inner product of two vectors \(u\) and \(v\) is defined as

\[u'v = \sum_{j=1}^p u_i v_i\]

A useful geometric interpretation of the inner product \(v'u\) is that it gives the projection of \(v\) onto \(u\) (when \(\|u\|=1\)).

27.6 The curse of dimensionality

Distance-based methods like KNN can be problematic in high-dimensional problems Consider the case where we have many covariates. We want to use \(k\)-nearest neighbor methods.
Basically, we need to define distance and look for small multi-dimensional “balls” around the target points. With many covariates this becomes difficult. Imagine we have equally spaced data and that each covariate is in \([0,1]\). We want to something like kNN with a local focus that uses 10% of the data in the local fitting. If we have \(p\) covariates and we are forming \(p\)-dimensional cubes, then each side of the cube must have size \(l\) determined by \(l \times l \times \dots \times l = l^p = .10\). If the number of covariates is p=10, then \(l = .1^{1/10} = .8\). So it really isn’t local! If we reduce the percent of data we consider to 1%, \(l=0.63\). Still not very local.

If we keep reducing the size of the neighborhoods we will end up with very small number of data points in each average and thus predictions with very large variance.

This is known as the curse of dimensionality. Because of this so-called curse, it is not always possible to use KNN. But other methods, like Decision Trees, thrive on multidimensional data.

27.7 Summary

We will represent many ML algorithms geometrically as vectors
Vector math review
K-nearest neighbors
The curse of dimensionality