# 29 Linear models for classification

The general classification setting is: can we predict categorical response/output \(Y\), from set of predictors \(X_1,X_2,\ldots,X_p\)? As in the regression case, we assume training data \((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\). In this case, however, responses \(y_i\) are categorical and take one of a fixed set of values.

## 29.1 An example classification problem

An individual’s choice of transportation mode to commute to work. Predictors: income, cost and time required for each of the alternatives: driving/carpooling, biking, taking a bus, taking the train. Response: whether the individual makes their commute by car, bike, bus or train.

From a classification model based on this data we could perform an inference task: how do people value price and time when considering their transportation choice.

## 29.2 Why not linear regression?

In our previous unit we learned about linear regression. Why can’t we use linear regression in the classification setting. For categorical responses with more than two values, if order and scale (units) don’t make sense, then it’s not a regression problem

\[ Y = \begin{cases} 1 & \textrm{if } \mathtt{stroke} \\ 2 & \textrm{if } \mathtt{drug overdose} \\ 3 & \textrm{if } \mathtt{epileptic seizure} \end{cases} \]

For **binary** responses, it’s a little better:

\[ Y = \begin{cases} 0 & \textrm{if } \mathtt{stroke} \\ 1 & \textrm{if } \mathtt{drug overdose} \\ \end{cases} \]

We could use linear regression in this setting and *interpret* response \(Y\) as a probability (e.g, if \(\hat{y} > 0.5\) predict \(\mathtt{drug overdose}\))

## 29.3 Classification as probability estimation problem

This observation motivates how we will address the classification problem in general. Instead of modeling classes 0 or 1 directly, we will model the conditional class probability \(p(Y=1|X=x)\), and classify based on this probability. In general, classification approaches use *discriminant* (think of *scoring*) functions to do classification. *Logistic regression* is **one** way of estimating the class probability \(p(Y=1|X=x)\) (also denoted \(p(x)\))

## 29.4 Logistic regression

The basic idea behind *logistic regression* is to build a **linear** model *related* to \(p(x)\), since linear regression directly (i.e. \(p(x) = \beta_0 + \beta_1 x\)) doesn’t work. Why?

Instead we build a linear model of *log-odds*:

\[ \log \frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x \]

Odds are equivalent to ratios of probabilities. For example, “two to one odds that Serena Williams wins the French Open” means “the probability that Serena Williams wins the French Open is double the probability he loses”. So, if odds = 2, \(p(x)=2/3\). If odds = 1/2, \(p(x)=1/3\). In general odds = \(\frac{p(x)}{1-p(x)}\).

### 29.4.1 Exercises

Suppose an individual has a 16% chance of defaulting on their credit card payment. What are the odds that she will default?

On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

Here is how we compute a logistic regression model in R

```
library(ISLR)
library(dplyr)
library(broom)
library(ggplot2)
data(Default)
default_fit <- glm(default ~ balance, data=Default, family=binomial)
default_fit %>%
tidy() %>%
knitr::kable(digits=4)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | -10.6513 | 0.3612 | -29.4922 | 0 |

balance | 0.0055 | 0.0002 | 24.9531 | 0 |

Interpretation of logistic regression models is slightly different than the linear regression model we looked at. In this case, the **odds** that a person defaults increase by \(e^{0.05} \approx 1.051\) for every dollar in their account balance. As before, the **accuracy** of \(\hat{\beta}_1\) as an estimate of the **population** parameter is given its standard error. We can again construct a confidence interval for this estimate as we’ve done before.

As before, we can do hypothesis testing of a relationship between account balance and the probability of default. In this case, we use a \(Z\)-statistic \(\frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}\) which plays the role of the t-statistic in linear regression: a scaled measure of our estimate (signal / noise). As before, the P-value is the probability of seeing a Z-value as large (e.g., 24.95) under the null hypothesis that **there is no relationship between balance and the probability of defaulting**, i.e., \(\beta_1=0\) in the population.

In accordance to the “inverse problem” view we’ve been developing in class, we require an algorithm required to *estimate* parameters \(\beta_0\) and \(\beta_1\) according to a data fit criterion. In logistic regression we use the **Bernoulli** probability model we saw previously (think of flipping a coin weighted by \(p(x)\)), and *estimate* parameters to **maximize** the *likelihood* of the observed training data under this coin flipping (binomial) model. Usually, we do this by *minimizing* the negative of the log likelihood of the model. I.e.: solve the following optimization problem

\[ \min_{\beta_0, \beta_1} \sum_{i:\, y_i=1} -y_i f(x_i) + \log (1+e^{f(x_i)}) \]

where \(f(x_i) = \beta_0 + \beta_1 x_i\). This is a non-linear (but convex) optimization problem. We will see later how to use gradient descent to find these parameters. You can learn more about algorithms to solve it in “Computational Methods” class (CMSC 460)

### 29.4.2 Making predictions

We can use a learned logistic regression model to make predictions. E.g., “on average, the probability that a person with a balance of $1,000 defaults is”:

\[ \hat{p}(1000) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \times 1000}}{1+e^{\beta_0 + \beta_1 \times 1000}} \approx \frac{e^{-10.6514 + 0.0055 \times 1000}}{1+e^{-10.6514 + 0.0055 \times 1000}} \\ \approx 0.00576 \]

### 29.4.3 Multiple logistic regression

This is a classification analog to linear regression:

\[ \log \frac{p(\mathbf{x})}{1-p(\mathbf{x})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]

```
fit <- glm(default ~ balance + income + student, data=Default, family="binomial")
fit %>%
tidy() %>%
knitr::kable(digits=4)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | -10.8690 | 0.4923 | -22.0801 | 0.0000 |

balance | 0.0057 | 0.0002 | 24.7376 | 0.0000 |

income | 0.0000 | 0.0000 | 0.3698 | 0.7115 |

studentYes | -0.6468 | 0.2363 | -2.7376 | 0.0062 |

As in multiple linear regression it is essential to avoid **confounding!**. Consider an example of single logistic regression of default vs. student status:

```
fit1 <- glm(default ~ student, data=Default, family="binomial")
fit1 %>% tidy() %>%
knitr::kable(digits=4)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | -3.5041 | 0.0707 | -49.5542 | 0e+00 |

studentYes | 0.4049 | 0.1150 | 3.5202 | 4e-04 |

and a multiple logistic regression:

```
fit2 <- glm(default ~ balance + income + student, data=Default, family="binomial")
fit2 %>% tidy() %>%
knitr::kable(digits=4)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | -10.8690 | 0.4923 | -22.0801 | 0.0000 |

balance | 0.0057 | 0.0002 | 24.7376 | 0.0000 |

income | 0.0000 | 0.0000 | 0.3698 | 0.7115 |

studentYes | -0.6468 | 0.2363 | -2.7376 | 0.0062 |

### 29.4.4 Exercise

- Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients, \(\hat{\beta}_0=-6, \hat{\beta}_1=0.05,\hat{\beta}_2=1\).

Estimate the probability that a student who studies for 40h and has an undergraduate GPA of 3.5 gets an A in the class.

- With estimated parameters from previous question, and GPA of 3.5 as before, how many hours would the student need to study to have a 50% chance of getting an A in the class?

## 29.5 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a different linear method to estimate a probability model used for classification. Recall that we want to partition data based on **class probability**: e.g., *find the \(\mathbf{X}\) for which* \(P(\mathrm{default=Yes}|X) > P(\mathrm{default=No}|X)\). In logistic regression, **we made no assumption about \(\mathbf{X}\)**. In other cases, we **can** make assumptions about \(\mathbf{X}\) that improve prediction performance (if assumptions hold, obviously)

This suggests we can model `balance`

for each of the classes with a normal distribution. WARNING, BIG ASSUMPTION: We will assume `balance`

has the same *variance* for both classes (this is what makes LDA linear).
So, we estimate average `balance`

for people who *do not* default:

\[ \hat{\mu}_0 = \frac{1}{n_0} \sum_{i:\, y_i=0} x_i \]

for people who do default:

\[ \hat{\mu}_1 = \frac{1}{n_1} \sum_{i:\, y_i=1} x_i \]

and estimate variance for both classes as

\[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{k=1,2} \sum_{i:\, y_i=k} (x_i - \hat{\mu}_k)^2 \]

```
## # A tibble: 2 x 2
## default balance_mean
## <fct> <dbl>
## 1 No 804.
## 2 Yes 1748.
```

```
## # A tibble: 1 x 1
## balance_sd
## <dbl>
## 1 453.
```

We can “score” values of `balance`

based on these estimates:

\[ f_k(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left(-\frac{1}{2\sigma^2} (x-\mu_k)^2 \right) \]

Remember, what we want is **posterior class probability** \(p(Y=k|X)\), for that we need to include the probability that we *observe* class \(k\). This is called **prior class probability**, denoted \(\pi_k\), means the proportion of times you expect people to default regardless of any other attribute. We can estimate from training data as the proportion of observations with label \(k\). Bayes’ Rule (or Theorem) gives us a way of computing \(P(Y=k|X)\) using score \(f_k(x)\) (from the class normal assumption) and prior \(\pi_k\):

\[ P(Y=k|X) = \frac{f_k(x) \pi_k}{\sum_l f_l(x) \pi_l} \]

If data (conditioned by class) is distributed so that \(f_k\) is the right probability function to use, then predicting the class that maximizes \(P(Y=k|X)\) is the **optimal** thing to do. This is referred to the *Bayes classifier* (aka the Holy Grail of classification)

### 29.5.1 How to train LDA

Compute class means and squared error based on class mean

```
lda_stats <- Default %>%
group_by(default) %>%
mutate(class_mean=mean(balance),
squared_error=(balance-class_mean)^2)
```

Compute class sizes and sum of squared errors

```
lda_stats <- lda_stats %>%
summarize(class_mean=first(class_mean),
class_size=n(),
sum_squares=sum(squared_error))
```

Compute class prior and variance (note same variance for both classes)

```
lda_stats <- lda_stats %>%
mutate(class_prior=class_size/sum(class_size),
sigma2=sum(sum_squares) / (sum(class_size) - 2)) %>%
dplyr::select(default, class_mean, class_prior, sigma2)
knitr::kable(lda_stats)
```

default | class_mean | class_prior | sigma2 |
---|---|---|---|

No | 803.9438 | 0.9667 | 205318.6 |

Yes | 1747.8217 | 0.0333 | 205318.6 |

How do we predict with LDA? Predict `Yes`

if \(P(Y=1|X) > P(Y=0|X)\)

Equivalently:

\[ \log{\frac{P(Y=1|X)}{P(Y=0|X)}} > 0 \Rightarrow \\ \log f_1(x) + \log \pi_1 > \log f_0(x) + \log \pi_0 \]

This turns out to be a linear function of \(x\)!

```
lda_log_ratio <- function(balance, lda_stats) {
n <- length(balance)
# subtract class mean
centered_balance <- rep(balance, 2) - rep(lda_stats$class_mean, each=n)
# scale by standard deviation
scaled_balance <- centered_balance / sqrt(lda_stats$sigma2[1])
# compute log normal density and add log class prior
lprobs <- dnorm(scaled_balance, log=TRUE) + log(rep(lda_stats$class_prior, each=n))
# compute log ratio of class probabilities
lprobs <- matrix(lprobs, nc=2)
colnames(lprobs) <- lda_stats$default
lprobs[,"Yes"] - lprobs[,"No"]
}
```

## 29.6 Summary

We approach classification as a class probability estimation problem. Logistic regression and LDA partition predictor space with linear functions. Logistic regression learns parameter using Maximum Likelihood (numerical optimization), while LDA learns parameter using means and variances (and assuming normal distribution)