# 25 Experiment design and hypothesis testing

In this section we see one instance in which we can apply the CLT in data analysis.

## 25.1 Inference

One way to think about how we use probability in data analysis (statistical and machine learning) is as follows:

The LLN tells us that our estimate $$\hat{p}$$ will be close to the parameter $$p$$ on average; the CLT lets us answer how confident we are that we found $$p$$. We do this by constructing a confidence interval as follows. Since $$\hat{p} \sim N(p,\frac{\sqrt{p(1-p)}}{\sqrt{n}})$$, we want to find an interval $$[\hat{p}_{-}, \hat{p}_{+}]$$, with $$\hat{p}$$ at its center, containing 95% of the probability specified by the CLT. Why? In that case, there is a 95% chance that the value of parameter $$p$$ will be within that interval.

Now, how do we calculate this interval so that it contains 95% of the probability? The probability for each tail (values outside this interval) will be $$(1-.95)/2$$ (since there are two tails). So, the lower value of the interval will be the one where the normal probability distribution (with mean $$\hat{p}$$ and standard deviation $$\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}$$) satisfies $$P(Y \leq \hat{p}_{-}) = .05/2$$, which we can calculate using the qnorm function in R:

\begin{align} \hat{p}_{-} & = \mathtt{qnorm}(.05/2, \hat{p}, \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}) \\ {} & = \hat{p} + \mathtt{qnorm}(.05/2,0, \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}) \end{align}

The upper value of the interval is computed with probability $$1-(.05/2)$$, which by the symmetry of the normal distribution is given by $$\hat{p}_{+} = \hat{p} - \mathtt{qnorm}(.05/2,0, \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}})$$ (note the sign flip: $$\mathtt{qnorm}(.05/2)$$ is negative).

Let’s see how these intervals look for our twitter bot example:

```r
library(dplyr)

get_estimate <- function(n, p=0.7) mean(sample(c(0,1), size=n, replace=TRUE, prob=c(1-p,p)))

set.seed(1)
# let's construct confidence intervals for samples of size n=10,100,500,1000,10000
tab <- data.frame(sample_size=c(10,100,500,1000,10000)) %>%
  mutate(phat = sapply(sample_size, get_estimate)) %>%
  mutate(se = sqrt(phat*(1-phat)) / sqrt(sample_size)) %>%
  mutate(lower = phat + qnorm(.05/2, sd=se)) %>%
  mutate(upper = phat - qnorm(.05/2, sd=se))
```

```r
knitr::kable(tab)
```

| sample_size |   phat |        se |     lower |     upper |
|------------:|-------:|----------:|----------:|----------:|
|          10 | 0.7000 | 0.1449138 | 0.4159742 | 0.9840258 |
|         100 | 0.6900 | 0.0462493 | 0.5993530 | 0.7806470 |
|         500 | 0.7020 | 0.0204546 | 0.6619097 | 0.7420903 |
|        1000 | 0.6940 | 0.0145727 | 0.6654380 | 0.7225620 |
|       10000 | 0.6982 | 0.0045904 | 0.6892030 | 0.7071970 |
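As a sanity check (an aside; this chapter's code is in R), the $$n=500$$ row of the table can be reproduced with Python's standard library, where `statistics.NormalDist.inv_cdf` plays the role of `qnorm`:

```python
from statistics import NormalDist
from math import sqrt

# values from the n=500 row of the table above
phat, n = 0.702, 500
se = sqrt(phat * (1 - phat)) / sqrt(n)  # standard error

# qnorm(.05/2, sd=se) in R corresponds to NormalDist(0, se).inv_cdf(.05/2)
lower = phat + NormalDist(0, se).inv_cdf(0.05 / 2)
upper = phat - NormalDist(0, se).inv_cdf(0.05 / 2)

print(round(se, 7), round(lower, 7), round(upper, 7))
# 0.0204546 0.6619097 0.7420903
```

These match the se, lower, and upper columns of the table.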

For our sample of $$n=500$$, we would say that our estimate of $$p$$ is $$0.7 \pm 0.04$$. A compact way of writing that is that our estimate of $$p$$ is $${}_{0.66}0.7_{0.74}$$.

### 25.1.1 Hypothesis testing

How else is this framework useful? Suppose that before I sampled tweets I thought (hypothesized) that more than 50% of tweets are bot-generated. One very popular way of thinking about this problem is to try to reject the hypothesis that this is not the case. In this case, we have a null hypothesis (that 50% or less of tweets are bot-generated) against an alternative hypothesis (that more than 50% of tweets are bot-generated). You will see this written in statistics textbooks as:

\begin{align} H_0: \, & p \leq .5 & \textrm{(null)} \\ H_1: \, & p > .5 & \textrm{(alternative)} \end{align}

Note: this is a one-sided test vs. a two-sided test where the null hypothesis is that $$p=.5$$ and the alternative is $$p \neq .5$$. According to the CLT, estimates $$\hat{p}$$ of $$p$$ from $$n$$ samples would be distributed as $$N(.5, \frac{\sqrt{.5(1-.5)}}{\sqrt{n}})$$ (we use $$p=.5$$ as this is the worst case for the hypothesis we want to test).

Once we do have our sample of $$n$$ tweets we can get an estimate $$\hat{p}$$ as we did before. If we see that $$\hat{p}$$ (the sample mean from our sample of tweets) is too far from $$p=.5$$, then we can reject the null hypothesis, since the estimate we derived from the data is not statistically consistent with the null hypothesis. Now, how do we decide that our estimate $$\hat{p}$$ is too far? Here, we use the probability model given by the CLT. If $$P(Y \geq \hat{p}) \leq .05$$ under the null model (with $$p=.5$$), we say it is too far and we reject.

This 5% rejection threshold is conservative, but somewhat arbitrary. So we report one more quantity, $$P(Y \geq \hat{p})$$ under the null model (the infamous p-value), which lets us say: we could reject this hypothesis at any threshold greater than this p-value.

Let’s see what testing the hypothesis $$p > .5$$ looks like for our tweet example:

```r
tab <- tab %>%
  mutate(p_value = 1 - pnorm(phat, mean=.5, sd=se))
knitr::kable(tab)
```
| sample_size |   phat |        se |     lower |     upper |   p_value |
|------------:|-------:|----------:|----------:|----------:|----------:|
|          10 | 0.7000 | 0.1449138 | 0.4159742 | 0.9840258 | 0.0837731 |
|         100 | 0.6900 | 0.0462493 | 0.5993530 | 0.7806470 | 0.0000199 |
|         500 | 0.7020 | 0.0204546 | 0.6619097 | 0.7420903 | 0.0000000 |
|        1000 | 0.6940 | 0.0145727 | 0.6654380 | 0.7225620 | 0.0000000 |
|       10000 | 0.6982 | 0.0045904 | 0.6892030 | 0.7071970 | 0.0000000 |
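The same cross-check works here (again a Python stdlib aside; the chapter's code is in R): the p-value in the first row ($$n=10$$, $$\hat{p}=0.7$$) is $$P(Y \geq \hat{p})$$ under the null model, with the same plug-in standard error as the table:

```python
from statistics import NormalDist
from math import sqrt

phat, n = 0.7, 10
se = sqrt(phat * (1 - phat)) / sqrt(n)  # plug-in standard error, as in the table

# 1 - pnorm(phat, mean=.5, sd=se) in R:
p_value = 1 - NormalDist(0.5, se).cdf(phat)
print(round(p_value, 7))  # matches the 0.0837731 in the table
```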

Notice that rejection occurs exactly when the parameter value for the null hypothesis, $$p=.5$$, falls outside the 95% confidence interval. Note also that these results hold for $$n$$ sufficiently large that the normal distribution in the CLT provides a good approximation of the distribution of estimates $$\hat{p}$$. When $$n$$ is smaller, the $$t$$-distribution provides a better approximation of the distribution of estimates $$\hat{p}$$. In that case, instead of using pnorm in the calculations above, we would use pt (for the $$t$$-distribution), and the testing procedure is referred to as a $$t$$-test (one-sided or two-sided, as above). As $$n$$ grows, the $$t$$-distribution approaches the normal distribution, which is why analysts use the $$t$$-test regularly.

## 25.2 A/B Testing

A classic experimental design where hypothesis testing is commonly used is A/B testing. Here we are interested in seeing if proposed changes to a webpage have a desired effect. For example, we would like to know if page visitors follow a link more often after a page redesign.

Here we have two estimates $$\hat{p}_A$$ and $$\hat{p}_B$$, the proportion of clicks for design A and B respectively. The null hypothesis we would test is that there is no difference in proportions between the two designs. Mathematically, we would like to know “What is the probability that we observe a difference in proportions this large under the null hypothesis”. We will work this out as a homework exercise.
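To give a flavor of where the homework is headed, here is a sketch of one common approach (a pooled two-proportion $$z$$-test, with made-up click counts; this is an illustration, not necessarily the exact method the exercise asks for), again in Python:

```python
from statistics import NormalDist
from math import sqrt

# hypothetical click data: n visitors and x clicks for each design
n_A, x_A = 1000, 100  # design A: phat_A = 0.10
n_B, x_B = 1000, 130  # design B: phat_B = 0.13

phat_A, phat_B = x_A / n_A, x_B / n_B

# under the null (no difference) the best estimate of the common p pools both samples
p_pool = (x_A + x_B) / (n_A + n_B)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))

# two-sided p-value: probability of a difference at least this large under the null
z = (phat_B - phat_A) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
```

With these hypothetical counts the p-value comes out below .05, so the difference between designs would be declared statistically significant at the 5% level.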

## 25.3 Summary

Inference: estimate a parameter from data based on an assumed probability model (for example, by matching expectations; we’ll see another method, maximum likelihood, later).

For averages, the LLN and CLT tell us how to compute probabilities for a single parameter estimate, that is, one derived from a single dataset of samples. With these probabilities we can construct confidence intervals for our estimate.

Testing: having a hypothesis about our parameter of interest, we can use probability under this hypothesis to see how statistically consistent our data is with it, and reject the hypothesis if the data is not statistically consistent enough (again using probabilities from the CLT when dealing with averages).

## 25.4 Probability Distributions

In this example we saw three distributions:

### 25.4.1 Bernoulli

Notation: $$X \sim \mathrm{Bernoulli}(p)$$.
Values: $$X \in \{0,1\}$$
Parameter: $$p$$, $$p(X=1)=p$$ (probability of success).
Expected Value: $$\mathbb{E} X = p$$
Variance: $$\mathrm{var}(X) = p(1-p)$$.

We can write the probability mass function as

$p(X=x)=p^x(1-p)^{(1-x)}$
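These facts are easy to verify numerically by enumerating the two outcomes; a small Python check (an aside from the chapter's R examples, with an arbitrary choice of $$p$$):

```python
p = 0.3  # an arbitrary success probability

# pmf from the formula above: p^x (1-p)^(1-x) for x in {0, 1}
pmf = {x: p**x * (1 - p)**(1 - x) for x in (0, 1)}

mean = sum(x * pmf[x] for x in (0, 1))
var = sum((x - mean)**2 * pmf[x] for x in (0, 1))

assert pmf[1] == p and pmf[0] == 1 - p
assert abs(mean - p) < 1e-12           # E[X] = p
assert abs(var - p * (1 - p)) < 1e-12  # var(X) = p(1-p)
```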

### 25.4.2 Binomial

This corresponds to the number of $$1$$’s in a draw of $$n$$ independent $$\mathrm{Bernoulli}(p)$$ random variables.

Notation: $$X \sim \mathrm{Bin}(n,p)$$.
Values: $$X \in \{0,1,2,\ldots,n\}$$
Parameters: $$p$$ (probability of success), $$n$$ number of Bernoulli draws
Expected Value: $$\mathbb{E} X=np$$
Variance: $$\mathrm{var}(X) = np(1-p)$$

Here the probability mass function is a little more complicated since there are many different ways in which $$n$$ draws of independent Bernoulli random variables result in the same number of successes:

$p(X=k) = \binom{n}{k} p^k(1-p)^{n-k}$
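A quick numeric check of this pmf in Python (an aside; `math.comb` computes $$\binom{n}{k}$$): the probabilities sum to 1 and the mean and variance match $$np$$ and $$np(1-p)$$, here for arbitrary example parameters:

```python
from math import comb

n, p = 10, 0.3  # arbitrary example parameters

def binom_pmf(k, n, p):
    # number of arrangements with k successes, times the probability of each
    return comb(n, k) * p**k * (1 - p)**(n - k)

pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean)**2 * q for k, q in enumerate(pmf))

assert abs(sum(pmf) - 1) < 1e-12
assert abs(mean - n * p) < 1e-12           # E[X] = np
assert abs(var - n * p * (1 - p)) < 1e-12  # var(X) = np(1-p)
```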

### 25.4.3 Normal (Gaussian) distribution

Notation: $$X \sim N(\mu,\sigma)$$
Values: $$X \in \mathbb{R}$$
Parameters: mean $$\mu$$, standard deviation $$\sigma$$
Expected Value: $$\mathbb{E} X = \mu$$
Variance: $$\mathrm{var}(X) = \sigma^2$$

The probability density function was given above.

A useful reference for probability distributions can be found here: https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

### 25.4.4 Distributions in R

For a majority of common distributions, R has the so-called d, p, q, r family of functions:

| function | use |
|----------|-----|
| d | probability density (or mass) function |
| p | cumulative probability function |
| q | quantile function |
| r | random value generator |

For example, to use these for the Binomial distribution:

```r
# using size=10, prob=.3

# compute the probability mass function value for k=4 successes
dbinom(4, size=10, prob=.3)

# compute the cumulative probability function for k=4 successes
pbinom(4, size=10, prob=.3)

# compute the number of successes corresponding to the .80th quantile
qbinom(.8, size=10, prob=.3)

# generate a random value k
rbinom(1, size=10, prob=.3)
```
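For readers working in Python instead, the same four computations can be sketched with the standard library. There is no built-in d/p/q/r family, so these are hand-rolled helper functions (the names are borrowed from R for clarity):

```python
import random
from math import comb

def dbinom(k, size, prob):
    # probability mass function
    return comb(size, k) * prob**k * (1 - prob)**(size - k)

def pbinom(k, size, prob):
    # cumulative probability function
    return sum(dbinom(i, size, prob) for i in range(k + 1))

def qbinom(q, size, prob):
    # quantile: smallest k whose cumulative probability reaches q
    return next(k for k in range(size + 1) if pbinom(k, size, prob) >= q)

def rbinom(size, prob):
    # one random draw: count successes in `size` Bernoulli trials
    return sum(random.random() < prob for _ in range(size))

print(round(dbinom(4, 10, 0.3), 4))  # 0.2001
print(qbinom(0.8, 10, 0.3))          # 4
```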