POSTED: April 7, 2020
LAST UPDATE: April 20, 2020
DUE: April 20, 2020

In this project you will practice and experiment with linear regression using data from gapminder.org. I recommend spending a little time looking at the material there; it is quite an informative site.

We will use a subset of the gapminder data, provided by Jennifer Bryan and described on its GitHub page.

The following commands load the dataset in R:

library(gapminder)
data(gapminder)

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

For this exercise you will explore how life expectancy has changed over 50 years across the world, and how economic measures like gross domestic product (GDP) are related to it.

If you are working in Python (or prefer to read a CSV file in R), you can get the data from http://www.hcbravo.org/IntroDataSci/misc/gapminder.csv.
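For readers working in Python, the CSV can be read with the standard library alone. The sketch below is one possible approach, not a required one. It assumes the CSV header uses the same column names as the R tibble shown above (country, continent, year, lifeExp, pop, gdpPercap), which you should verify after downloading. The helper names parse_gapminder and load_gapminder are hypothetical.

```python
import csv
import io
import urllib.request

# URL given in the project description.
DATA_URL = "http://www.hcbravo.org/IntroDataSci/misc/gapminder.csv"


def parse_gapminder(text):
    """Parse gapminder CSV text into a list of dicts with typed fields.

    Assumption: the header names match the R tibble
    (country, continent, year, lifeExp, pop, gdpPercap).
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "country": rec["country"],
            "continent": rec["continent"],
            "year": int(rec["year"]),
            "lifeExp": float(rec["lifeExp"]),
            # Go through float first in case pop is written like "8425333.0".
            "pop": int(float(rec["pop"])),
            "gdpPercap": float(rec["gdpPercap"]),
        })
    return rows


def load_gapminder(url=DATA_URL):
    """Download the CSV and parse it (requires network access)."""
    with urllib.request.urlopen(url) as resp:
        return parse_gapminder(resp.read().decode("utf-8"))


# Offline check using two rows copied from the table preview above.
sample = (
    "country,continent,year,lifeExp,pop,gdpPercap\n"
    "Afghanistan,Asia,1952,28.8,8425333,779.\n"
    "Afghanistan,Asia,1957,30.3,9240934,821.\n"
)
sample_rows = parse_gapminder(sample)
print(sample_rows[0]["year"], sample_rows[0]["lifeExp"])
```

Calling load_gapminder() with no arguments downloads the full file. If you already use pandas, reading the same URL with its CSV reader is a common alternative.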

Exercise 1: Make a scatter plot of life expectancy across time.

Question 1: Is there a general trend (e.g., increasing or decreasing) for life expectancy across time? Is this trend linear? (Answer this qualitatively from the plot; you will do a statistical analysis of this question shortly.)

A slightly different way of making the same plot is looking at the distribution of life expectancy across countries as it changes over time:

library(tidyverse)
library(ggplot2)

gapminder %>%
  ggplot(aes(x=factor(year), y=lifeExp)) +
    geom_violin() +
    labs(title="Life expectancy over time",
         x = "year",
         y = "life expectancy")

This type of plot is called a violin plot. It displays the distribution of the variable on the y-axis for each value of the variable on the x-axis.
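The width of a violin at a given height is a smoothed density estimate of the y values in that group. As a rough illustration of the idea (not the exact procedure ggplot2 uses), here is a minimal Gaussian kernel density estimate in plain Python. The sample values are the ten life expectancy numbers from the table preview above, which come from one country over several years. They stand in for "some sample of values" only; a real violin in the plot above pools all countries within one year. The bandwidth of 2.0 is an arbitrary choice for illustration.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point y."""
    n = len(values)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(y):
        # Average of Gaussian bumps centered at each observed value.
        return scale * sum(
            math.exp(-0.5 * ((y - v) / bandwidth) ** 2) for v in values
        )

    return density


# Life expectancy values copied from the ten preview rows (illustration only).
life_exp = [28.8, 30.3, 32.0, 34.0, 36.1, 38.4, 39.9, 40.8, 41.7, 41.8]

# The bandwidth is an assumption; plotting libraries pick one automatically.
density = gaussian_kde(life_exp, bandwidth=2.0)

# Text rendering of half a violin: longer bars mean more mass at that value.
for y in range(26, 46, 2):
    bar = "#" * round(density(y) * 400)
    print(f"{y:>3} {bar}")
```

A violin plot mirrors this curve on both sides of a vertical axis, once per group on the x-axis.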

Question 2: How would you describe the distribution of life expectancy across countries for individual years? Is it skewed, or not? Unimodal or not? Symmetric around its center?

Based on this plot, consider the following questions.

Question 3: Suppose I fit a linear regression model of life expectancy vs. year (treating year as a continuous variable) and test for a relationship between year and life expectancy. Would you reject the null hypothesis of no relationship? (Answer without fitting the model yet; I am testing your intuition.)

Question 4: What would a violin plot of residuals from the linear model in Question 3 vs. year look like? (Again, don’t do the analysis yet; answer this intuitively.)

Question 5: According to the assumptions of the linear regression model, what should that violin plot look like?

Exercise 2: Fit a linear regression model using the lm function for life expectancy vs. year (as a continuous variable). Use the broom::tidy function to look at the resulting model.
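To make concrete what a least-squares fit of y on x computes, here is a small plain-Python sketch of the closed-form solution. It is not a replacement for lm and broom::tidy: it reports no standard errors or p-values, and it uses only the ten preview rows shown earlier (a single country), so its slope is not the answer to the questions below. The function name fit_line is hypothetical.

```python
# Year and life expectancy copied from the ten preview rows shown earlier
# (a single country), NOT the full 1,704-row dataset.
years = [1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997]
life_exp = [28.8, 30.3, 32.0, 34.0, 36.1, 38.4, 39.9, 40.8, 41.7, 41.8]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(years, life_exp)
print(f"intercept = {intercept:.2f}, slope = {slope:.3f} per calendar year")
```

The slope is read as the average change in life expectancy per one-unit (one-year) increase in year, which is the same interpretation you will apply to the estimate column of the tidy output.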

Question 6: On average, by how much does life expectancy increase every year around the world?

Question 7: Do you reject the null hypothesis of no relationship between year and life expectancy? Why?

Exercise 3: Make a violin plot of residuals vs. year for the linear model from Exercise 2 (use the broom::augment function).

Question 8: Does the plot of Exercise 3 match your expectations (as you answered Question 4)?

Exercise 4: Make a boxplot (or violin plot) of model residuals vs. continent.

Question 9: Is there a dependence between model residual and continent? If so, what would that suggest when performing a regression analysis of life expectancy across time?

Exercise 5: Use geom_smooth(method=lm) in ggplot as part of a scatter plot of life expectancy vs. year, grouped by continent (e.g., using the color aesthetic mapping).

Question 10: Based on this plot, should your regression model include an interaction term for continent and year? Why?

Exercise 6: Fit a linear regression model for life expectancy including a term for an interaction between continent and year. Use the broom::tidy function to show the resulting model.
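Written out, one common form of such a model is given below. This is a sketch under two assumptions that the exercise does not state explicitly: the model also includes main effects for year and continent (as an R formula such as lifeExp ~ continent * year would), and continent is dummy-coded against a reference level c_0, which is R's default for unordered factors. The symbols are introduced here for illustration only.

```latex
\mathrm{lifeExp}_i
  = \beta_0 + \beta_1\,\mathrm{year}_i
  + \sum_{c \neq c_0} \bigl(\gamma_c + \delta_c\,\mathrm{year}_i\bigr)\,
    \mathbf{1}[\mathrm{continent}_i = c]
  + \varepsilon_i
```

Under this coding, \beta_1 is the yearly slope for the reference continent, and each \delta_c is the difference between continent c's slope and the reference slope, so the slope for continent c is \beta_1 + \delta_c.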

Question 11: Are all parameters in the model significantly different from zero? If not, which are not significantly different from zero?

Question 12: On average, by how much does life expectancy increase each year for each continent? (Provide code to answer this question by extracting the relevant estimates from the model fit.)

Exercise 7: Make a residuals vs. year violin plot for the interaction model. Comment on how well it matches assumptions of the linear regression model. Do the same for a residuals vs. fitted values plot.

Submission

Prepare and knit a single R Markdown file or Jupyter notebook that includes the code, plots, and answers for the exercises and questions described above.

All axes in plots should be labeled in an informative manner. Your answers to any exercise that refers to a plot should include both (a) a text description of your plot, and (b) a sentence or two of interpretation.