8 Basic plotting with ggplot

We will spend a good amount of time in the course discussing data visualization, which plays two important roles in data analysis. We use it throughout an analysis to understand the characteristics of a dataset, and it is a key element of communicating the insights we derive from data to our target audience. In this section, we introduce the basic functionality of the ggplot2 package (whose main function is ggplot) to start our discussion of visualization.

The ggplot2 package is designed to work well with the tidyverse set of packages. As such, it is built around the Entity-Attribute data model, and a ggplot call can be placed at the end of a data frame operation pipeline. Let’s start with a simple example. Let’s create a dot plot of the number of arrests per district in our dataset:
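A minimal sketch of such a dot plot follows. The arrests data frame itself is not shown in this section, so the code builds a tiny stand-in called arrest_tab (a hypothetical name) with made-up rows, one row per arrest and a district attribute. The names num_arrests and district match the ones used in the discussion that follows.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the arrests data: one row per arrest.
# The values are made up purely for illustration.
arrest_tab <- data.frame(
  district = c("SOUTHERN", "SOUTHERN", "NORTHERN",
               "CENTRAL", "CENTRAL", "CENTRAL")
)

arrests_plot <- arrest_tab %>%
  group_by(district) %>%                  # one group per district
  summarize(num_arrests = n()) %>%        # count arrests in each group
  ggplot(mapping = aes(x = num_arrests, y = district)) +
    geom_point()                          # draw one point per district

arrests_plot
```

With the real arrests data frame in place of the stand-in, the same pipeline should produce one point per district.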

The ggplot design is elegant and extremely powerful, although it takes some thinking to get used to. Its central premise is that every plot is built from three pieces:

  1. The data that goes into a plot, a data frame of entities and attributes
  2. The mapping between data attributes and graphical (aesthetic) characteristics
  3. The geometric representation of these graphical characteristics

So in our example we can fill in these three parts as follows:

  1. Data: We pass a data frame to the ggplot function with the %>% operator at the end of the group_by-summarize pipeline.

  2. Mapping: Here we map the num_arrests attribute to the x position in the plot and the district attribute to the y position in the plot. Every ggplot plot contains one or more aes calls.

  3. Geometry: Here we choose points as the geometric representations of our chosen graphical characteristics using the geom_point function.

In general, the ggplot call will have the following structure:
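The general structure can be sketched as a template followed by one concrete instance. The angle-bracket placeholders are not valid R and sit in comments. The instance assumes the mpg data frame that ships with ggplot2, and structure_plot is just an illustrative name.

```r
library(dplyr)
library(ggplot2)

# Template (placeholders in angle brackets, not runnable as written):
#   <data_frame> %>%
#     ggplot(mapping = aes(<graphical_characteristic> = <attribute>)) +
#       geom_<representation>()

# One concrete instance of the template:
structure_plot <- mpg %>%                       # 1. data
  ggplot(mapping = aes(x = displ, y = hwy)) +   # 2. mapping
    geom_point()                                # 3. geometry

structure_plot
```

Note that the data frame reaches ggplot through %>%, while the geometric representation is attached with +.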

8.1 Plot Construction Details

8.1.1 Mappings

Some of the graphical characteristics we will commonly map attributes to include:

Argument   Definition
x          position along x axis
y          position along y axis
color      color
shape      shape (applicable to, e.g., points)
size       size
label      string used as label (applicable to text)

8.1.2 Representations

Representations we will use frequently are:

Function      Representation
geom_point    points
geom_bar      rectangles
geom_text     strings
geom_smooth   smoothed line (advanced)
geom_hex      hexagonal binning

We can include multiple geometric representations in a single plot, for example points and text, by adding (+) multiple geom_<representation> functions. Also, we can include mappings inside a geom_ call to map characteristics to attributes strictly for that specific representation. For example geom_point(mapping=aes(color=<attribute>)) maps color to some attribute only for the point representation specified by that call. Mappings given in the ggplot call apply to all representations added to the plot.
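As a hedged illustration of both ideas, the sketch below adds a point layer and a text layer to the same plot. It assumes the mpg data frame bundled with ggplot2, summarized to one row per manufacturer. The names mean_displ, mean_hwy, num_models and layered_plot are introduced here only for the example.

```r
library(dplyr)
library(ggplot2)

layered_plot <- mpg %>%
  group_by(manufacturer) %>%
  summarize(mean_displ = mean(displ),
            mean_hwy = mean(hwy),
            num_models = n()) %>%
  # x and y are given in the ggplot call, so they apply to every layer
  ggplot(mapping = aes(x = mean_displ, y = mean_hwy)) +
    # color is mapped only for the points
    geom_point(mapping = aes(color = num_models)) +
    # label is mapped only for the text; vjust nudges it above the point
    geom_text(mapping = aes(label = manufacturer), vjust = -0.8)

layered_plot
```

Because color appears only inside geom_point, the text layer is not colored by num_models, and because label appears only inside geom_text, the point layer ignores it.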

This cheat sheet is very handy: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

8.2 Frequently Used Plots

We will look at data visualization comprehensively later in the course. For now, we list a few plots commonly used in data analysis and show how they are created using ggplot. Let’s switch to the mpg data frame for our examples:
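The mpg data frame ships with ggplot2, so loading the package should be enough to make it available. Printing it gives a preview like the one shown next (the exact layout depends on console width and package versions):

```r
library(ggplot2)

# mpg is a tibble bundled with ggplot2; printing it shows the first rows
mpg
```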

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans
##    <chr>        <chr> <dbl> <int> <int> <chr>
##  1 audi         a4      1.8  1999     4 auto…
##  2 audi         a4      1.8  1999     4 manu…
##  3 audi         a4      2    2008     4 manu…
##  4 audi         a4      2    2008     4 auto…
##  5 audi         a4      2.8  1999     6 auto…
##  6 audi         a4      2.8  1999     6 manu…
##  7 audi         a4      3.1  2008     6 auto…
##  8 audi         a4 q…   1.8  1999     4 manu…
##  9 audi         a4 q…   1.8  1999     4 auto…
## 10 audi         a4 q…   2    2008     4 manu…
## # … with 224 more rows, and 5 more variables:
## #   drv <chr>, cty <int>, hwy <int>, fl <chr>,
## #   class <chr>

8.2.1 Scatter plot

Used to visualize the relationship between two attributes.
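A minimal sketch, assuming we want to relate engine displacement (displ) to highway mileage (hwy) in mpg. Coloring the points by number of cylinders (cyl) is an optional extra chosen for this example.

```r
library(dplyr)
library(ggplot2)

scatter_plot <- mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = cyl))   # one point per car

scatter_plot
```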

8.2.2 Bar graph

Used to visualize the relationship between a continuous attribute and a categorical (or discrete) attribute.
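One way to sketch this with mpg is to compute the mean highway mileage for each number of cylinders and draw one bar per group. The name mean_mpg is introduced here for illustration. The argument stat = "identity" tells geom_bar to use the y values as given rather than counting rows.

```r
library(dplyr)
library(ggplot2)

bar_plot <- mpg %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(hwy)) %>%        # one row per cylinder count
  ggplot(mapping = aes(x = cyl, y = mean_mpg)) +
    geom_bar(stat = "identity")              # bar height = mean_mpg

bar_plot
```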

8.2.3 Histogram

Used to visualize the distribution of the values of a numeric attribute.
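A minimal sketch for the hwy attribute of mpg, using geom_histogram (a ggplot2 function not listed in the table of representations above). The choice of 30 bins is an arbitrary assumption; in practice it is worth trying several values.

```r
library(dplyr)
library(ggplot2)

hist_plot <- mpg %>%
  ggplot(mapping = aes(x = hwy)) +
    geom_histogram(bins = 30)   # only x is mapped; counts are computed for us

hist_plot
```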

8.2.4 Boxplot

Used to visualize the distribution of a numeric attribute conditioned on a categorical attribute, with one box per category.
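A minimal sketch, assuming we want the distribution of highway mileage (hwy) for each vehicle class in mpg, using geom_boxplot:

```r
library(dplyr)
library(ggplot2)

box_plot <- mpg %>%
  ggplot(mapping = aes(x = class, y = hwy)) +
    geom_boxplot()   # one box per value of class

box_plot
```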

Exercise: Make a box plot showing the distribution of ages for arrests for the SOUTHERN district conditioned on sex.