8 Basic plotting with ggplot
We will spend a good amount of time in this course discussing data visualization, which serves many important roles in data analysis. We use it to understand the characteristics of a dataset throughout an analysis, and it is a key element of communicating the insights we derive from data to our target audience. In this section, we introduce the basic functionality of the `ggplot2` package to start our discussion of visualization.
The `ggplot2` package is designed to work well with the `tidyverse` set of packages. As such, it is designed around the Entity-Attribute data model, and it can be included as part of data frame operation pipelines. Let’s start with a simple example: a dot plot of the number of arrests per district in our dataset:
arrest_tab %>%
  group_by(district) %>%
  summarize(num_arrests=n()) %>%
  ggplot(mapping=aes(y=district, x=num_arrests)) +
    geom_point()
The `ggplot` design is very elegant. It takes some thinking to get used to, but it is extremely powerful. The central premise is to characterize the building blocks of a `ggplot` plot as follows:
- The data that goes into a plot, a data frame of entities and attributes
- The mapping between data attributes and graphical (aesthetic) characteristics
- The geometric representation of these graphical characteristics
So in our example we can fill in these three parts as follows:
- Data: We pass a data frame to the `ggplot` function with the `%>%` operator at the end of the `group_by`-`summarize` pipeline.
- Mapping: Here we map the `num_arrests` attribute to the `x` position in the plot and the `district` attribute to the `y` position in the plot. Every `ggplot` will contain one or more `aes` calls.
- Geometry: Here we choose points as the geometric representation of our chosen graphical characteristics using the `geom_point` function.
In general, a `ggplot` call will have the following structure:
<data_frame> %>%
  ggplot(mapping=aes(<graphical_characteristic>=<attribute>)) +
    geom_<representation>()
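To make the template concrete, here is a minimal sketch built on a small made-up data frame. The name `toy_arrests` and its contents are hypothetical and only stand in for `arrest_tab`, which is not reproduced here. The sketch also illustrates that piping a data frame into `ggplot` is equivalent to passing it as the `data` argument.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per arrest, recording its district
toy_arrests <- tibble(
  district = c("NORTHERN", "SOUTHERN", "SOUTHERN",
               "EASTERN", "NORTHERN", "SOUTHERN")
)

# Data: count arrests per district
arrest_counts <- toy_arrests %>%
  group_by(district) %>%
  summarize(num_arrests = n())

# Pipeline form: the data frame arrives as ggplot's first argument
p_pipe <- arrest_counts %>%
  ggplot(mapping = aes(x = num_arrests, y = district)) +
    geom_point()

# Explicit form: the same plot with the data argument spelled out
p_explicit <- ggplot(data = arrest_counts,
                     mapping = aes(x = num_arrests, y = district)) +
  geom_point()
```

Printing either object, for example with `print(p_pipe)`, draws the plot.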
8.1 Plot Construction Details
8.1.1 Mappings
Some of the graphical characteristics we will commonly map attributes to include:
Argument | Definition |
---|---|
x | position along x axis |
y | position along y axis |
color | color |
shape | shape (applicable to e.g., points) |
size | size |
label | string used as label (applicable to text) |
8.1.2 Representations
Representations we will use frequently are:
Function | Representation |
---|---|
geom_point | points |
geom_bar | rectangles |
geom_text | strings |
geom_smooth | smoothed line (advanced) |
geom_hex | hexagonal binning |
We can include multiple geometric representations in a single plot, for example points and text, by adding (`+`) multiple `geom_<representation>` functions. We can also include mappings inside a `geom_` call to map characteristics to attributes strictly for that specific representation. For example, `geom_point(mapping=aes(color=<attribute>))` maps color to some attribute only for the point representation specified by that call. Mappings given in the `ggplot` call apply to all representations added to the plot.
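As an illustration of layering, the following sketch combines points and text in one plot. It assumes the `mpg` dataset that ships with `ggplot2`; the summary table `class_summary` and its column names are chosen here only for the example.

```r
library(dplyr)
library(ggplot2)

# Mean engine displacement and highway mileage per vehicle class
class_summary <- mpg %>%
  group_by(class) %>%
  summarize(mean_displ = mean(displ), mean_hwy = mean(hwy))

p_layers <- class_summary %>%
  # x and y are given in the ggplot call, so both layers share them
  ggplot(mapping = aes(x = mean_displ, y = mean_hwy)) +
    # color is mapped only for the point layer
    geom_point(mapping = aes(color = class)) +
    # label is mapped only for the text layer; vjust nudges text above the point
    geom_text(mapping = aes(label = class), vjust = -1)
```

Because `color` is mapped inside `geom_point`, the text labels keep the default color while the points are colored by class.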
This cheat sheet is very handy: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
8.2 Frequently Used Plots
We will look at data visualization in more detail later in the course, but for now we list a few common plots used in data analysis and show how they are created using `ggplot`. Let’s switch to the `mpg` dataset for our examples:
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans
## <chr> <chr> <dbl> <int> <int> <chr>
## 1 audi a4 1.8 1999 4 auto…
## 2 audi a4 1.8 1999 4 manu…
## 3 audi a4 2 2008 4 manu…
## 4 audi a4 2 2008 4 auto…
## 5 audi a4 2.8 1999 6 auto…
## 6 audi a4 2.8 1999 6 manu…
## 7 audi a4 3.1 2008 6 auto…
## 8 audi a4 q… 1.8 1999 4 manu…
## 9 audi a4 q… 1.8 1999 4 auto…
## 10 audi a4 q… 2 2008 4 manu…
## # … with 224 more rows, and 5 more variables:
## # drv <chr>, cty <int>, hwy <int>, fl <chr>,
## # class <chr>
8.2.1 Scatter plot
Used to visualize the relationship between two attributes.
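A minimal sketch of a scatter plot on the `mpg` dataset follows. The choice of attributes (`displ` against `hwy`, colored by `drv`) is an assumption made for illustration.

```r
library(dplyr)
library(ggplot2)

# Engine displacement vs. highway mileage, one point per car
p_scatter <- mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = drv))  # color points by drive type
```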
8.2.2 Bar graph
Used to visualize the relationship between a continuous attribute and a categorical (or discrete) attribute.
mpg %>%
  group_by(cyl) %>%
  summarize(mean_mpg=mean(hwy)) %>%
  ggplot(mapping=aes(x=cyl, y=mean_mpg)) +
    geom_bar(stat="identity")
8.2.3 Histogram
Used to visualize the distribution of the values of a numeric attribute.
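One way to draw a histogram is with `geom_histogram`, which bins a numeric attribute and counts the entities in each bin. The sketch below assumes the `mpg` dataset; the attribute `hwy` and the bin width of 2 are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Distribution of highway mileage, in bins 2 mpg wide
p_hist <- mpg %>%
  ggplot(mapping = aes(x = hwy)) +
    geom_histogram(binwidth = 2)
```

Only `x` is mapped here; the bar heights are computed by the geometry from the data.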
8.2.4 Boxplot
Used to visualize the distribution of a numeric attribute conditioned on a categorical attribute.
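A possible sketch using `geom_boxplot` on the `mpg` dataset is shown below. The attributes used (`hwy` for each `class`) are chosen only for illustration.

```r
library(dplyr)
library(ggplot2)

# One box per vehicle class, summarizing highway mileage within that class
p_box <- mpg %>%
  ggplot(mapping = aes(x = class, y = hwy)) +
    geom_boxplot()
```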
Exercise: Make a box plot showing the distribution of ages for arrests for the SOUTHERN district conditioned on sex.