The grammar of graphics: R code

Principles of graph grammar and tidy data

Dec 29, 2021

Mapping
- Practice
Facets
- Practice
Scale
- Practice
Statistics
- Practice
Geometries
- Examples
- Practice
Coordinates
Themes
- Practice
Quick plot
Annotate
- Practice
Plot Composition

logo

If you’re at all familiar with ggplot, you may know the basic structure of a call to the ggplot() function. For an introduction to the ggplot package, you can check out the visualization and data structure lecture. When you call ggplot, you should provide a data source, usually a data frame or tibble. Afterward, you can build a ggplot by mapping different variables in your data source to different aesthetics. For example, there are colour, shape, size of points, the position of the x or y-axis, ans so on. To demonstrate this and even more procedures using ggplot, we will work with different data sets such as:

Growth of Orange Trees: the Orange data frame has 35 rows and 3 columns of records of the growth of orange trees.

glimpse(Orange)

## Rows: 35
## Columns: 3
## $ Tree          <ord> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ age           <dbl> 118, 484, 664, 1004, 1231, 1372, 1582, 118, 484, 664, 10…
## $ circumference <dbl> 30, 58, 87, 115, 120, 142, 145, 33, 69, 111, 156, 172, 2…

Storm tracks data: This data is a subset of the NOAA Atlantic hurricane database best track data, https://www.nhc.noaa.gov/data/#hurdat. The data includes the positions and attributes of 198 tropical storms, measured every six hours during the lifetime of a storm.

glimpse(storms)

## Rows: 10,010
## Columns: 13
## $ name        <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "A…
## $ year        <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975…
## $ month       <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7…
## $ day         <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30…
## $ hour        <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0,…
## $ lat         <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3, 34.0, 34.4, 34.0…
## $ long        <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7, -78.0, -77.0, -7…
## $ status      <chr> "tropical depression", "tropical depression", "tropical de…
## $ category    <ord> -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wind        <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 45, 50, 50, 55, 60…
## $ pressure    <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011, 1006, 1004, 1002…
## $ ts_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ hu_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Carbon Dioxide Uptake in Grass Plants: The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

glimpse(CO2)

## Rows: 84
## Columns: 5
## $ Plant     <ord> Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn2, Qn2, Qn2, Qn2, Qn2, …
## $ Type      <fct> Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Queb…
## $ Treatment <fct> nonchilled, nonchilled, nonchilled, nonchilled, nonchilled, …
## $ conc      <dbl> 95, 175, 250, 350, 500, 675, 1000, 95, 175, 250, 350, 500, 6…
## $ uptake    <dbl> 16.0, 30.4, 34.8, 37.2, 35.3, 39.2, 39.7, 13.6, 27.3, 37.1, …

Simulated cattle breeding program (cbp): the cbp data frame has 10,000 rows and 7 columns of data from a simulation using the AlphaSimR package. The data is comprised of founders and phenotyped individuals (on the Milk Yield), where for each one, we have information of parents, sex, and herd.

cbp <- readRDS("./data/animal_sim.RDS")
glimpse(cbp)

## Rows: 10,000
## Columns: 7
## $ ind       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ father    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mother    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ year      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ sex       <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, …
## $ phenotype <dbl> 37.66915, 35.79590, 28.42813, 33.62726, 32.86620, 31.63590, …
## $ herd      <fct> E, B, A, D, A, A, A, E, E, C, A, B, B, A, C, C, E, C, A, A, …

Mapping

p <- ggplot(data = cbp, aes(x = year, y = phenotype, color = sex))
p + layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE))

Practice

Use the data CO2 (Biochemical Oxygen Demand) to build a layer with geom = "point" using conc as the response variable, uptake as an explanatory variable, and colour the points by Treatment.

data(CO2)
pEx1 <- ggplot(data = , aes(x = , y = , color = ))
pEx1 + layer(geom = , stat = "identity", position = "identity", 
             params = list(shape=1, na.rm = FALSE))

Facets

# facet_wrap
p + layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  facet_wrap(facet = "sex")

# facet_grid
p + layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  facet_grid(rows = vars(sex), cols = vars(herd))

Practice

Using the previous exercise as a starting point, add a facet_wrap() using Plant as a subgroup.

pEx1 + layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
    facet_wrap()

Make a plot using the storms database considering ts_diameter as response, hu_diameter as explanatory variable, and add a facet_grid() using rows as status and columns as category.

ggplot(data = , aes(y= , x=)) +
  layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  facet_grid(rows = , cols = )

Scale

p2 <- p + layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  facet_grid(rows = vars(sex), cols = vars(herd)) +
  scale_colour_manual(values = c("#00AFBB", "#FC4E07"))
p2

p2 + scale_color_brewer(palette = "Dark2")

Practice

Make a plot using the storms database using ts_diameter as a response, pressure as an explanatory variable, colored by status. Also add a facet_wrap() by category and do a scale transformation on the x and y axis using a log10() function.

ggplot(data = storms, aes(y= , x= )) +
  layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  facet_wrap(...) +
  scale_y_*() +
  scale_x_*() +
  ylab(expression(log[10]("ts_diameter"))) +
  xlab(expression(log[10]("pressure")))

Statistics

# facet_wrap
data(Orange)
ggplot(data = Orange, aes(x = age, y = circumference)) +
  layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  layer(geom = "line", stat = "smooth", position = "identity",
        params = list(method = "lm", se = FALSE))

ggplot(data = cbp, aes(x = year, y = phenotype, group = year)) + 
  layer(geom = "boxplot", stat = "boxplot", position = "identity",
        params = list(outlier.colour="red", outlier.shape = 1,
                      na.rm = FALSE)) +
  facet_grid(rows = vars(sex), cols = vars(herd))

Practice

The main difference between the regression line and LOESS is that while the regression line is a straight line representing the relationship between the x and y, a LOESS line is mainly used to identify trends in the data. Faceting allows you to split the data into subgroups, build a plot adding points, a regression line, and a LOESS curve for each Tree level in the Orange data set. For this, you can use the formula circumference ~ age.

ggplot(data = Orange, aes(x = , y = )) +
  layer(geom = , stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE)) +
  layer(geom = , stat = , position = "identity",
        params = list(method = , se = FALSE)) +
  layer(geom = , stat = , position = "identity",
        params = list(method = , se = FALSE, colour = "red")) +
  facet_wrap(...)

Geometries

ggplot(data = Orange, aes(x = age, y = circumference)) +
  layer(geom = "point", stat = "identity", position = "identity", 
          params = list(shape=1, na.rm = FALSE, width = 0.1, height = 0.1)) +
  layer(geom = "line", stat = "smooth", position = "identity",
        params = list(method = "lm", se = FALSE)) +
  facet_wrap(facet = "Tree")

Examples

# Density plot
p <- ggplot(data = cbp)
p + geom_density(aes(x=phenotype))

# Density plot per year
p + geom_density_ridges(aes(x = phenotype, y = year, fill = year))

# Histogram - absolute frequency of storms per month and year
ggplot(data = storms, aes(x = year)) +
  facet_wrap(~month)+
  geom_histogram(fill = "lightblue", color = "black")+
  ylab("Absolute Frequency of Tropical Storms") +
  xlab("Year") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Histogram - relative frequency of storms per month
ggplot(data = storms, aes(x = year)) +
  geom_histogram(aes(y=stat(count)/sum(..count..)),
                 fill = "lightblue", color = "black")+
  scale_y_continuous(labels=scales::percent) +
  ylab("Relative Frequency of Tropical Storms") +
  xlab("Year") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Histogram - absolute frequency of storms per year (stack by month)
ggplot(data = storms, aes(x = year, fill = factor(month))) +
  geom_histogram(color = "black", position = "stack")+
  ylab("Absolute Frequency of Tropical Storms") +
  xlab("Year") +
  labs(fill="Month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Histogram - proportion of storms per month by year
ggplot(data = storms, aes(x = year, fill = factor(month))) +
  geom_histogram(color = "black", position = "fill")+
  ylab("Proportion of Tropical Storms per Month by Year") +
  xlab("Year") +
  labs(fill="Month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Exercise: build a histogram with relative frequency per month within year

# Ribbon and stat_summary - Plot of a time-series
ggplot(data = Orange, aes(x = age, y = circumference)) +
  stat_summary(fun = mean, geom = "ribbon",
               alpha = 0.3, fun.max = function(x) mean(x) + 2*sd(x),
               fun.min = function(x) mean(x) - 2*sd(x), fill = "#00AFBB") +
  stat_summary(fun = mean, geom ="line", colour = "black") +
  stat_summary(fun = mean, geom ="point", colour = "blue",
               size = 2) +
  geom_point(shape=1)

Practice

A violin plot is a method of plotting numeric data, and it is similar to a box plot but with a rotated kernel density plot on each side. Violin plots can show the probability density of the data at different values, usually smoothed by a kernel density estimator. We can also combine both stories into only one. In this exercise, your task is to build a violin plot adding a box plot internally using the cbp data. Also try to modify the arguments (like alpha, coef etc) to see how they change the final plot.

ggplot(data = cbp, aes(x = year, y = phenotype, group = year)) +
  geom_violin(aes(fill = ), size = 1, alpha = 0.5) +
  geom_boxplot(outlier.alpha = 0, coef =0, 
               colour = "black", width = 0.2)

Coordinates

# Cartesian coordinate
p <- ggplot(data = storms, aes(x = "", fill = factor(category))) +
  geom_bar(aes(y = stat(count)/sum(..count..))) +
  scale_y_continuous(labels=scales::percent) +
  labs(title = "Cartesian Coordinate", fill = "Category", y = "Proportion")
p

# Polar coordinate
p + coord_polar(theta="y") +
  labs(title = "Polar Coordinate")

Themes

p <- ggplot(data = Orange, aes(x = age, y = circumference)) +
  stat_summary(fun = mean, geom = "ribbon",
               alpha = 0.3, fun.max = max,
               fun.min = min, fill = "#00AFBB") +
  stat_summary(fun = mean, geom ="line", colour = "black") +
  stat_summary(fun = mean, geom ="point", colour = "blue",
               size = 2) +
  geom_point(shape=1)

p + theme_bw()

p + theme_classic()

p + theme_light()

p + theme_grey()

p + theme_minimal()

p + theme_void()

p + labs(title = "Modifing themes") +
  theme(axis.title = element_text(size = 13, colour = "blue", 
                                  family="serif",
                                  face = "bold"),
        axis.text = element_text(size=12, colour = "red"))

p + labs(title = "Modifing themes") +
  theme(text = element_text(size = 13, colour = "blue", 
                                  family="serif",
                                  face = "bold"))

p +
  theme(panel.grid = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(fill=NA, color="black", 
                                    size=0.5, linetype="dashed"),
        axis.line = element_line(colour = "darkblue", 
                      size = 1, linetype = "solid"),
        axis.ticks = element_line(color="black", size=1.2),
        panel.background = element_rect(fill = "white"),
        axis.title = element_text(size = 13, colour = "darkblue", 
                                  family="serif",
                                  face = "bold"),
        axis.text = element_text(size=12, colour = "red"))

p <- ggplot(data = cbp, aes(x = year, y = phenotype, color = sex)) +
  geom_point()

p + theme_classic() +
    theme(legend.position="top",
          legend.background = element_rect(fill=NULL,
                                 size=0.5, linetype="solid", 
                                 colour ="darkblue"))

Practice

Let we start with this initial ggplot object:

cbp$year2 <- as.numeric(cbp$year)
p <- ggplot(data = cbp, aes(x = year2, y = phenotype)) +
  geom_smooth(aes(linetype = herd, color = herd), 
              method = "loess", se = FALSE) +
  stat_summary(aes(group = herd,colour = herd),fun = mean, geom ="point",
               size = 1.3, shape = 1) +
  facet_wrap(~sex)
p

Now we have some tasks to make this plot looks more professional.

Add a new standard theme:

p2 <- p + theme_*() # suggestion to classic

Change the legend from right to top, increase text (title and axis) size to 13, increase the axis ticks to 1.3, increase legend key width to 1.3 cm, and include a legend key with fill = "white" and colour = "gray40":

p3 <- p2 +
  theme(legend.position = "top",
        axis.* = element_*(size = ),
        axis.* = element_*(size= ),
        axis.* = element_*(size = ),
        legend.* = element_*(fill = "white", colour = "gray40"),
        legend.key.* = unit( , "cm"))

Change the y label to Phenotype, x label to Year, and legend title from herd to Herd.

p3 + 
  labs(...)

Quick plot

# Using qplot
qplot( year, phenotype, data = cbp, facets = sex ~ herd)

# using ggplot
ggplot(data = cbp, aes(x = year, y = phenotype)) +
  geom_point() +
  facet_grid(rows = vars(sex), cols = vars(herd))

Annotate

# Example 1
ggplot(storms, aes(x = wind, y = pressure, colour = status)) +
  geom_point(shape=1) + 
  annotate("rect", xmin = 63, xmax = 161, ymin = 880, ymax = 1010,
  alpha = .2) +
  annotate("text", x = 110, y = 888, label = "Gilbert (1988)") +
  annotate("text", x = 110, y = 882, label = "Wilma (2005)") +
  geom_segment(aes(x = 135, y = 888, xend = 159, yend = 888),
               arrow = arrow(length = unit(0.2, "cm")),
               show.legend = FALSE)+
  geom_segment(aes(x = 135, y = 882, xend = 159, yend = 882),
               arrow = arrow(length = unit(0.2, "cm")),
               show.legend = FALSE)

# Example 2
colnames(Orange)[2:3] <- c("x","y")
p1 <- ggplot(data = Orange, aes(x = x, y = y)) +
  geom_point(shape=1) +
  geom_smooth(method = "lm", se = FALSE) +
    ggpmisc::stat_poly_eq(
    aes(label =  paste(after_stat(eq.label),
                 after_stat(adj.rr.label),
                 sep = "*\", \"*")),
    formula = y ~ poly(x, 1, raw = TRUE))
p1

# Using the package ggpmisc with facets
p2 <- p1 + facet_wrap(~Tree)

tag_facet(p2, 
          x = 1300, y = -Inf, 
          vjust = -1,
          open = "", close = ")",
          tag_pool = LETTERS)

# Example 3
ex1 <- data.frame(y = 1:3, family = c("sans", "serif", "Roboto"))
ggplot(data=ex1) + 
  geom_text(aes(x=1, y=y, label = family, 
                colour = family, family = family),
            show.legend = FALSE, size = 6)+ 
  geom_label(aes(x=0, y=y, label = family, family = family,
             fontface = c(1:3)))+
  xlim(c(-0.5,1.5)) +
  theme_classic(base_size = 12)

Practice

Use the function lmNote to add the regression equation and \(R^2\) to the plot using a label geometry.

# Function to extract coefficients lm(y~x)
lmNote <- function(y, x, digits=2) {
  p <- lm(y~x)
  z <- list(beta0 = format(as.numeric(coef(p)[1]), digits = digits),
            beta1 = format(abs(as.numeric(coef(p)[2])), digits = digits),
            r2 = format(summary(p)$r.squared, digits = digits));
  if (coef(p)[2] >= 0) {
    tmp <- substitute(hat(y) == beta0 + beta1 %.% x*","~~R^2~"="~r2,z)
  } else {
    tmp <- substitute(hat(y) == beta0 - beta1 %.% x*","~~R^2~"="~r2,z)  
  }
  as.character(as.expression(tmp))
}

data("Orange")
ggplot(data = Orange, aes(x = age, y = circumference)) +
  geom_point(shape=1) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_*(x=400, y=200, label = lmNote(y = Orange$, x = Orange$),
            parse = TRUE)

Plot Composition

p1 <- ggplot(data = storms, aes(x = year, fill = factor(month))) +
  geom_histogram(color = "black", position = "fill")+
  ylab("Proportion of Tropical Storms per Month by Year") +
  xlab("Year") +
  labs(fill="Month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

p2 <- ggplot(data = storms, aes(x = "", fill = factor(status):factor(category))) +
  geom_bar(aes(y = stat(count)/sum(..count..))) +
  scale_y_continuous(labels=scales::percent) +
  labs(title = "Polar Coordinate", fill = "Category", y = "Proportion") + 
  coord_polar(theta="y")

p3 <- ggplot(data = storms, aes(x = year)) +
  geom_histogram(aes(y=stat(count)/sum(..count..)),
                 fill = "lightblue", color = "black")+
  scale_y_continuous(labels=scales::percent) +
  ylab("Relative Frequency of Tropical Storms") +
  xlab("Year") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

(p3 | p2) +
  plot_annotation(tag_levels = "A")

p1|(p3/p2)

data("Orange")
p1 <- ggplot(data = Orange, aes(x = age, y = circumference)) +
  stat_summary(fun = mean, geom = "ribbon",
               alpha = 0.3, fun.max = max,
               fun.min = min, fill = "#00AFBB") +
  stat_summary(fun = mean, geom ="line", colour = "black") +
  stat_summary(fun = mean, geom ="point", colour = "blue",
               size = 2) +
  geom_point(shape=1)

OrangeSummary <- Orange %>%
  group_by(age) %>%
  summarise(
    Age = unique(age),
    Mean = mean(circumference),
    Median = median(circumference),
    Sd = sd(circumference)
  ) %>%
  round(1)

p1 | gridExtra::tableGrob(OrangeSummary)

text <- paste("The Orange data frame has 35 rows and 3 columns", 
              "of records of the growth of orange trees.", sep = "\n")
p4 <- wrap_elements(ggpubr::text_grob(text, face="bold",color = "blue"))

p4/( p1 | gridExtra::tableGrob(OrangeSummary))

Thiago de Paula Oliveira

Consultant Statistician

My research interests include statistical modelling, agriculture, genetics, and sports.