The grammar of graphics: R code
Principles of graph grammar and tidy data
If you’re at all familiar with ggplot
, you may know the basic structure of a call to the ggplot()
function. For an introduction to the ggplot
package, you can check out the visualization and data structure lecture. When you call ggplot, you should provide a data source, usually a data frame or tibble
. Afterward, you can build a ggplot
by mapping different variables in your data source to different aesthetics. For example, there are colour
, shape
, size
of points, the position
of the x or y-axis, ans so on. To demonstrate this and even more procedures using ggplot
, we will work with different data sets such as:
- Growth of Orange Trees: the Orange data frame has 35 rows and 3 columns of records of the growth of orange trees.
glimpse(Orange)
## Rows: 35
## Columns: 3
## $ Tree <ord> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ age <dbl> 118, 484, 664, 1004, 1231, 1372, 1582, 118, 484, 664, 10…
## $ circumference <dbl> 30, 58, 87, 115, 120, 142, 145, 33, 69, 111, 156, 172, 2…
- Storm tracks data: This data is a subset of the NOAA Atlantic hurricane database best track data, https://www.nhc.noaa.gov/data/#hurdat. The data includes the positions and attributes of 198 tropical storms, measured every six hours during the lifetime of a storm.
glimpse(storms)
## Rows: 10,010
## Columns: 13
## $ name <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "A…
## $ year <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975…
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7…
## $ day <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30…
## $ hour <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0,…
## $ lat <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3, 34.0, 34.4, 34.0…
## $ long <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7, -78.0, -77.0, -7…
## $ status <chr> "tropical depression", "tropical depression", "tropical de…
## $ category <ord> -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wind <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 45, 50, 50, 55, 60…
## $ pressure <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011, 1006, 1004, 1002…
## $ ts_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ hu_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
- Carbon Dioxide Uptake in Grass Plants: The
CO2
data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.
glimpse(CO2)
## Rows: 84
## Columns: 5
## $ Plant <ord> Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn2, Qn2, Qn2, Qn2, Qn2, …
## $ Type <fct> Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Queb…
## $ Treatment <fct> nonchilled, nonchilled, nonchilled, nonchilled, nonchilled, …
## $ conc <dbl> 95, 175, 250, 350, 500, 675, 1000, 95, 175, 250, 350, 500, 6…
## $ uptake <dbl> 16.0, 30.4, 34.8, 37.2, 35.3, 39.2, 39.7, 13.6, 27.3, 37.1, …
- Simulated cattle breeding program (cbp): the
cbp
data frame has 10,000 rows and 7 columns of data from a simulation using theAlphaSimR
package. The data is comprised of founders and phenotyped individuals (on the Milk Yield), where for each one, we have information of parents, sex, and herd.
<- readRDS("./data/animal_sim.RDS")
cbp glimpse(cbp)
## Rows: 10,000
## Columns: 7
## $ ind <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ father <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mother <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ year <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ sex <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, …
## $ phenotype <dbl> 37.66915, 35.79590, 28.42813, 33.62726, 32.86620, 31.63590, …
## $ herd <fct> E, B, A, D, A, A, A, E, E, C, A, B, B, A, C, C, E, C, A, A, …
Mapping
<- ggplot(data = cbp, aes(x = year, y = phenotype, color = sex))
p + layer(geom = "point", stat = "identity", position = "identity",
p params = list(shape=1, na.rm = FALSE))
Practice
Use the data CO2 (Biochemical Oxygen Demand) to build a layer with geom = "point"
using conc
as the response variable, uptake
as an explanatory variable, and colour the points by Treatment
.
data(CO2)
<- ggplot(data = , aes(x = , y = , color = ))
pEx1 + layer(geom = , stat = "identity", position = "identity",
pEx1 params = list(shape=1, na.rm = FALSE))
Facets
# facet_wrap
+ layer(geom = "point", stat = "identity", position = "identity",
p params = list(shape=1, na.rm = FALSE)) +
facet_wrap(facet = "sex")
# facet_grid
+ layer(geom = "point", stat = "identity", position = "identity",
p params = list(shape=1, na.rm = FALSE)) +
facet_grid(rows = vars(sex), cols = vars(herd))
Practice
- Using the previous exercise as a starting point, add a
facet_wrap()
using Plant as a subgroup.
+ layer(geom = "point", stat = "identity", position = "identity",
pEx1 params = list(shape=1, na.rm = FALSE)) +
facet_wrap()
- Make a plot using the
storms
database consideringts_diameter
as response,hu_diameter
as explanatory variable, and add afacet_grid()
using rows asstatus
and columns ascategory
.
ggplot(data = , aes(y= , x=)) +
layer(geom = "point", stat = "identity", position = "identity",
params = list(shape=1, na.rm = FALSE)) +
facet_grid(rows = , cols = )
Scale
<- p + layer(geom = "point", stat = "identity", position = "identity",
p2 params = list(shape=1, na.rm = FALSE)) +
facet_grid(rows = vars(sex), cols = vars(herd)) +
scale_colour_manual(values = c("#00AFBB", "#FC4E07"))
p2
+ scale_color_brewer(palette = "Dark2") p2
Practice
Make a plot using the storms
database using ts_diameter
as a response, pressure
as an explanatory variable, colored by status
. Also add a facet_wrap()
by category
and do a scale transformation on the x
and y
axis using a log10()
function.
ggplot(data = storms, aes(y= , x= )) +
layer(geom = "point", stat = "identity", position = "identity",
params = list(shape=1, na.rm = FALSE)) +
facet_wrap(...) +
*() +
scale_y_*() +
scale_x_ylab(expression(log[10]("ts_diameter"))) +
xlab(expression(log[10]("pressure")))
Statistics
# facet_wrap
data(Orange)
ggplot(data = Orange, aes(x = age, y = circumference)) +
layer(geom = "point", stat = "identity", position = "identity",
params = list(shape=1, na.rm = FALSE)) +
layer(geom = "line", stat = "smooth", position = "identity",
params = list(method = "lm", se = FALSE))
ggplot(data = cbp, aes(x = year, y = phenotype, group = year)) +
layer(geom = "boxplot", stat = "boxplot", position = "identity",
params = list(outlier.colour="red", outlier.shape = 1,
na.rm = FALSE)) +
facet_grid(rows = vars(sex), cols = vars(herd))
Practice
The main difference between the regression line and LOESS is that while the regression line is a straight line representing the relationship between the x and y, a LOESS line is mainly used to identify trends in the data. Faceting allows you to split the data into subgroups, build a plot adding points, a regression line, and a LOESS curve for each Tree
level in the Orange data set. For this, you can use the formula circumference ~ age
.
ggplot(data = Orange, aes(x = , y = )) +
layer(geom = , stat = "identity", position = "identity",
params = list(shape=1, na.rm = FALSE)) +
layer(geom = , stat = , position = "identity",
params = list(method = , se = FALSE)) +
layer(geom = , stat = , position = "identity",
params = list(method = , se = FALSE, colour = "red")) +
facet_wrap(...)
Geometries
ggplot(data = Orange, aes(x = age, y = circumference)) +
layer(geom = "point", stat = "identity", position = "identity",
params = list(shape=1, na.rm = FALSE, width = 0.1, height = 0.1)) +
layer(geom = "line", stat = "smooth", position = "identity",
params = list(method = "lm", se = FALSE)) +
facet_wrap(facet = "Tree")
Examples
# Density plot
<- ggplot(data = cbp)
p + geom_density(aes(x=phenotype)) p
# Density plot per year
+ geom_density_ridges(aes(x = phenotype, y = year, fill = year)) p
# Histogram - absolute frequency of storms per month and year
ggplot(data = storms, aes(x = year)) +
facet_wrap(~month)+
geom_histogram(fill = "lightblue", color = "black")+
ylab("Absolute Frequency of Tropical Storms") +
xlab("Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Histogram - relative frequency of storms per month
ggplot(data = storms, aes(x = year)) +
geom_histogram(aes(y=stat(count)/sum(..count..)),
fill = "lightblue", color = "black")+
scale_y_continuous(labels=scales::percent) +
ylab("Relative Frequency of Tropical Storms") +
xlab("Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Histogram - absolute frequency of storms per year (stack by month)
ggplot(data = storms, aes(x = year, fill = factor(month))) +
geom_histogram(color = "black", position = "stack")+
ylab("Absolute Frequency of Tropical Storms") +
xlab("Year") +
labs(fill="Month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Histogram - proportion of storms per month by year
ggplot(data = storms, aes(x = year, fill = factor(month))) +
geom_histogram(color = "black", position = "fill")+
ylab("Proportion of Tropical Storms per Month by Year") +
xlab("Year") +
labs(fill="Month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Exercise: build a histogram with relative frequency per month within year
# Ribbon and stat_summary - Plot of a time-series
ggplot(data = Orange, aes(x = age, y = circumference)) +
stat_summary(fun = mean, geom = "ribbon",
alpha = 0.3, fun.max = function(x) mean(x) + 2*sd(x),
fun.min = function(x) mean(x) - 2*sd(x), fill = "#00AFBB") +
stat_summary(fun = mean, geom ="line", colour = "black") +
stat_summary(fun = mean, geom ="point", colour = "blue",
size = 2) +
geom_point(shape=1)
Practice
A violin plot is a method of plotting numeric data, and it is similar to a box plot but with a rotated kernel density plot on each side. Violin plots can show the probability density of the data at different values, usually smoothed by a kernel density estimator. We can also combine both stories into only one. In this exercise, your task is to build a violin plot adding a box plot internally using the cbp data. Also try to modify the arguments (like alpha
, coef
etc) to see how they change the final plot.
ggplot(data = cbp, aes(x = year, y = phenotype, group = year)) +
geom_violin(aes(fill = ), size = 1, alpha = 0.5) +
geom_boxplot(outlier.alpha = 0, coef =0,
colour = "black", width = 0.2)
Coordinates
# Cartesian coordinate
<- ggplot(data = storms, aes(x = "", fill = factor(category))) +
p geom_bar(aes(y = stat(count)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
labs(title = "Cartesian Coordinate", fill = "Category", y = "Proportion")
p
# Polar coordinate
+ coord_polar(theta="y") +
p labs(title = "Polar Coordinate")
Themes
<- ggplot(data = Orange, aes(x = age, y = circumference)) +
p stat_summary(fun = mean, geom = "ribbon",
alpha = 0.3, fun.max = max,
fun.min = min, fill = "#00AFBB") +
stat_summary(fun = mean, geom ="line", colour = "black") +
stat_summary(fun = mean, geom ="point", colour = "blue",
size = 2) +
geom_point(shape=1)
+ theme_bw() p
+ theme_classic() p
+ theme_light() p
+ theme_grey() p
+ theme_minimal() p
+ theme_void() p
+ labs(title = "Modifing themes") +
p theme(axis.title = element_text(size = 13, colour = "blue",
family="serif",
face = "bold"),
axis.text = element_text(size=12, colour = "red"))
+ labs(title = "Modifing themes") +
p theme(text = element_text(size = 13, colour = "blue",
family="serif",
face = "bold"))
+
p theme(panel.grid = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(fill=NA, color="black",
size=0.5, linetype="dashed"),
axis.line = element_line(colour = "darkblue",
size = 1, linetype = "solid"),
axis.ticks = element_line(color="black", size=1.2),
panel.background = element_rect(fill = "white"),
axis.title = element_text(size = 13, colour = "darkblue",
family="serif",
face = "bold"),
axis.text = element_text(size=12, colour = "red"))
<- ggplot(data = cbp, aes(x = year, y = phenotype, color = sex)) +
p geom_point()
+ theme_classic() +
p theme(legend.position="top",
legend.background = element_rect(fill=NULL,
size=0.5, linetype="solid",
colour ="darkblue"))
Practice
Let we start with this initial ggplot
object:
$year2 <- as.numeric(cbp$year)
cbp<- ggplot(data = cbp, aes(x = year2, y = phenotype)) +
p geom_smooth(aes(linetype = herd, color = herd),
method = "loess", se = FALSE) +
stat_summary(aes(group = herd,colour = herd),fun = mean, geom ="point",
size = 1.3, shape = 1) +
facet_wrap(~sex)
p
Now we have some tasks to make this plot looks more professional.
- Add a new standard theme:
<- p + theme_*() # suggestion to classic p2
- Change the legend from right to top, increase text (title and axis) size to
13
, increase the axis ticks to 1.3, increase legend key width to 1.3 cm, and include a legend key withfill = "white"
andcolour = "gray40"
:
<- p2 +
p3 theme(legend.position = "top",
* = element_*(size = ),
axis.* = element_*(size= ),
axis.* = element_*(size = ),
axis.* = element_*(fill = "white", colour = "gray40"),
legend.* = unit( , "cm")) legend.key.
- Change the y label to
Phenotype
, x label toYear
, and legend title from herd to Herd.
+
p3 labs(...)
Quick plot
# Using qplot
qplot( year, phenotype, data = cbp, facets = sex ~ herd)
# using ggplot
ggplot(data = cbp, aes(x = year, y = phenotype)) +
geom_point() +
facet_grid(rows = vars(sex), cols = vars(herd))
Annotate
# Example 1
ggplot(storms, aes(x = wind, y = pressure, colour = status)) +
geom_point(shape=1) +
annotate("rect", xmin = 63, xmax = 161, ymin = 880, ymax = 1010,
alpha = .2) +
annotate("text", x = 110, y = 888, label = "Gilbert (1988)") +
annotate("text", x = 110, y = 882, label = "Wilma (2005)") +
geom_segment(aes(x = 135, y = 888, xend = 159, yend = 888),
arrow = arrow(length = unit(0.2, "cm")),
show.legend = FALSE)+
geom_segment(aes(x = 135, y = 882, xend = 159, yend = 882),
arrow = arrow(length = unit(0.2, "cm")),
show.legend = FALSE)
# Example 2
colnames(Orange)[2:3] <- c("x","y")
<- ggplot(data = Orange, aes(x = x, y = y)) +
p1 geom_point(shape=1) +
geom_smooth(method = "lm", se = FALSE) +
::stat_poly_eq(
ggpmiscaes(label = paste(after_stat(eq.label),
after_stat(adj.rr.label),
sep = "*\", \"*")),
formula = y ~ poly(x, 1, raw = TRUE))
p1
# Using the package ggpmisc with facets
<- p1 + facet_wrap(~Tree)
p2
tag_facet(p2,
x = 1300, y = -Inf,
vjust = -1,
open = "", close = ")",
tag_pool = LETTERS)
# Example 3
<- data.frame(y = 1:3, family = c("sans", "serif", "Roboto"))
ex1 ggplot(data=ex1) +
geom_text(aes(x=1, y=y, label = family,
colour = family, family = family),
show.legend = FALSE, size = 6)+
geom_label(aes(x=0, y=y, label = family, family = family,
fontface = c(1:3)))+
xlim(c(-0.5,1.5)) +
theme_classic(base_size = 12)
Practice
Use the function lmNote
to add the regression equation and \(R^2\) to the plot using a label geometry.
# Function to extract coefficients lm(y~x)
<- function(y, x, digits=2) {
lmNote <- lm(y~x)
p <- list(beta0 = format(as.numeric(coef(p)[1]), digits = digits),
z beta1 = format(abs(as.numeric(coef(p)[2])), digits = digits),
r2 = format(summary(p)$r.squared, digits = digits));
if (coef(p)[2] >= 0) {
<- substitute(hat(y) == beta0 + beta1 %.% x*","~~R^2~"="~r2,z)
tmp else {
} <- substitute(hat(y) == beta0 - beta1 %.% x*","~~R^2~"="~r2,z)
tmp
}as.character(as.expression(tmp))
}
data("Orange")
ggplot(data = Orange, aes(x = age, y = circumference)) +
geom_point(shape=1) +
geom_smooth(method = "lm", se = FALSE) +
*(x=400, y=200, label = lmNote(y = Orange$, x = Orange$),
geom_parse = TRUE)
Plot Composition
<- ggplot(data = storms, aes(x = year, fill = factor(month))) +
p1 geom_histogram(color = "black", position = "fill")+
ylab("Proportion of Tropical Storms per Month by Year") +
xlab("Year") +
labs(fill="Month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
<- ggplot(data = storms, aes(x = "", fill = factor(status):factor(category))) +
p2 geom_bar(aes(y = stat(count)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
labs(title = "Polar Coordinate", fill = "Category", y = "Proportion") +
coord_polar(theta="y")
<- ggplot(data = storms, aes(x = year)) +
p3 geom_histogram(aes(y=stat(count)/sum(..count..)),
fill = "lightblue", color = "black")+
scale_y_continuous(labels=scales::percent) +
ylab("Relative Frequency of Tropical Storms") +
xlab("Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
| p2) +
(p3 plot_annotation(tag_levels = "A")
|(p3/p2) p1
data("Orange")
<- ggplot(data = Orange, aes(x = age, y = circumference)) +
p1 stat_summary(fun = mean, geom = "ribbon",
alpha = 0.3, fun.max = max,
fun.min = min, fill = "#00AFBB") +
stat_summary(fun = mean, geom ="line", colour = "black") +
stat_summary(fun = mean, geom ="point", colour = "blue",
size = 2) +
geom_point(shape=1)
<- Orange %>%
OrangeSummary group_by(age) %>%
summarise(
Age = unique(age),
Mean = mean(circumference),
Median = median(circumference),
Sd = sd(circumference)
%>%
) round(1)
| gridExtra::tableGrob(OrangeSummary) p1
<- paste("The Orange data frame has 35 rows and 3 columns",
text "of records of the growth of orange trees.", sep = "\n")
<- wrap_elements(ggpubr::text_grob(text, face="bold",color = "blue"))
p4
/( p1 | gridExtra::tableGrob(OrangeSummary)) p4