Tidyverse: answers
Principles of graph grammar and tidy data: tidyverse exercises
Tidy data
Tidy data refers to data arranged to make data processing, analysis, and visualization simpler. Remember that in a tidy data set we should consider:
- Each variable must have its column.
- Each observation must have its row.
- Each value must have its cell.
Exercise 1
Let’s say we want to organize the data anscombe
. Below I shpw how this data looks like:
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
- Organize this data set to obtain tidy data. Remember here we have two response variables been measured four times.
Most of the selecting, separating, mutating and renaming is taking place within the pivot function calls.
<- anscombe %>%
(ex1 pivot_longer(everything(),
names_to = c(".value", "rep"),
names_pattern = "(.)([0-9])"
))
## # A tibble: 44 × 3
## rep x y
## <chr> <dbl> <dbl>
## 1 1 10 8.04
## 2 2 10 9.14
## 3 3 10 7.46
## 4 4 8 6.58
## 5 1 8 6.95
## 6 2 8 8.14
## 7 3 8 6.77
## 8 4 8 5.76
## 9 1 13 7.58
## 10 2 13 8.74
## # … with 34 more rows
- Filter the data set to get replications 2 and 4, and summarise it to get the maximum, minimum, and mean values.
%>%
ex1 filter(rep %in% c(2,4)) %>%
summarise(
across(c(x,y), list(mean = mean, min = min, max = max),
.names = "{.col}.{.fn}"
))
## # A tibble: 1 × 6
## x.mean x.min x.max y.mean y.min y.max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 4 19 7.50 3.1 12.5
Exercise 2
Often you do not need the entire data set, but just part of it.
- Here, you should make the data
mtcars
tidy before making any selection.
<- readRDS("./data/dataEx3.rds")) (dataEx3
## # A tibble: 32 × 12
## cyl disp hp drat wt qsec vs am carb `4` `3` `5`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 160 110 3.9 2.62 16.5 0 1 4 21 NA NA
## 2 6 160 110 3.9 2.88 17.0 0 1 4 21 NA NA
## 3 4 108 93 3.85 2.32 18.6 1 1 1 22.8 NA NA
## 4 6 258 110 3.08 3.22 19.4 1 0 1 NA 21.4 NA
## 5 8 360 175 3.15 3.44 17.0 0 0 2 NA 18.7 NA
## 6 6 225 105 2.76 3.46 20.2 1 0 1 NA 18.1 NA
## 7 8 360 245 3.21 3.57 15.8 0 0 4 NA 14.3 NA
## 8 4 147. 62 3.69 3.19 20 1 0 2 24.4 NA NA
## 9 4 141. 95 3.92 3.15 22.9 1 0 2 22.8 NA NA
## 10 6 168. 123 3.92 3.44 18.3 1 0 4 19.2 NA NA
## # … with 22 more rows
As you can see, some columns are not variable names but values. Create two new variables calling mpg (for observations) and gear (with column values).
<- dataEx3 %>%
dataEx3 pivot_longer(
cols = matches("([1-9])"),
names_to = "gear",
values_to = "mpg",
values_drop_na = TRUE
)
- Select the columns mpg, hp, gear, and carb, and then make a plot using ggplot2 where
mpg
is the response variable, andhp
is the co-variate in the x-axis. Also include different shapes and colours forgear
, and facets forcarb
.
%>%
dataEx3 select(mpg, hp, gear, carb) %>%
ggplot(aes(y=mpg, x = hp, shape = gear,
colour = gear)) +
geom_point() +
facet_wrap(~carb) +
theme_bw()
Exercise 3
The following data represents song rankings for Billboard top 100 in the year 2000. The rank of the song is displayed in each week after it entered.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby D… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The Ha… 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors… Krypto… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobble… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give M… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Dancin… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Don'… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try Ag… 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, … Open M… 2000-08-26 76 76 74 69 68 67 61 58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …
A slightly more complex case where columns have a common prefix and missing missings are structural, so should be dropped. So, make this data tidy.
%>%
billboard pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
Data Structure
Exercise 1
- Make this data tidy by including
tmin
andtmax
as variable. Remember that here type is carrying to variables names rather than factors.
<- as_tibble(readRDS("./data/dataEx2.RDS"))) (dataEx2
## # A tibble: 20 × 4
## id date type value
## <chr> <date> <fct> <dbl>
## 1 Ind 1 2020-01-01 tmin 35.8
## 2 Ind 1 2020-01-01 tmax 36.7
## 3 Ind 1 2020-01-02 tmin 35.9
## 4 Ind 1 2020-01-02 tmax 36.8
## 5 Ind 1 2020-01-03 tmin 36.0
## 6 Ind 1 2020-01-03 tmax 36.8
## 7 Ind 1 2020-01-04 tmin 36.1
## 8 Ind 1 2020-01-04 tmax 36.9
## 9 Ind 1 2020-01-05 tmin 36.1
## 10 Ind 1 2020-01-05 tmax 36.9
## 11 Ind 1 2020-01-06 tmin 36.2
## 12 Ind 1 2020-01-06 tmax 37.0
## 13 Ind 1 2020-01-07 tmin 36.3
## 14 Ind 1 2020-01-07 tmax 37.1
## 15 Ind 1 2020-01-08 tmin 36.3
## 16 Ind 1 2020-01-08 tmax 37.2
## 17 Ind 1 2020-01-09 tmin 36.6
## 18 Ind 1 2020-01-09 tmax 37.3
## 19 Ind 1 2020-01-10 tmin 36.6
## 20 Ind 1 2020-01-10 tmax 38.0
<- dataEx2 %>%
(dataEx2 pivot_wider(values_from = value, names_from = type))
## # A tibble: 10 × 4
## id date tmin tmax
## <chr> <date> <dbl> <dbl>
## 1 Ind 1 2020-01-01 35.8 36.7
## 2 Ind 1 2020-01-02 35.9 36.8
## 3 Ind 1 2020-01-03 36.0 36.8
## 4 Ind 1 2020-01-04 36.1 36.9
## 5 Ind 1 2020-01-05 36.1 36.9
## 6 Ind 1 2020-01-06 36.2 37.0
## 7 Ind 1 2020-01-07 36.3 37.1
## 8 Ind 1 2020-01-08 36.3 37.2
## 9 Ind 1 2020-01-09 36.6 37.3
## 10 Ind 1 2020-01-10 36.6 38.0
Now, build a new variable called tdiff
, which is the difference between tmax
and tmin
. Moreover, display a ggplot2
graph that shows tdiff
over time.
%>%
dataEx2 ::mutate(tdiff = tmax - tmin) %>%
dplyrggplot(aes(y = tdiff, x = date)) +
geom_point() +
geom_smooth(se = FALSE) +
ylab("tmax - tmin") + xlab ("Date") +
theme_classic()
Exercise 2
Our cattle data data is already in a tidy format.
<- readRDS("./data/animal_sim.RDS")) (cbp
## # A tibble: 10,000 × 7
## ind father mother year sex phenotype herd
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct>
## 1 1 NA NA 0 M 37.7 E
## 2 2 NA NA 0 M 35.8 B
## 3 3 NA NA 0 M 28.4 A
## 4 4 NA NA 0 M 33.6 D
## 5 5 NA NA 0 M 32.9 A
## 6 6 NA NA 0 M 31.6 A
## 7 7 NA NA 0 M 38.8 A
## 8 8 NA NA 0 M 33.3 E
## 9 9 NA NA 0 M 39.0 E
## 10 10 NA NA 0 M 46.0 C
## # … with 9,990 more rows
For this exercise, complete the following tasks with that data set:
- Calculate the average phenotype per year by sex and herd using the
summarise()
function in the dplyr package. - Add two columns to cattle data using the
mutate()
function:- Column 1: Phenotype should be rescaled to have a mean of zero and a standard deviation of one. You can call this new variable as
PhenoStd
. - Column 2: Rank the
PhenoStd
using the functionmin_rank()
. - The output data frame should have only
PhenoStd > 0
.
- Column 1: Phenotype should be rescaled to have a mean of zero and a standard deviation of one. You can call this new variable as
%>%
cbp group_by(year, sex, herd) %>%
summarise(
mean = mean(phenotype)
)
## # A tibble: 100 × 4
## # Groups: year, sex [20]
## year sex herd mean
## <fct> <fct> <fct> <dbl>
## 1 0 F A 40.3
## 2 0 F B 40.6
## 3 0 F C 40.3
## 4 0 F D 39.1
## 5 0 F E 40.1
## 6 0 M A 41.0
## 7 0 M B 40.9
## 8 0 M C 40.7
## 9 0 M D 39.9
## 10 0 M E 40.4
## # … with 90 more rows
<- function(x){
.scale - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
(x
}
%>%
cbp ::mutate(PhenoStd = .scale(phenotype)) %>%
dplyr::mutate(RankPhenoStd = min_rank(PhenoStd)) %>%
dplyrfilter(PhenoStd > 0)
## # A tibble: 5,057 × 9
## ind father mother year sex phenotype herd PhenoStd RankPhenoStd
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct> <dbl> <int>
## 1 11 NA NA 0 M 51.0 A 0.103 5338
## 2 23 NA NA 0 M 51.9 E 0.191 5661
## 3 29 NA NA 0 M 53.5 C 0.355 6258
## 4 34 NA NA 0 M 50.1 D 0.00987 4981
## 5 61 NA NA 0 M 53.6 C 0.366 6306
## 6 65 NA NA 0 M 55.2 A 0.529 6883
## 7 69 NA NA 0 M 51.3 C 0.135 5461
## 8 84 NA NA 0 M 52.0 A 0.207 5726
## 9 90 NA NA 0 M 50.5 B 0.0506 5144
## 10 106 NA NA 0 M 52.1 C 0.209 5731
## # … with 5,047 more rows
Exercise 3
Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country. This data has 142 countries observed from the year 1952 to 2007 in increments of 5 years. The response variable observed was the life expectancy at birth (in years), population size, and Per capita gross domestic product (GDP).
Per capita gross domestic product (GDP) measures a country’s economic response per person and is calculated by dividing its GDP by its population. It is a global measure for gauging the prosperity of nations as we can analyze the worth of a country based on its economic growth. Thus, countries that have the highest per capita GDP tend to be more developed.
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
Questions:
- What are the ten highest
gdpPercap
values?
%>%
gapminder slice_max(gdpPercap, n = 10)
## # A tibble: 10 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.0 212846 113523.
## 2 Kuwait Asia 1972 67.7 841934 109348.
## 3 Kuwait Asia 1952 55.6 160000 108382.
## 4 Kuwait Asia 1962 60.5 358266 95458.
## 5 Kuwait Asia 1967 64.6 575003 80895.
## 6 Kuwait Asia 1977 69.3 1140357 59265.
## 7 Norway Europe 2007 80.2 4627926 49357.
## 8 Kuwait Asia 2007 77.6 2505559 47307.
## 9 Singapore Asia 2007 80.0 4553009 47143.
## 10 Norway Europe 2002 79.0 4535591 44684.
- Find both the median life expectancy (
lifeExp
) and the median and maximum GDP per capita (gdpPercap
) in 1957, 1982, and 2007, by country and continent. Call themmedianLifeExp
,medianGdpPercap
, andmaxGdpPercap
, respectively.
<- gapminder %>%
(dat filter(year %in% c("1957","1982","2007")) %>%
group_by(year, country, continent) %>%
summarise(
medianlifeExp = median(lifeExp),
medianGdpPercap = median(gdpPercap),
maxGdpPercap = max(gdpPercap)
))
## # A tibble: 426 × 6
## # Groups: year, country [426]
## year country continent medianlifeExp medianGdpPercap maxGdpPercap
## <int> <fct> <fct> <dbl> <dbl> <dbl>
## 1 1957 Afghanistan Asia 30.3 821. 821.
## 2 1957 Albania Europe 59.3 1942. 1942.
## 3 1957 Algeria Africa 45.7 3014. 3014.
## 4 1957 Angola Africa 32.0 3828. 3828.
## 5 1957 Argentina Americas 64.4 6857. 6857.
## 6 1957 Australia Oceania 70.3 10950. 10950.
## 7 1957 Austria Europe 67.5 8843. 8843.
## 8 1957 Bahrain Asia 53.8 11636. 11636.
## 9 1957 Bangladesh Asia 39.3 662. 662.
## 10 1957 Belgium Europe 69.2 9715. 9715.
## # … with 416 more rows
- Use a scatter plot to compare the median GDP and median life expectancy. Use the variables continent and year to produce this plot.
%>%
dat ggplot(aes(x = medianGdpPercap, y = medianlifeExp)) +
facet_wrap(~ continent) +
geom_point(shape = 1)