Tidyverse: answers

Principles of graph grammar and tidy data: tidyverse exercises

Dec 29, 2021

Tidy data
- Exercise 1
- Exercise 2
Exercise 3
Data Structure

logo

Tidy data

Tidy data refers to data arranged to make data processing, analysis, and visualization simpler. Remember that in a tidy data set we should consider:

Each variable must have its column.
Each observation must have its row.
Each value must have its cell.

Exercise 1

Let’s say we want to organize the data anscombe. Below I shpw how this data looks like:

anscombe

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Organize this data set to obtain tidy data. Remember here we have two response variables been measured four times.

Most of the selecting, separating, mutating and renaming is taking place within the pivot function calls.

(ex1 <- anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "rep"),
               names_pattern = "(.)([0-9])"
               ))

## # A tibble: 44 × 3
##    rep       x     y
##    <chr> <dbl> <dbl>
##  1 1        10  8.04
##  2 2        10  9.14
##  3 3        10  7.46
##  4 4         8  6.58
##  5 1         8  6.95
##  6 2         8  8.14
##  7 3         8  6.77
##  8 4         8  5.76
##  9 1        13  7.58
## 10 2        13  8.74
## # … with 34 more rows

Filter the data set to get replications 2 and 4, and summarise it to get the maximum, minimum, and mean values.

ex1 %>%
  filter(rep %in% c(2,4)) %>%
  summarise(
    across(c(x,y), list(mean = mean, min = min, max = max),
           .names = "{.col}.{.fn}"
  ))

## # A tibble: 1 × 6
##   x.mean x.min x.max y.mean y.min y.max
##    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1      9     4    19   7.50   3.1  12.5

Exercise 2

Often you do not need the entire data set, but just part of it.

Here, you should make the data mtcars tidy before making any selection.

(dataEx3 <- readRDS("./data/dataEx3.rds"))

## # A tibble: 32 × 12
##      cyl  disp    hp  drat    wt  qsec    vs    am  carb   `4`   `3`   `5`
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     6  160    110  3.9   2.62  16.5     0     1     4  21    NA      NA
##  2     6  160    110  3.9   2.88  17.0     0     1     4  21    NA      NA
##  3     4  108     93  3.85  2.32  18.6     1     1     1  22.8  NA      NA
##  4     6  258    110  3.08  3.22  19.4     1     0     1  NA    21.4    NA
##  5     8  360    175  3.15  3.44  17.0     0     0     2  NA    18.7    NA
##  6     6  225    105  2.76  3.46  20.2     1     0     1  NA    18.1    NA
##  7     8  360    245  3.21  3.57  15.8     0     0     4  NA    14.3    NA
##  8     4  147.    62  3.69  3.19  20       1     0     2  24.4  NA      NA
##  9     4  141.    95  3.92  3.15  22.9     1     0     2  22.8  NA      NA
## 10     6  168.   123  3.92  3.44  18.3     1     0     4  19.2  NA      NA
## # … with 22 more rows

As you can see, some columns are not variable names but values. Create two new variables calling mpg (for observations) and gear (with column values).

dataEx3 <- dataEx3 %>%
  pivot_longer(
   cols = matches("([1-9])"),
   names_to = "gear",
   values_to = "mpg",
   values_drop_na = TRUE
  )

Select the columns mpg, hp, gear, and carb, and then make a plot using ggplot2 where mpg is the response variable, and hp is the co-variate in the x-axis. Also include different shapes and colours for gear, and facets for carb.

dataEx3 %>%
  select(mpg, hp, gear, carb) %>%
  ggplot(aes(y=mpg, x = hp, shape = gear,
             colour = gear)) +
  geom_point() +
  facet_wrap(~carb) +
  theme_bw()

Exercise 3

The following data represents song rankings for Billboard top 100 in the year 2000. The rank of the song is displayed in each week after it entered.

billboard

## # A tibble: 317 × 79
##    artist   track   date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>    <chr>   <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac    Baby D… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her  The Ha… 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors… Krypto… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors… Loser   2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz Wobble… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0     Give M… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens  Dancin… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah  I Don'… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah  Try Ag… 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, … Open M… 2000-08-26      76    76    74    69    68    67    61    58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …

A slightly more complex case where columns have a common prefix and missing missings are structural, so should be dropped. So, make this data tidy.

billboard %>%
  pivot_longer(
   cols = starts_with("wk"),
   names_to = "week",
   names_prefix = "wk",
   values_to = "rank",
   values_drop_na = TRUE
 )

Data Structure

Exercise 1

Make this data tidy by including tmin and tmax as variable. Remember that here type is carrying to variables names rather than factors.

(dataEx2 <- as_tibble(readRDS("./data/dataEx2.RDS")))

## # A tibble: 20 × 4
##    id    date       type  value
##    <chr> <date>     <fct> <dbl>
##  1 Ind 1 2020-01-01 tmin   35.8
##  2 Ind 1 2020-01-01 tmax   36.7
##  3 Ind 1 2020-01-02 tmin   35.9
##  4 Ind 1 2020-01-02 tmax   36.8
##  5 Ind 1 2020-01-03 tmin   36.0
##  6 Ind 1 2020-01-03 tmax   36.8
##  7 Ind 1 2020-01-04 tmin   36.1
##  8 Ind 1 2020-01-04 tmax   36.9
##  9 Ind 1 2020-01-05 tmin   36.1
## 10 Ind 1 2020-01-05 tmax   36.9
## 11 Ind 1 2020-01-06 tmin   36.2
## 12 Ind 1 2020-01-06 tmax   37.0
## 13 Ind 1 2020-01-07 tmin   36.3
## 14 Ind 1 2020-01-07 tmax   37.1
## 15 Ind 1 2020-01-08 tmin   36.3
## 16 Ind 1 2020-01-08 tmax   37.2
## 17 Ind 1 2020-01-09 tmin   36.6
## 18 Ind 1 2020-01-09 tmax   37.3
## 19 Ind 1 2020-01-10 tmin   36.6
## 20 Ind 1 2020-01-10 tmax   38.0

(dataEx2 <- dataEx2 %>%
  pivot_wider(values_from = value, names_from = type))

## # A tibble: 10 × 4
##    id    date        tmin  tmax
##    <chr> <date>     <dbl> <dbl>
##  1 Ind 1 2020-01-01  35.8  36.7
##  2 Ind 1 2020-01-02  35.9  36.8
##  3 Ind 1 2020-01-03  36.0  36.8
##  4 Ind 1 2020-01-04  36.1  36.9
##  5 Ind 1 2020-01-05  36.1  36.9
##  6 Ind 1 2020-01-06  36.2  37.0
##  7 Ind 1 2020-01-07  36.3  37.1
##  8 Ind 1 2020-01-08  36.3  37.2
##  9 Ind 1 2020-01-09  36.6  37.3
## 10 Ind 1 2020-01-10  36.6  38.0

Now, build a new variable called tdiff, which is the difference between tmax and tmin. Moreover, display a ggplot2 graph that shows tdiff over time.

dataEx2 %>%
  dplyr::mutate(tdiff = tmax - tmin) %>%
  ggplot(aes(y = tdiff, x = date)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ylab("tmax - tmin") + xlab ("Date") +
  theme_classic()

Exercise 2

Our cattle data data is already in a tidy format.

(cbp <- readRDS("./data/animal_sim.RDS"))

## # A tibble: 10,000 × 7
##      ind father mother year  sex   phenotype herd 
##    <dbl>  <dbl>  <dbl> <fct> <fct>     <dbl> <fct>
##  1     1     NA     NA 0     M          37.7 E    
##  2     2     NA     NA 0     M          35.8 B    
##  3     3     NA     NA 0     M          28.4 A    
##  4     4     NA     NA 0     M          33.6 D    
##  5     5     NA     NA 0     M          32.9 A    
##  6     6     NA     NA 0     M          31.6 A    
##  7     7     NA     NA 0     M          38.8 A    
##  8     8     NA     NA 0     M          33.3 E    
##  9     9     NA     NA 0     M          39.0 E    
## 10    10     NA     NA 0     M          46.0 C    
## # … with 9,990 more rows

For this exercise, complete the following tasks with that data set:

Calculate the average phenotype per year by sex and herd using the summarise() function in the dplyr package.
Add two columns to cattle data using the mutate() function:
1. Column 1: Phenotype should be rescaled to have a mean of zero and a standard deviation of one. You can call this new variable as PhenoStd.
2. Column 2: Rank the PhenoStd using the function min_rank().
3. The output data frame should have only PhenoStd > 0.

cbp %>%
  group_by(year, sex, herd) %>%
  summarise(
    mean = mean(phenotype)
  )

## # A tibble: 100 × 4
## # Groups:   year, sex [20]
##    year  sex   herd   mean
##    <fct> <fct> <fct> <dbl>
##  1 0     F     A      40.3
##  2 0     F     B      40.6
##  3 0     F     C      40.3
##  4 0     F     D      39.1
##  5 0     F     E      40.1
##  6 0     M     A      41.0
##  7 0     M     B      40.9
##  8 0     M     C      40.7
##  9 0     M     D      39.9
## 10 0     M     E      40.4
## # … with 90 more rows

.scale <- function(x){
  (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}

cbp %>%
  dplyr::mutate(PhenoStd = .scale(phenotype)) %>%
  dplyr::mutate(RankPhenoStd = min_rank(PhenoStd)) %>%
  filter(PhenoStd > 0)

## # A tibble: 5,057 × 9
##      ind father mother year  sex   phenotype herd  PhenoStd RankPhenoStd
##    <dbl>  <dbl>  <dbl> <fct> <fct>     <dbl> <fct>    <dbl>        <int>
##  1    11     NA     NA 0     M          51.0 A      0.103           5338
##  2    23     NA     NA 0     M          51.9 E      0.191           5661
##  3    29     NA     NA 0     M          53.5 C      0.355           6258
##  4    34     NA     NA 0     M          50.1 D      0.00987         4981
##  5    61     NA     NA 0     M          53.6 C      0.366           6306
##  6    65     NA     NA 0     M          55.2 A      0.529           6883
##  7    69     NA     NA 0     M          51.3 C      0.135           5461
##  8    84     NA     NA 0     M          52.0 A      0.207           5726
##  9    90     NA     NA 0     M          50.5 B      0.0506          5144
## 10   106     NA     NA 0     M          52.1 C      0.209           5731
## # … with 5,047 more rows

Exercise 3

Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country. This data has 142 countries observed from the year 1952 to 2007 in increments of 5 years. The response variable observed was the life expectancy at birth (in years), population size, and Per capita gross domestic product (GDP).

Per capita gross domestic product (GDP) measures a country’s economic response per person and is calculated by dividing its GDP by its population. It is a global measure for gauging the prosperity of nations as we can analyze the worth of a country based on its economic growth. Thus, countries that have the highest per capita GDP tend to be more developed.

gapminder

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

Questions:

What are the ten highest gdpPercap values?

gapminder %>%
  slice_max(gdpPercap, n = 10)

## # A tibble: 10 × 6
##    country   continent  year lifeExp     pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Kuwait    Asia       1957    58.0  212846   113523.
##  2 Kuwait    Asia       1972    67.7  841934   109348.
##  3 Kuwait    Asia       1952    55.6  160000   108382.
##  4 Kuwait    Asia       1962    60.5  358266    95458.
##  5 Kuwait    Asia       1967    64.6  575003    80895.
##  6 Kuwait    Asia       1977    69.3 1140357    59265.
##  7 Norway    Europe     2007    80.2 4627926    49357.
##  8 Kuwait    Asia       2007    77.6 2505559    47307.
##  9 Singapore Asia       2007    80.0 4553009    47143.
## 10 Norway    Europe     2002    79.0 4535591    44684.

Find both the median life expectancy (lifeExp) and the median and maximum GDP per capita (gdpPercap) in 1957, 1982, and 2007, by country and continent. Call them medianLifeExp, medianGdpPercap, and maxGdpPercap, respectively.

(dat <- gapminder %>%
  filter(year %in% c("1957","1982","2007")) %>%
  group_by(year, country, continent) %>%
  summarise(
    medianlifeExp = median(lifeExp),
    medianGdpPercap = median(gdpPercap),
    maxGdpPercap = max(gdpPercap)
  ))

## # A tibble: 426 × 6
## # Groups:   year, country [426]
##     year country     continent medianlifeExp medianGdpPercap maxGdpPercap
##    <int> <fct>       <fct>             <dbl>           <dbl>        <dbl>
##  1  1957 Afghanistan Asia               30.3            821.         821.
##  2  1957 Albania     Europe             59.3           1942.        1942.
##  3  1957 Algeria     Africa             45.7           3014.        3014.
##  4  1957 Angola      Africa             32.0           3828.        3828.
##  5  1957 Argentina   Americas           64.4           6857.        6857.
##  6  1957 Australia   Oceania            70.3          10950.       10950.
##  7  1957 Austria     Europe             67.5           8843.        8843.
##  8  1957 Bahrain     Asia               53.8          11636.       11636.
##  9  1957 Bangladesh  Asia               39.3            662.         662.
## 10  1957 Belgium     Europe             69.2           9715.        9715.
## # … with 416 more rows

Use a scatter plot to compare the median GDP and median life expectancy. Use the variables continent and year to produce this plot.

dat %>%
  ggplot(aes(x = medianGdpPercap, y = medianlifeExp)) +
  facet_wrap(~ continent) +
  geom_point(shape = 1)

Thiago de Paula Oliveira

Consultant Statistician

My research interests include statistical modelling, agriculture, genetics, and sports.