Tidyverse
Visualization and Data Structure
Tidy data is data arranged so that processing, analysis, and visualization become simpler. Remember that a tidy data set follows three rules:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
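As a minimal sketch of these three rules, the toy table below (hypothetical data, not part of the exercises) stores years in its column names, so the year values are hidden in the header. Reshaping it with `pivot_longer()` gives one column per variable and one row per observation.

```r
library(tidyr)

# Hypothetical toy data: the column names `2019` and `2020` are values
# of a "year" variable, so this table is not tidy.
untidy <- tibble::tibble(
  site   = c("north", "south"),
  `2019` = c(3.1, 2.7),
  `2020` = c(3.4, 2.9)
)

# Tidy version: site, year, and yield are variables;
# each row is one site-year observation.
tidy <- pivot_longer(
  untidy,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "yield"
)
tidy
```

Note that `year` comes out as a character column here, because it is built from column names.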
Exercises
Exercise 1
Let’s say we want to organize the anscombe data set. Here is what the data looks like:
anscombe
- Organize this data set to obtain tidy data. Remember that here two variables (x and y) have each been measured four times.
ex1 <- anscombe %>%
- Filter the data set to get replications 2 and 4, and summarise it to get the maximum, minimum, and mean values.
ex1 %>%
filter() %>%
summarise(
)
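The filter-then-summarise pattern can be sketched on a small hypothetical table. The names `toy`, `set`, and `value` are illustrative assumptions and do not come from the exercise data.

```r
library(dplyr)

# Hypothetical long-format data: a set id and one measured value.
toy <- tibble::tibble(
  set   = rep(1:4, each = 3),
  value = c(2, 4, 6, 1, 3, 5, 10, 20, 30, 7, 8, 9)
)

res <- toy %>%
  filter(set %in% c(2, 4)) %>%   # keep sets 2 and 4 only
  group_by(set) %>%              # summarise each set separately
  summarise(
    value_max  = max(value),
    value_min  = min(value),
    value_mean = mean(value)
  )
res
```

Using `%in%` rather than `==` keeps every row whose set id matches any of the listed values.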
Exercise 2
Often you do not need the entire data set, but just part of it.
- Here, you should make the mtcars data tidy before making any selection.
(dataEx3 <- readRDS("./data/dataEx3.rds"))
As you can see, some column names are values rather than variable names. Create two new variables called mpg (holding the observations) and gear (holding the former column names).
dataEx3 <- dataEx3 %>%
pivot_longer(
)
- Select the columns mpg, hp, gear, and carb, and then make a plot using ggplot2 where mpg is the response variable and hp is the covariate on the x-axis. Also use different shapes and colours for gear, and facets for carb.
dataEx3 %>%
select() %>%
ggplot() +
geom_point() +
facet_wrap() +
theme_bw()
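One detail that is easy to get wrong: data steps are chained with `%>%`, but ggplot2 layers are combined with `+`. The sketch below shows the pattern on a hypothetical toy table; the names `dose`, `response`, `batch`, and `site` are assumptions made for illustration only.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with one covariate, one response, and two groupings.
toy <- tibble::tibble(
  dose     = rep(c(1, 2, 3), times = 4),
  response = c(2.1, 3.9, 6.2, 1.8, 4.1, 5.9, 2.5, 4.8, 7.1, 2.2, 5.1, 6.8),
  batch    = rep(c("a", "b"), each = 6),
  site     = rep(rep(c("north", "south"), each = 3), times = 2)
)

p <- toy %>%
  select(dose, response, batch, site) %>%   # data steps use the pipe
  ggplot(aes(x = dose, y = response,
             colour = batch, shape = batch)) +  # plot layers use `+`
  geom_point() +
  facet_wrap(~ site) +
  theme_bw()
p
```

If the grouping variable is numeric, wrapping it in `factor()` inside `aes()` is a common way to get discrete colours and shapes.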
Exercise 3
The following data represents song rankings for the Billboard Top 100 in the year 2000. Each weekly column shows the rank of the song in that week after it entered the chart.
billboard
This is a slightly more complex case: the columns share a common prefix, and the missing values are structural, so they should be dropped. Make this data tidy.
billboard %>%
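A minimal sketch of handling a shared column prefix and structural missing values, using a hypothetical three-week chart rather than the real billboard data. The table `charts` and its values are invented for illustration.

```r
library(tidyr)

# Hypothetical chart data: week columns share the prefix "wk";
# NA means the song had already left the chart.
charts <- tibble::tibble(
  track = c("song A", "song B"),
  wk1   = c(10, 55),
  wk2   = c(8, NA),
  wk3   = c(5, NA)
)

charts_long <- pivot_longer(
  charts,
  cols            = starts_with("wk"),
  names_to        = "week",
  names_prefix    = "wk",                      # strip the common prefix
  names_transform = list(week = as.integer),   # "1" -> 1L
  values_to       = "rank",
  values_drop_na  = TRUE                       # drop structural NAs
)
charts_long
```

Dropping the missing rows is reasonable here only because they carry no information: a song that is off the chart has no rank to record.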
Data Structure
Exercise 1
- Make this data tidy by turning tmin and tmax into variables. Remember that here the type column holds variable names rather than factor levels.
(dataEx2 <- as_tibble(readRDS("./data/dataEx2.RDS")))
dataEx2 <- dataEx2 %>%
pivot_wider()
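When one column holds variable names and another holds their values, `pivot_wider()` spreads them into separate columns. The sketch below uses a hypothetical table; `day`, `measure`, and `reading` are illustrative names, not the columns of dataEx2.

```r
library(tidyr)

# Hypothetical long data: `measure` stores variable names, not factor levels.
weather_long <- tibble::tibble(
  day     = c(1, 1, 2, 2),
  measure = c("low", "high", "low", "high"),
  reading = c(11.5, 21.0, 9.8, 19.4)
)

weather_wide <- pivot_wider(
  weather_long,
  names_from  = measure,   # new column names come from here
  values_from = reading    # cell values come from here
)
weather_wide
```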
Now, build a new variable called tdiff, which is the difference between tmax and tmin. Then display a ggplot2 graph that shows tdiff over time.
dataEx2 %>%
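Deriving a column and plotting it over time can be sketched as follows. The table `temps` and its columns are hypothetical and stand in for whatever the tidied data looks like.

```r
library(dplyr)
library(ggplot2)

# Hypothetical daily low/high readings.
temps <- tibble::tibble(
  day  = 1:5,
  low  = c(10, 12, 9, 11, 13),
  high = c(20, 25, 18, 24, 27)
)

# mutate() adds the derived column without touching the existing ones.
temps_spread <- temps %>%
  mutate(spread = high - low)

p <- ggplot(temps_spread, aes(x = day, y = spread)) +
  geom_line() +
  geom_point()
p
```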
Exercise 2
Our cattle data is already in a tidy format.
(cbp <- readRDS("./data/animal_sim.RDS"))
For this exercise, complete the following tasks with that data set:
- Calculate the average phenotype per year by sex and herd using the summarise() function in the dplyr package.
- Add two columns to the cattle data using the mutate() function:
  - Column 1: Phenotype should be rescaled to have a mean of zero and a standard deviation of one. You can call this new variable PhenoStd.
  - Column 2: Rank PhenoStd using the function min_rank().
- The output data frame should contain only rows with PhenoStd > 0.
cbp %>%
summarise(
) %>%
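The tasks above combine a grouped summary with standardizing, ranking, and filtering. A minimal sketch on hypothetical data follows; the table `herd_toy` and the columns `herd`, `sex`, and `score` are assumptions and not the columns of the cattle data.

```r
library(dplyr)

# Hypothetical animal records.
herd_toy <- tibble::tibble(
  herd  = c("h1", "h1", "h2", "h2", "h2"),
  sex   = c("F", "M", "F", "F", "M"),
  score = c(10, 14, 9, 12, 15)
)

# Average per combination of grouping variables.
group_means <- herd_toy %>%
  group_by(herd, sex) %>%
  summarise(mean_score = mean(score), .groups = "drop")

# Standardize (mean 0, sd 1), rank, then keep above-average rows.
ranked <- herd_toy %>%
  mutate(
    score_std  = (score - mean(score)) / sd(score),
    score_rank = min_rank(score_std)   # rank 1 = smallest value
  ) %>%
  filter(score_std > 0)
ranked
```

Because the rank is computed before the filter, the surviving rows keep their rank within the full data set. Swapping the two steps would rank only the retained rows.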
Exercise 3
This is an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country. It covers 142 countries observed from 1952 to 2007 in increments of 5 years. The variables observed are life expectancy at birth (in years), population size, and per capita gross domestic product (GDP).
Per capita GDP measures a country’s economic output per person and is calculated by dividing its GDP by its population. It is widely used to gauge the prosperity of nations, and countries with the highest per capita GDP tend to be more developed.
gapminder
Questions:
- What are the ten highest gdpPercap values?
gapminder %>%
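Two common ways to pull out the largest values of a column are `slice_max()` and `arrange(desc(...))` followed by `head()`. The sketch below uses a hypothetical `sales` table so that it does not depend on the gapminder data.

```r
library(dplyr)

# Hypothetical data: one amount per shop.
sales <- tibble::tibble(
  shop   = c("a", "b", "c", "d", "e"),
  amount = c(120, 340, 90, 510, 275)
)

# Option 1: slice_max() keeps the n rows with the largest values.
top3 <- sales %>%
  slice_max(amount, n = 3)

# Option 2: sort in descending order and take the first rows.
top3_alt <- sales %>%
  arrange(desc(amount)) %>%
  head(3)
top3
```

The two options can differ when values are tied: by default `slice_max()` keeps all tied rows, so it may return more than `n` rows.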
- Find the median life expectancy (lifeExp) as well as the median and maximum GDP per capita (gdpPercap) in 1957, 1982, and 2007, by country and continent. Call them medianLifeExp, medianGdpPercap, and maxGdpPercap, respectively.
dat <- gapminder %>%
- Use a scatter plot to compare the median GDP and median life expectancy. Use the variables continent and year to produce this plot.
dat %>%