library(tidyverse)Wrangling data in the tidyverse
When working with a data frame, you will usually want to modify it in some way. You may want to make some calculations based on different columns in the data or calculate some summary statistics. You may only need a subset of the data, either of the rows or columns. You might want to split the data into separate data frames or join together related data frames. Or you may need to clean up the data, changing column names, modifying values in different ways, or changing the shape of the data frame before doing your analysis and visualization. These are some of the more common tasks of data wrangling.1
We can break these tasks into three main categories:
- Manipulating rows and columns.
- Joining together related data frames.
- Modifying the structure of a data frame.
Pipes
Before getting into the details of how to go about doing these data wrangling tasks using the tidyverse, we should introduce a tool that will make this process of data cleaning much easier to write and more legible to read, namely the pipe (|>).2 In doing almost any data wrangling task, you will want to perform multiple manipulations. You may want to remove some columns, and then remove rows with missing data, and then calculate the frequency of one or more column, and then calculate some summary statistics.
The pipe is the and then. It links together multiple functions together, taking the output from the left of the pipe and moving it to the function on the right. Using the pipe will look something like this, and remember you can think of it as and then:
historians |>
read secondary literature |>
read primary sources |>
write |>
do more research |>
rewrite |>
publish
In this analogy historians take the place of the data and the various actions are the functions.
1. Manipulating rows and columns with dplyr
The core set of dplyr functions are designed to work with the pipe. They take in a data frame as their input in the first argument (named .data); subsequent arguments include the (unquoted) column names that are operated on; and the output is a data frame, which can be piped into another function. You can think of these functions as verbs to manipulate rows and columns. The core functions are:
| Function | Description |
|---|---|
select() |
Subset columns. |
filter() |
Subset rows by conditions based on data in columns. |
mutate() |
Create new columns using information from other columns. |
group_by() |
Group the data by values in column such as by village or by sex. |
summarize() |
Aggregate the data, usually on grouped data, to create summary tables. |
arrange() |
Arrange the order of rows by values in columns. Often used with desc() to reverse default order. |
- To practice with these functions go to the Wrangling data with dplyr worksheet.
- To read more about the core dplyr functions, see R for Data Science: Data Transformation.
- For even more practice, check out the Data Carpentries, R for Social Science episode on Data Wrangling with dplyr.
Let’s quickly show how these work using the penguins dataset.
penguins |>
select(species, bill_len, bill_dep, sex) |> # Subset columns, not strictly necessary
filter(!is.na(bill_len) | !is.na(bill_dep)) |> # Remove missing values for bill measurements
mutate(bill_ratio = bill_dep / bill_len) |> # Create new column for ration of depth to length
group_by(species, sex) |> # Group the data by species and sex
summarize(avg_ratio = mean(bill_ratio), # Find the average bill ratio broken down by species and sex
n = n(), # Count the number of observations in each group
.groups = "drop") |> # Drop the remaining grouping of sex
arrange(desc(avg_ratio)) # Arrange the rows by the average ratio in descending order# A tibble: 8 × 4
species sex avg_ratio n
<fct> <fct> <dbl> <int>
1 Adelie <NA> 0.485 5
2 Adelie female 0.474 73
3 Adelie male 0.474 73
4 Chinstrap female 0.379 34
5 Chinstrap male 0.377 34
6 Gentoo <NA> 0.319 4
7 Gentoo male 0.318 61
8 Gentoo female 0.313 58
3. Modifying the structure of a data frame
A second task you may face is needing to alter the structure of your data to make it tidy. This is the specialty of the tidyr package. It takes us back to the principles of tidy data:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
Wide to long
However, data is not always tidy; sometimes for good reason. For instance, in collecting data it might be easier to create a data frame in a wide format in which variable values are spread out in separate columns. For instance, a data frame with the counts for each penguin species by year.
penguins_wide# A tibble: 3 × 4
species `2007` `2008` `2009`
<fct> <int> <int> <int>
1 Adelie 50 50 52
2 Chinstrap 26 18 24
3 Gentoo 34 46 44
To make this data frame tidy we need to create a new column for year and a new column for the values in the current year columns, in this case count. to do this we use the pivot_longer() function. The key arguments for the function are:
cols: Columns to pivot into a longer format, or alternatively the columns to disregard in pivoting using the not operator (!).names_to: The name new column or columns to create from the information stored in the column names specified by thecolsargument. Should be a name in quotation marks.values_to: The name of the column to create from the data stored in cell values. Should be a name in quotation marks.
Let’s see how this works.
penguins_wide |>
pivot_longer(
cols = !species,
names_to = "year",
values_to = "count"
)# A tibble: 9 × 3
species year count
<fct> <chr> <int>
1 Adelie 2007 50
2 Adelie 2008 50
3 Adelie 2009 52
4 Chinstrap 2007 26
5 Chinstrap 2008 18
6 Chinstrap 2009 24
7 Gentoo 2007 34
8 Gentoo 2008 46
9 Gentoo 2009 44
Long to wide
Going from a wider format to a longer format is the more common transformation, but you may find yourself in the situation of wanting to use a wider format to present your data in a table. We can reverse what we did with pivot_longer() using pivot_wider(). The key arguments mirror those for pivot_longer():
names_from: Column used to create new column names. Does not need to be in quotes because it is a column in the data.values_from: Column used to create cell values. Does not need to be in quotes because it is a column in the data.
penguins_long |>
pivot_wider(
names_from = year,
values_from = count)# A tibble: 3 × 4
species `2007` `2008` `2009`
<fct> <int> <int> <int>
1 Adelie 50 50 52
2 Chinstrap 26 18 24
3 Gentoo 34 46 44
Resources
These examples provide a short glimpse of the ways that you can manipulate the structure of data frames with tidyr. For more information see:
- A good overview is provided by the Data tidying chapter of R for Data Science.
- The tidy data article in the tidyr package. It is a code heavy version of Hadley Wickham’s Tidy data paper.
- The Pivoting article in the tidyr package.
- For even more practice, check out the Data Carpentries, R for Social Science episode on Data Wrangling with tidyr.
Footnotes
Hadley Wickham, “Tidy Data,” Journal of Statistical Software 59, no. 10 (2014), https://doi.org/10.18637/jss.v059.i10.↩︎
This is the base R pipe that was introduced in R 4.1 in 2021. You may also see the magrittr pipe (
%>%). Their functionality is essentially the same. The base R pipe is created using the pipe character (Shift + backslash) and the greater than sign. You can quickly type it out with the keyboard shortcutCmd/Ctrl + Shift + M. To setup RStudio to use the base pipe with this keyboard shortcut, see Setting up RStudio for success.↩︎