Wrangling data in the tidyverse

When working with a data frame, you will usually want to modify it in some way. You may want to make some calculations based on different columns in the data or calculate some summary statistics. You may only need a subset of the data, either of the rows or columns. You might want to split the data into separate data frames or join together related data frames. Or you may need to clean up the data, changing column names, modifying values in different ways, or changing the shape of the data frame before doing your analysis and visualization. These are some of the more common tasks of data wrangling.¹

We can break these tasks into three main categories:

Manipulating rows and columns.
Joining together related data frames.
Modifying the structure of a data frame.

Pipes

Before getting into the details of how to go about doing these data wrangling tasks using the tidyverse, we should introduce a tool that will make this process of data cleaning much easier to write and more legible to read, namely the pipe (|>).² In doing almost any data wrangling task, you will want to perform multiple manipulations. You may want to remove some columns, and then remove rows with missing data, and then calculate the frequency of one or more column, and then calculate some summary statistics.

The pipe is the and then. It links together multiple functions together, taking the output from the left of the pipe and moving it to the function on the right. Using the pipe will look something like this, and remember you can think of it as and then:

historians |>
  read secondary literature |>
  read primary sources |>
  write |>
  do more research |>
  rewrite |>
  publish

In this analogy historians take the place of the data and the various actions are the functions.

1. Manipulating rows and columns with dplyr

The core set of dplyr functions are designed to work with the pipe. They take in a data frame as their input in the first argument (named .data); subsequent arguments include the (unquoted) column names that are operated on; and the output is a data frame, which can be piped into another function. You can think of these functions as verbs to manipulate rows and columns. The core functions are:

Function	Description
`select()`	Subset columns.
`filter()`	Subset rows by conditions based on data in columns.
`mutate()`	Create new columns using information from other columns.
`group_by()`	Group the data by values in column such as by village or by sex.
`summarize()`	Aggregate the data, usually on grouped data, to create summary tables.
`arrange()`	Arrange the order of rows by values in columns. Often used with `desc()` to reverse default order.

To practice with these functions go to the Wrangling data with dplyr worksheet.
To read more about the core dplyr functions, see R for Data Science: Data Transformation.
For even more practice, check out the Data Carpentries, R for Social Science episode on Data Wrangling with dplyr.

Let’s quickly show how these work using the penguins dataset.

library(tidyverse)

penguins |> 
  select(species, bill_len, bill_dep, sex) |> # Subset columns, not strictly necessary
  filter(!is.na(bill_len) | !is.na(bill_dep)) |> # Remove missing values for bill measurements
  mutate(bill_ratio = bill_dep / bill_len) |> # Create new column for ration of depth to length
  group_by(species, sex) |> # Group the data by species and sex
  summarize(avg_ratio = mean(bill_ratio), # Find the average bill ratio broken down by species and sex
            n = n(), # Count the number of observations in each group
            .groups = "drop") |> # Drop the remaining grouping of sex
  arrange(desc(avg_ratio)) # Arrange the rows by the average ratio in descending order

# A tibble: 8 × 4
  species   sex    avg_ratio     n
  <fct>     <fct>      <dbl> <int>
1 Adelie    <NA>       0.485     5
2 Adelie    female     0.474    73
3 Adelie    male       0.474    73
4 Chinstrap female     0.379    34
5 Chinstrap male       0.377    34
6 Gentoo    <NA>       0.319     4
7 Gentoo    male       0.318    61
8 Gentoo    female     0.313    58

2. Joining together related data frames

A common data wrangling task is to join together related data frames. This situation can come about by the desire split up a dataset into multiple datasets to keep them clean and tidy. A common example in the humanities might be a dataset that has location names that correspond to latitude and longitude values. Instead of having latitude and longitude columns for all of our data, we might want to have a separate data frame with that information that we only need to join to the rest of the dataset when we want to create a geographic visualization. We can do this with joins. Let’s see how it works with some data about the location of siblings at different times.

# Religious affiliation of the family members
religion

# A tibble: 13 × 2
   person                   religion 
   <chr>                    <chr>    
 1 Elizabeth Zeghers        Calvinist
 2 Jan della Faille de Oude Catholic 
 3 Andries van der Meulen   Calvinist
 4 Daniel van der Meulen    Calvinist
 5 Anna van der Meulen      Calvinist
 6 Sara van der Meulen      Calvinist
 7 Jan della Faille         Calvinist
 8 Marten della Faille      Catholic 
 9 Carlo della Faille       Calvinist
10 Anna della Faille        Catholic 
11 Steven della Faille      Calvinist
12 Jacques della Faille     Calvinist
13 Hester della Faille      Calvinist

# Movements of the siblings
movements

# A tibble: 74 × 3
   person                 place     date      
   <chr>                  <chr>     <date>    
 1 Andries van der Meulen Antwerp   1584-07-01
 2 Andries van der Meulen Bremen    1585-09-01
 3 Anna van der Meulen    Cologne   1584-07-01
 4 Anna van der Meulen    Bremen    1588-06-01
 5 Anna van der Meulen    Stade     1588-08-01
 6 Anna van der Meulen    Bremen    1592-03-01
 7 Sara van der Meulen    Antwerp   1584-07-01
 8 Sara van der Meulen    Bremen    1585-09-01
 9 Sara van der Meulen    Frankfurt 1586-03-01
10 Sara van der Meulen    Cologne   1586-05-01
# ℹ 64 more rows

# Latitude and longitude of the locations of the siblings
locations

# A tibble: 21 × 3
   location          lat   lng
   <chr>           <dbl> <dbl>
 1 Antwerp          51.2  4.40
 2 Bremen           53.1  8.81
 3 Cologne          50.9  6.96
 4 Stade            53.6  9.48
 5 Frankfurt        50.1  8.68
 6 Utrecht          52.1  5.13
 7 Leiden           52.2  4.48
 8 Geertruidenberg  51.7  4.85
 9 Bergen op Zoom   51.5  4.29
10 The Hague        52.1  4.31
# ℹ 11 more rows

There are four main join functions. All take two data frames and join them by a column that is shared by both data frames. The names of the column to join by do not need to be the same, but it this case they are, namely "person". The four functions differ in how they join the data frames.

Function	Description
`left_join()`	Keep all observations in the first (left) data frame.
`right_join()`	Keep all observations in the second (right) data frame.
`full_join()`	Keep all observations from both data frames.
`inner_join()`	Only keep observations that are in both data frames, dropping rows that do not have a match.

Let’s start with a left join between religion and movements:

left_join(religion, movements, by = "person")

# A tibble: 76 × 4
   person                   religion  place      date      
   <chr>                    <chr>     <chr>      <date>    
 1 Elizabeth Zeghers        Calvinist <NA>       NA        
 2 Jan della Faille de Oude Catholic  <NA>       NA        
 3 Andries van der Meulen   Calvinist Antwerp    1584-07-01
 4 Andries van der Meulen   Calvinist Bremen     1585-09-01
 5 Daniel van der Meulen    Calvinist Antwerp    1584-07-01
 6 Daniel van der Meulen    Calvinist Middelburg 1584-08-01
 7 Daniel van der Meulen    Calvinist Delft      1584-09-01
 8 Daniel van der Meulen    Calvinist Haarlem    1584-12-01
 9 Daniel van der Meulen    Calvinist Delft      1585-02-01
10 Daniel van der Meulen    Calvinist Bremen     1585-10-01
# ℹ 66 more rows

The NAs for place and date for Elizabeth Zeghers and Jan della Faille de Oude show that there is no movement data for these individuals. We could drop these by doing a right join instead. Note the different in the number of rows. An alternative would be to switch the position of the dataframes within the left_join().

right_join(religion, movements, by = "person")

# A tibble: 74 × 4
   person                 religion  place      date      
   <chr>                  <chr>     <chr>      <date>    
 1 Andries van der Meulen Calvinist Antwerp    1584-07-01
 2 Andries van der Meulen Calvinist Bremen     1585-09-01
 3 Daniel van der Meulen  Calvinist Antwerp    1584-07-01
 4 Daniel van der Meulen  Calvinist Middelburg 1584-08-01
 5 Daniel van der Meulen  Calvinist Delft      1584-09-01
 6 Daniel van der Meulen  Calvinist Haarlem    1584-12-01
 7 Daniel van der Meulen  Calvinist Delft      1585-02-01
 8 Daniel van der Meulen  Calvinist Bremen     1585-10-01
 9 Daniel van der Meulen  Calvinist Middelburg 1586-05-01
10 Daniel van der Meulen  Calvinist Delft      1586-06-01
# ℹ 64 more rows

Now let’s check joining the movements data frame to the locations data frame. Here we have the complication that our columns to join by have different names. We can use the join_by() function to tell R which columns are equivalent.

left_join(movements, locations,
          by = join_by(place == location))

# A tibble: 74 × 5
   person                 place     date         lat   lng
   <chr>                  <chr>     <date>     <dbl> <dbl>
 1 Andries van der Meulen Antwerp   1584-07-01  51.2  4.40
 2 Andries van der Meulen Bremen    1585-09-01  53.1  8.81
 3 Anna van der Meulen    Cologne   1584-07-01  50.9  6.96
 4 Anna van der Meulen    Bremen    1588-06-01  53.1  8.81
 5 Anna van der Meulen    Stade     1588-08-01  53.6  9.48
 6 Anna van der Meulen    Bremen    1592-03-01  53.1  8.81
 7 Sara van der Meulen    Antwerp   1584-07-01  51.2  4.40
 8 Sara van der Meulen    Bremen    1585-09-01  53.1  8.81
 9 Sara van der Meulen    Frankfurt 1586-03-01  50.1  8.68
10 Sara van der Meulen    Cologne   1586-05-01  50.9  6.96
# ℹ 64 more rows

Let’s now join all three data frames together using the pipe.

religion |> 
  right_join(movements, by = "person") |> 
  left_join(locations,
            by = join_by(place == location))

# A tibble: 74 × 6
   person                 religion  place      date         lat   lng
   <chr>                  <chr>     <chr>      <date>     <dbl> <dbl>
 1 Andries van der Meulen Calvinist Antwerp    1584-07-01  51.2  4.40
 2 Andries van der Meulen Calvinist Bremen     1585-09-01  53.1  8.81
 3 Daniel van der Meulen  Calvinist Antwerp    1584-07-01  51.2  4.40
 4 Daniel van der Meulen  Calvinist Middelburg 1584-08-01  51.5  3.61
 5 Daniel van der Meulen  Calvinist Delft      1584-09-01  52.0  4.36
 6 Daniel van der Meulen  Calvinist Haarlem    1584-12-01  52.4  4.64
 7 Daniel van der Meulen  Calvinist Delft      1585-02-01  52.0  4.36
 8 Daniel van der Meulen  Calvinist Bremen     1585-10-01  53.1  8.81
 9 Daniel van der Meulen  Calvinist Middelburg 1586-05-01  51.5  3.61
10 Daniel van der Meulen  Calvinist Delft      1586-06-01  52.0  4.36
# ℹ 64 more rows

Now are data would be ready to map. This ability to quickly join together related datasets makes it possible to greatly simplify the structure of the data if you are in the situation of creating your dataset. Instead of writing the latitude and longitude over and over again, possibly introducing copy and paste errors, you can have it in one place and only use it when necessary.

Download the three data frames to play around with them: religion, movements, and locations.

Resources

A good overview is provided by the Joins chapter of R for Data Science.
See also the Two-table verbs article in the dplyr package.

3. Modifying the structure of a data frame

A second task you may face is needing to alter the structure of your data to make it tidy. This is the specialty of the tidyr package. It takes us back to the principles of tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Wide to long

However, data is not always tidy; sometimes for good reason. For instance, in collecting data it might be easier to create a data frame in a wide format in which variable values are spread out in separate columns. For instance, a data frame with the counts for each penguin species by year.

penguins_wide

# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <fct>      <int>  <int>  <int>
1 Adelie        50     50     52
2 Chinstrap     26     18     24
3 Gentoo        34     46     44

To make this data frame tidy we need to create a new column for year and a new column for the values in the current year columns, in this case count. to do this we use the pivot_longer() function. The key arguments for the function are:

cols: Columns to pivot into a longer format, or alternatively the columns to disregard in pivoting using the not operator (!).
names_to: The name new column or columns to create from the information stored in the column names specified by the cols argument. Should be a name in quotation marks.
values_to: The name of the column to create from the data stored in cell values. Should be a name in quotation marks.

Let’s see how this works.

penguins_wide |> 
  pivot_longer(
    cols = !species,
    names_to = "year",
    values_to = "count"
  )

# A tibble: 9 × 3
  species   year  count
  <fct>     <chr> <int>
1 Adelie    2007     50
2 Adelie    2008     50
3 Adelie    2009     52
4 Chinstrap 2007     26
5 Chinstrap 2008     18
6 Chinstrap 2009     24
7 Gentoo    2007     34
8 Gentoo    2008     46
9 Gentoo    2009     44

Long to wide

Going from a wider format to a longer format is the more common transformation, but you may find yourself in the situation of wanting to use a wider format to present your data in a table. We can reverse what we did with pivot_longer() using pivot_wider(). The key arguments mirror those for pivot_longer():

names_from: Column used to create new column names. Does not need to be in quotes because it is a column in the data.
values_from: Column used to create cell values. Does not need to be in quotes because it is a column in the data.

penguins_long |> 
  pivot_wider(
    names_from = year,
    values_from = count)

# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <fct>      <int>  <int>  <int>
1 Adelie        50     50     52
2 Chinstrap     26     18     24
3 Gentoo        34     46     44

Resources

These examples provide a short glimpse of the ways that you can manipulate the structure of data frames with tidyr. For more information see:

A good overview is provided by the Data tidying chapter of R for Data Science.
The tidy data article in the tidyr package. It is a code heavy version of Hadley Wickham’s Tidy data paper.
The Pivoting article in the tidyr package.
For even more practice, check out the Data Carpentries, R for Social Science episode on Data Wrangling with tidyr.

Footnotes

Hadley Wickham, “Tidy Data,” Journal of Statistical Software 59, no. 10 (2014), https://doi.org/10.18637/jss.v059.i10.↩︎
This is the base R pipe that was introduced in R 4.1 in 2021. You may also see the magrittr pipe (%>%). Their functionality is essentially the same. The base R pipe is created using the pipe character (Shift + backslash) and the greater than sign. You can quickly type it out with the keyboard shortcut Cmd/Ctrl + Shift + M. To setup RStudio to use the base pipe with this keyboard shortcut, see Setting up RStudio for success.↩︎