Wrangling data in the tidyverse

When working with a data frame, you will usually want to modify it in some way. You may want to make some calculations based on different columns in the data or calculate some summary statistics. You may only need a subset of the data, either of the rows or columns. You might want to split the data into separate data frames or join together related data frames. Or you may need to clean up the data, changing column names, modifying values in different ways, or changing the shape of the data frame before doing your analysis and visualization. These are some of the more common tasks of data wrangling.1

We can break these tasks into three main categories:

  1. Manipulating rows and columns.
  2. Joining together related data frames.
  3. Modifying the structure of a data frame.

Pipes

Before getting into the details of how to go about doing these data wrangling tasks using the tidyverse, we should introduce a tool that will make this process of data cleaning much easier to write and more legible to read, namely the pipe (|>).2 In doing almost any data wrangling task, you will want to perform multiple manipulations. You may want to remove some columns, and then remove rows with missing data, and then calculate the frequency of one or more column, and then calculate some summary statistics.

The pipe is the and then. It links together multiple functions together, taking the output from the left of the pipe and moving it to the function on the right. Using the pipe will look something like this, and remember you can think of it as and then:

historians |>
  read secondary literature |>
  read primary sources |>
  write |>
  do more research |>
  rewrite |>
  publish

In this analogy historians take the place of the data and the various actions are the functions.

1. Manipulating rows and columns with dplyr

The core set of dplyr functions are designed to work with the pipe. They take in a data frame as their input in the first argument (named .data); subsequent arguments include the (unquoted) column names that are operated on; and the output is a data frame, which can be piped into another function. You can think of these functions as verbs to manipulate rows and columns. The core functions are:

Function Description
select() Subset columns.
filter() Subset rows by conditions based on data in columns.
mutate() Create new columns using information from other columns.
group_by() Group the data by values in column such as by village or by sex.
summarize() Aggregate the data, usually on grouped data, to create summary tables.
arrange() Arrange the order of rows by values in columns. Often used with desc() to reverse default order.

Let’s quickly show how these work using the penguins dataset.

library(tidyverse)
penguins |> 
  select(species, bill_len, bill_dep, sex) |> # Subset columns, not strictly necessary
  filter(!is.na(bill_len) | !is.na(bill_dep)) |> # Remove missing values for bill measurements
  mutate(bill_ratio = bill_dep / bill_len) |> # Create new column for ration of depth to length
  group_by(species, sex) |> # Group the data by species and sex
  summarize(avg_ratio = mean(bill_ratio), # Find the average bill ratio broken down by species and sex
            n = n(), # Count the number of observations in each group
            .groups = "drop") |> # Drop the remaining grouping of sex
  arrange(desc(avg_ratio)) # Arrange the rows by the average ratio in descending order
# A tibble: 8 × 4
  species   sex    avg_ratio     n
  <fct>     <fct>      <dbl> <int>
1 Adelie    <NA>       0.485     5
2 Adelie    female     0.474    73
3 Adelie    male       0.474    73
4 Chinstrap female     0.379    34
5 Chinstrap male       0.377    34
6 Gentoo    <NA>       0.319     4
7 Gentoo    male       0.318    61
8 Gentoo    female     0.313    58

3. Modifying the structure of a data frame

A second task you may face is needing to alter the structure of your data to make it tidy. This is the specialty of the tidyr package. It takes us back to the principles of tidy data:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Wide to long

However, data is not always tidy; sometimes for good reason. For instance, in collecting data it might be easier to create a data frame in a wide format in which variable values are spread out in separate columns. For instance, a data frame with the counts for each penguin species by year.

penguins_wide
# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <fct>      <int>  <int>  <int>
1 Adelie        50     50     52
2 Chinstrap     26     18     24
3 Gentoo        34     46     44

To make this data frame tidy we need to create a new column for year and a new column for the values in the current year columns, in this case count. to do this we use the pivot_longer() function. The key arguments for the function are:

  • cols: Columns to pivot into a longer format, or alternatively the columns to disregard in pivoting using the not operator (!).
  • names_to: The name new column or columns to create from the information stored in the column names specified by the cols argument. Should be a name in quotation marks.
  • values_to: The name of the column to create from the data stored in cell values. Should be a name in quotation marks.

Let’s see how this works.

penguins_wide |> 
  pivot_longer(
    cols = !species,
    names_to = "year",
    values_to = "count"
  )
# A tibble: 9 × 3
  species   year  count
  <fct>     <chr> <int>
1 Adelie    2007     50
2 Adelie    2008     50
3 Adelie    2009     52
4 Chinstrap 2007     26
5 Chinstrap 2008     18
6 Chinstrap 2009     24
7 Gentoo    2007     34
8 Gentoo    2008     46
9 Gentoo    2009     44

Long to wide

Going from a wider format to a longer format is the more common transformation, but you may find yourself in the situation of wanting to use a wider format to present your data in a table. We can reverse what we did with pivot_longer() using pivot_wider(). The key arguments mirror those for pivot_longer():

  • names_from: Column used to create new column names. Does not need to be in quotes because it is a column in the data.
  • values_from: Column used to create cell values. Does not need to be in quotes because it is a column in the data.
penguins_long |> 
  pivot_wider(
    names_from = year,
    values_from = count)
# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <fct>      <int>  <int>  <int>
1 Adelie        50     50     52
2 Chinstrap     26     18     24
3 Gentoo        34     46     44

Resources

These examples provide a short glimpse of the ways that you can manipulate the structure of data frames with tidyr. For more information see:

Footnotes

  1. Hadley Wickham, “Tidy Data,” Journal of Statistical Software 59, no. 10 (2014), https://doi.org/10.18637/jss.v059.i10.↩︎

  2. This is the base R pipe that was introduced in R 4.1 in 2021. You may also see the magrittr pipe (%>%). Their functionality is essentially the same. The base R pipe is created using the pipe character (Shift + backslash) and the greater than sign. You can quickly type it out with the keyboard shortcut Cmd/Ctrl + Shift + M. To setup RStudio to use the base pipe with this keyboard shortcut, see Setting up RStudio for success.↩︎