9  Tidy data

Published

March 21, 2023

Modified

January 8, 2024

This vignette is a code heavy version of the Tidy Data paper (Wickham 2014) using data and code from tidyr. It defines tidy data, providing a nice clear example from some toy classroom data and then shows five types of messy data and how they might be tidied.

9.1 Tidy data: Definition

“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.”

  1. Every column is a variable.
  2. Every row is an observation.
  3. Every cell is a single value.

9.2 The structure and semantics of data

Structure of rows and columns does little to help us define whether data is tidy or not and how they relate to each other. This can be seen in representing some typical data from a classroom in two different ways that are typical for how such data might be entered.

classroom <- tribble(
  ~name,    ~quiz1, ~quiz2, ~test1,
  "Billy",  NA,     "D",    "C",
  "Suzy",   "F",    NA,     NA,
  "Lionel", "B",    "C",    "B",
  "Jenny",  "A",    "A",    "B"
  )
classroom
#> # A tibble: 4 × 4
#>   name   quiz1 quiz2 test1
#>   <chr>  <chr> <chr> <chr>
#> 1 Billy  <NA>  D     C    
#> 2 Suzy   F     <NA>  <NA> 
#> 3 Lionel B     C     B    
#> 4 Jenny  A     A     B

tribble(
  ~assessment, ~Billy, ~Suzy, ~Lionel, ~Jenny,
  "quiz1",     NA,     "F",   "B",     "A",
  "quiz2",     "D",    NA,    "C",     "A",
  "test1",     "C",    NA,    "B",     "B"
  )
#> # A tibble: 3 × 5
#>   assessment Billy Suzy  Lionel Jenny
#>   <chr>      <chr> <chr> <chr>  <chr>
#> 1 quiz1      <NA>  F     B      A    
#> 2 quiz2      D     <NA>  C      A    
#> 3 test1      C     <NA>  B      B

To better define tidy data we can break data into values organized into observations and variables.

  • Values: Usually either numbers (if quantitative) or strings (if qualitative).
  • Variable: contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
  • Observation: contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

The consequence of this definition of tidy data can be seen by making the classroom data tidy.

classroom |> 
  pivot_longer(quiz1:test1, names_to = "assessment", values_to = "grade") |> 
  arrange(name, assessment)
#> # A tibble: 12 × 3
#>    name   assessment grade
#>    <chr>  <chr>      <chr>
#>  1 Billy  quiz1      <NA> 
#>  2 Billy  quiz2      D    
#>  3 Billy  test1      C    
#>  4 Jenny  quiz1      A    
#>  5 Jenny  quiz2      A    
#>  6 Jenny  test1      B    
#>  7 Lionel quiz1      B    
#>  8 Lionel quiz2      C    
#>  9 Lionel test1      B    
#> 10 Suzy   quiz1      F    
#> 11 Suzy   quiz2      <NA> 
#> 12 Suzy   test1      <NA>

This structure of data is more repetitive, but it is easier to compute on and break down into variables and observations:

  • Variables: name, type of assessment, and grade
  • Observations: The grade given to each student for each test.

It is often trivial to define variables and observations in a given context, but it is difficult to do so in general. Whether a value should be a different onservation or a different variable is often dependent upon the context of the data.

“Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.”

9.3 Tidying messy datasets

Five types of messy data:

  1. Column headers are values, not variable names
  2. Multiple variables stored in one column
  3. Variables are stored in both rows and columns
  4. Multiple types in one table
  5. One type in multiple tables

9.3.1 Column headers are values, not variable names

Having values in column headers can be a nice way to present data and can be efficient for computing if you can perform matrix expressions on the data.

relig_income
#> # A tibble: 18 × 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#>  1 Agnostic      27        34        60        81        76       137        122
#>  2 Atheist       12        27        37        52        35        70         73
#>  3 Buddhist      27        21        30        34        33        58         62
#>  4 Catholic     418       617       732       670       638      1116        949
#>  5 Don’t k…      15        14        15        11        10        35         21
#>  6 Evangel…     575       869      1064       982       881      1486        949
#>  7 Hindu          1         9         7         9        11        34         47
#>  8 Histori…     228       244       236       238       197       223        131
#>  9 Jehovah…      20        27        24        24        21        30         15
#> 10 Jewish        19        19        25        25        30        95         69
#> 11 Mainlin…     289       495       619       655       651      1107        939
#> 12 Mormon        29        40        48        51        56       112         85
#> 13 Muslim         6         7         9        10         9        23         16
#> 14 Orthodox      13        17        23        32        32        47         38
#> 15 Other C…       9         7        11        13        13        14         18
#> 16 Other F…      20        33        40        46        49        63         46
#> 17 Other W…       5         2         3         4         2         7          3
#> 18 Unaffil…     217       299       374       365       341       528        407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> #   `Don't know/refused` <dbl>

The structure of the data frame is 11 columns that are actually variables, 18 religious groups, and the frequencies of the two other variables. To tidy the data the data frame needs to be pivoted to a longer format in which the column names into a key-value pairs.

relig_income |> 
  pivot_longer(-religion, names_to = "income", values_to = "frequency")
#> # A tibble: 180 × 3
#>    religion income             frequency
#>    <chr>    <chr>                  <dbl>
#>  1 Agnostic <$10k                     27
#>  2 Agnostic $10-20k                   34
#>  3 Agnostic $20-30k                   60
#>  4 Agnostic $30-40k                   81
#>  5 Agnostic $40-50k                   76
#>  6 Agnostic $50-75k                  137
#>  7 Agnostic $75-100k                 122
#>  8 Agnostic $100-150k                109
#>  9 Agnostic >150k                     84
#> 10 Agnostic Don't know/refused        96
#> # ℹ 170 more rows

A good example of this kind of data is found in billboard, which records the ranking of songs after they crack the Billboard top 100 for 75 weeks. This form is not tidy, but it is good for data entry, as it reduces duplication of data. Here, there is a bit of extra tidying to do to make the variables more conducive to the tidy structure.

billboard
#> # A tibble: 317 × 79
#>    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
#>    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
#>  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
#>  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
#>  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
#>  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
#>  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
#>  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
#>  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
#>  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
#> 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
#> # ℹ 307 more rows
#> # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#> #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#> #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#> #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#> #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#> #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

billboard_long <- billboard |> 
  pivot_longer(
    wk1:wk76, 
    names_to = "week", 
    values_to = "rank", 
    values_drop_na = TRUE
  ) |> 
  mutate(
    week = as.integer(gsub("wk", "", week)),
    date = as.Date(date.entered) + 7 * (week - 1),
    date.entered = NULL
  )
billboard_long 
#> # A tibble: 5,307 × 5
#>    artist  track                    week  rank date      
#>    <chr>   <chr>                   <int> <dbl> <date>    
#>  1 2 Pac   Baby Don't Cry (Keep...     1    87 2000-02-26
#>  2 2 Pac   Baby Don't Cry (Keep...     2    82 2000-03-04
#>  3 2 Pac   Baby Don't Cry (Keep...     3    72 2000-03-11
#>  4 2 Pac   Baby Don't Cry (Keep...     4    77 2000-03-18
#>  5 2 Pac   Baby Don't Cry (Keep...     5    87 2000-03-25
#>  6 2 Pac   Baby Don't Cry (Keep...     6    94 2000-04-01
#>  7 2 Pac   Baby Don't Cry (Keep...     7    99 2000-04-08
#>  8 2Ge+her The Hardest Part Of ...     1    91 2000-09-02
#>  9 2Ge+her The Hardest Part Of ...     2    87 2000-09-09
#> 10 2Ge+her The Hardest Part Of ...     3    92 2000-09-16
#> # ℹ 5,297 more rows

9.3.2 Multiple variables stored in one column

Sometimes there are more than one variable stored in each column. This is the case in the tb (tuberculosis) that comes from the World Health Organization. It shows tuberculosis cases by country, year, and demographic group that includes both gender m or f and age group.

tb <- read_csv("data/tb.csv")
#> Rows: 5769 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): iso2
#> dbl (21): year, m04, m514, m014, m1524, m2534, m3544, m4554, m5564, m65, mu,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb
#> # A tibble: 5,769 × 22
#>    iso2   year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04
#>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 AD     1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  2 AD     1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  3 AD     1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  4 AD     1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  5 AD     1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  6 AD     1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  7 AD     1996    NA    NA     0     0     0     4     1     0     0    NA    NA
#>  8 AD     1997    NA    NA     0     0     1     2     2     1     6    NA    NA
#>  9 AD     1998    NA    NA     0     0     0     1     0     0     0    NA    NA
#> 10 AD     1999    NA    NA     0     0     0     1     1     0     0    NA    NA
#> # ℹ 5,759 more rows
#> # ℹ 9 more variables: f514 <dbl>, f014 <dbl>, f1524 <dbl>, f2534 <dbl>,
#> #   f3544 <dbl>, f4554 <dbl>, f5564 <dbl>, f65 <dbl>, fu <dbl>

The transformation to pivot longer and split the header variable can be done in a single step by supplying multiple column names to names_to and also supplying a grouped regular expression to names_pattern to split the variable after the first character.

tb %>% pivot_longer(
  !c(iso2, year), 
  names_to = c("sex", "age"), 
  names_pattern = "(.)(.+)",
  values_to = "n", 
  values_drop_na = TRUE
)
#> # A tibble: 35,750 × 5
#>    iso2   year sex   age       n
#>    <chr> <dbl> <chr> <chr> <dbl>
#>  1 AD     1996 m     014       0
#>  2 AD     1996 m     1524      0
#>  3 AD     1996 m     2534      0
#>  4 AD     1996 m     3544      4
#>  5 AD     1996 m     4554      1
#>  6 AD     1996 m     5564      0
#>  7 AD     1996 m     65        0
#>  8 AD     1996 f     014       0
#>  9 AD     1996 f     1524      1
#> 10 AD     1996 f     2534      1
#> # ℹ 35,740 more rows

9.3.3 Variables are stored in both rows and columns

A complex form of messy data is when variables are stored in both rows and columns. This occurs in weather data recording max and min temperatures in day columns.

weather <- read_csv("data/weather.csv")
#> Rows: 22 Columns: 35
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (2): id, element
#> dbl (25): year, month, d1, d2, d3, d4, d5, d6, d7, d8, d10, d11, d13, d14, d...
#> lgl  (8): d9, d12, d18, d19, d20, d21, d22, d24
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather
#> # A tibble: 22 × 35
#>    id       year month element    d1    d2    d3    d4    d5    d6    d7    d8
#>    <chr>   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 MX17004  2010     1 tmax       NA  NA    NA      NA  NA      NA    NA    NA
#>  2 MX17004  2010     1 tmin       NA  NA    NA      NA  NA      NA    NA    NA
#>  3 MX17004  2010     2 tmax       NA  27.3  24.1    NA  NA      NA    NA    NA
#>  4 MX17004  2010     2 tmin       NA  14.4  14.4    NA  NA      NA    NA    NA
#>  5 MX17004  2010     3 tmax       NA  NA    NA      NA  32.1    NA    NA    NA
#>  6 MX17004  2010     3 tmin       NA  NA    NA      NA  14.2    NA    NA    NA
#>  7 MX17004  2010     4 tmax       NA  NA    NA      NA  NA      NA    NA    NA
#>  8 MX17004  2010     4 tmin       NA  NA    NA      NA  NA      NA    NA    NA
#>  9 MX17004  2010     5 tmax       NA  NA    NA      NA  NA      NA    NA    NA
#> 10 MX17004  2010     5 tmin       NA  NA    NA      NA  NA      NA    NA    NA
#> # ℹ 12 more rows
#> # ℹ 23 more variables: d9 <lgl>, d10 <dbl>, d11 <dbl>, d12 <lgl>, d13 <dbl>,
#> #   d14 <dbl>, d15 <dbl>, d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>,
#> #   d20 <lgl>, d21 <lgl>, d22 <lgl>, d23 <dbl>, d24 <lgl>, d25 <dbl>,
#> #   d26 <dbl>, d27 <dbl>, d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>

To deal with this dual nature of the data we need to pivot the day variables longer to get the temperature readings in a column and then pivot the element column wider to split the temperature readings between min and max.

# Pivot-longer: day columns
weather_longer <- weather |> 
  pivot_longer(
    d1:d31, 
    names_to = "day", 
    values_to = "value", 
    values_drop_na = TRUE
  ) |> 
  mutate(
    date = ymd(paste(year, month, gsub("d", "", day), sep = "-"))
  ) |> 
  select(id, date, element, value) |> 
  arrange(date)
weather_longer
#> # A tibble: 66 × 4
#>    id      date       element value
#>    <chr>   <date>     <chr>   <dbl>
#>  1 MX17004 2010-01-30 tmax     27.8
#>  2 MX17004 2010-01-30 tmin     14.5
#>  3 MX17004 2010-02-02 tmax     27.3
#>  4 MX17004 2010-02-02 tmin     14.4
#>  5 MX17004 2010-02-03 tmax     24.1
#>  6 MX17004 2010-02-03 tmin     14.4
#>  7 MX17004 2010-02-11 tmax     29.7
#>  8 MX17004 2010-02-11 tmin     13.4
#>  9 MX17004 2010-02-23 tmax     29.9
#> 10 MX17004 2010-02-23 tmin     10.7
#> # ℹ 56 more rows

# Pivot-wider: element column
weather_longer |> 
  pivot_wider(names_from = element, values_from = value)
#> # A tibble: 33 × 4
#>    id      date        tmax  tmin
#>    <chr>   <date>     <dbl> <dbl>
#>  1 MX17004 2010-01-30  27.8  14.5
#>  2 MX17004 2010-02-02  27.3  14.4
#>  3 MX17004 2010-02-03  24.1  14.4
#>  4 MX17004 2010-02-11  29.7  13.4
#>  5 MX17004 2010-02-23  29.9  10.7
#>  6 MX17004 2010-03-05  32.1  14.2
#>  7 MX17004 2010-03-10  34.5  16.8
#>  8 MX17004 2010-03-16  31.1  17.6
#>  9 MX17004 2010-04-27  36.3  16.7
#> 10 MX17004 2010-05-27  33.2  18.2
#> # ℹ 23 more rows

9.3.4 Multiple types in one table

This occurs when multiple observational units are stored on one table instead of in multiple tables. One simple example of this is in the billboard data where there is information about both the song and its ranking on the billboard list. This leads to duplication of information about the artist in this case. It would be better to divide this information into two tables linked through a song id.

# Song table
song <- billboard_long |> 
  distinct(artist, track) |>
  mutate(song_id = row_number())
song
#> # A tibble: 317 × 3
#>    artist         track                   song_id
#>    <chr>          <chr>                     <int>
#>  1 2 Pac          Baby Don't Cry (Keep...       1
#>  2 2Ge+her        The Hardest Part Of ...       2
#>  3 3 Doors Down   Kryptonite                    3
#>  4 3 Doors Down   Loser                         4
#>  5 504 Boyz       Wobble Wobble                 5
#>  6 98^0           Give Me Just One Nig...       6
#>  7 A*Teens        Dancing Queen                 7
#>  8 Aaliyah        I Don't Wanna                 8
#>  9 Aaliyah        Try Again                     9
#> 10 Adams, Yolanda Open My Heart                10
#> # ℹ 307 more rows

# Rank table
rank <- billboard_long |> 
  left_join(song, c("artist", "track")) |>
  select(song_id, date, week, rank)
rank
#> # A tibble: 5,307 × 4
#>    song_id date        week  rank
#>      <int> <date>     <int> <dbl>
#>  1       1 2000-02-26     1    87
#>  2       1 2000-03-04     2    82
#>  3       1 2000-03-11     3    72
#>  4       1 2000-03-18     4    77
#>  5       1 2000-03-25     5    87
#>  6       1 2000-04-01     6    94
#>  7       1 2000-04-08     7    99
#>  8       2 2000-09-02     1    91
#>  9       2 2000-09-09     2    87
#> 10       2 2000-09-16     3    92
#> # ℹ 5,297 more rows

9.3.5 One type in multiple tables

Sometimes an observational unit is spread among many tables that might be split by a variable such as person, year, location, etc. The fix for this is to bring the tables together and tidy:

  1. Read the files into a list of tables.
  2. For each table, add a new column that records the original file name (the file name is often the value of an important variable).
  3. Combine all tables into a single table.

This can be done with purrr

paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)
map_dfr(paths, read.csv, stringsAsFactors = FALSE, .id = "filename")