9 Tidy data
This vignette is a code heavy version of the Tidy Data paper (Wickham 2014) using data and code from tidyr. It defines tidy data, providing a nice clear example from some toy classroom data and then shows five types of messy data and how they might be tidied.
9.1 Tidy data: Definition
“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.”
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
9.2 The structure and semantics of data
Structure of rows and columns does little to help us define whether data is tidy or not and how they relate to each other. This can be seen in representing some typical data from a classroom in two different ways that are typical for how such data might be entered.
classroom <- tribble(
~name, ~quiz1, ~quiz2, ~test1,
"Billy", NA, "D", "C",
"Suzy", "F", NA, NA,
"Lionel", "B", "C", "B",
"Jenny", "A", "A", "B"
)
classroom
#> # A tibble: 4 × 4
#> name quiz1 quiz2 test1
#> <chr> <chr> <chr> <chr>
#> 1 Billy <NA> D C
#> 2 Suzy F <NA> <NA>
#> 3 Lionel B C B
#> 4 Jenny A A B
tribble(
~assessment, ~Billy, ~Suzy, ~Lionel, ~Jenny,
"quiz1", NA, "F", "B", "A",
"quiz2", "D", NA, "C", "A",
"test1", "C", NA, "B", "B"
)
#> # A tibble: 3 × 5
#> assessment Billy Suzy Lionel Jenny
#> <chr> <chr> <chr> <chr> <chr>
#> 1 quiz1 <NA> F B A
#> 2 quiz2 D <NA> C A
#> 3 test1 C <NA> B B
To better define tidy data we can break data into values organized into observations and variables.
- Values: Usually either numbers (if quantitative) or strings (if qualitative).
- Variable: contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
- Observation: contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
The consequence of this definition of tidy data can be seen by making the classroom
data tidy.
classroom |>
pivot_longer(quiz1:test1, names_to = "assessment", values_to = "grade") |>
arrange(name, assessment)
#> # A tibble: 12 × 3
#> name assessment grade
#> <chr> <chr> <chr>
#> 1 Billy quiz1 <NA>
#> 2 Billy quiz2 D
#> 3 Billy test1 C
#> 4 Jenny quiz1 A
#> 5 Jenny quiz2 A
#> 6 Jenny test1 B
#> 7 Lionel quiz1 B
#> 8 Lionel quiz2 C
#> 9 Lionel test1 B
#> 10 Suzy quiz1 F
#> 11 Suzy quiz2 <NA>
#> 12 Suzy test1 <NA>
This structure of data is more repetitive, but it is easier to compute on and break down into variables and observations:
- Variables:
name
, type ofassessment
, andgrade
- Observations: The grade given to each student for each test.
It is often trivial to define variables and observations in a given context, but it is difficult to do so in general. Whether a value should be a different onservation or a different variable is often dependent upon the context of the data.
“Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.”
9.3 Tidying messy datasets
Five types of messy data:
- Column headers are values, not variable names
- Multiple variables stored in one column
- Variables are stored in both rows and columns
- Multiple types in one table
- One type in multiple tables
9.3.1 Column headers are values, not variable names
Having values in column headers can be a nice way to present data and can be efficient for computing if you can perform matrix expressions on the data.
relig_income
#> # A tibble: 18 × 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122
#> 2 Atheist 12 27 37 52 35 70 73
#> 3 Buddhist 27 21 30 34 33 58 62
#> 4 Catholic 418 617 732 670 638 1116 949
#> 5 Don’t k… 15 14 15 11 10 35 21
#> 6 Evangel… 575 869 1064 982 881 1486 949
#> 7 Hindu 1 9 7 9 11 34 47
#> 8 Histori… 228 244 236 238 197 223 131
#> 9 Jehovah… 20 27 24 24 21 30 15
#> 10 Jewish 19 19 25 25 30 95 69
#> 11 Mainlin… 289 495 619 655 651 1107 939
#> 12 Mormon 29 40 48 51 56 112 85
#> 13 Muslim 6 7 9 10 9 23 16
#> 14 Orthodox 13 17 23 32 32 47 38
#> 15 Other C… 9 7 11 13 13 14 18
#> 16 Other F… 20 33 40 46 49 63 46
#> 17 Other W… 5 2 3 4 2 7 3
#> 18 Unaffil… 217 299 374 365 341 528 407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> # `Don't know/refused` <dbl>
The structure of the data frame is 11 columns that are actually variables, 18 religious groups, and the frequencies of the two other variables. To tidy the data the data frame needs to be pivoted to a longer format in which the column names into a key-value pairs.
relig_income |>
pivot_longer(-religion, names_to = "income", values_to = "frequency")
#> # A tibble: 180 × 3
#> religion income frequency
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # ℹ 170 more rows
A good example of this kind of data is found in billboard
, which records the ranking of songs after they crack the Billboard top 100 for 75 weeks. This form is not tidy, but it is good for data entry, as it reduces duplication of data. Here, there is a bit of extra tidying to do to make the variables more conducive to the tidy structure.
billboard
#> # A tibble: 317 × 79
#> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#> 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#> 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
#> 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#> 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
#> 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
#> 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
#> 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
#> # ℹ 307 more rows
#> # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#> # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#> # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#> # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#> # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#> # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
billboard_long <- billboard |>
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) |>
mutate(
week = as.integer(gsub("wk", "", week)),
date = as.Date(date.entered) + 7 * (week - 1),
date.entered = NULL
)
billboard_long
#> # A tibble: 5,307 × 5
#> artist track week rank date
#> <chr> <chr> <int> <dbl> <date>
#> 1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
#> 2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
#> 3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
#> 4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
#> 5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
#> 6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
#> 7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
#> 8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
#> 9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
#> 10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
#> # ℹ 5,297 more rows
9.3.2 Multiple variables stored in one column
Sometimes there are more than one variable stored in each column. This is the case in the tb
(tuberculosis) that comes from the World Health Organization. It shows tuberculosis cases by country
, year
, and demographic group that includes both gender m
or f
and age
group.
tb <- read_csv("data/tb.csv")
#> Rows: 5769 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): iso2
#> dbl (21): year, m04, m514, m014, m1524, m2534, m3544, m4554, m5564, m65, mu,...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb
#> # A tibble: 5,769 × 22
#> iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA
#> 2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA
#> 3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA
#> 4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA
#> 5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA
#> 6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA
#> 7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA
#> 8 AD 1997 NA NA 0 0 1 2 2 1 6 NA NA
#> 9 AD 1998 NA NA 0 0 0 1 0 0 0 NA NA
#> 10 AD 1999 NA NA 0 0 0 1 1 0 0 NA NA
#> # ℹ 5,759 more rows
#> # ℹ 9 more variables: f514 <dbl>, f014 <dbl>, f1524 <dbl>, f2534 <dbl>,
#> # f3544 <dbl>, f4554 <dbl>, f5564 <dbl>, f65 <dbl>, fu <dbl>
The transformation to pivot longer and split the header variable can be done in a single step by supplying multiple column names to names_to
and also supplying a grouped regular expression to names_pattern
to split the variable after the first character.
tb %>% pivot_longer(
!c(iso2, year),
names_to = c("sex", "age"),
names_pattern = "(.)(.+)",
values_to = "n",
values_drop_na = TRUE
)
#> # A tibble: 35,750 × 5
#> iso2 year sex age n
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 AD 1996 m 014 0
#> 2 AD 1996 m 1524 0
#> 3 AD 1996 m 2534 0
#> 4 AD 1996 m 3544 4
#> 5 AD 1996 m 4554 1
#> 6 AD 1996 m 5564 0
#> 7 AD 1996 m 65 0
#> 8 AD 1996 f 014 0
#> 9 AD 1996 f 1524 1
#> 10 AD 1996 f 2534 1
#> # ℹ 35,740 more rows
9.3.3 Variables are stored in both rows and columns
A complex form of messy data is when variables are stored in both rows and columns. This occurs in weather data recording max and min temperatures in day columns.
weather <- read_csv("data/weather.csv")
#> Rows: 22 Columns: 35
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): id, element
#> dbl (25): year, month, d1, d2, d3, d4, d5, d6, d7, d8, d10, d11, d13, d14, d...
#> lgl (8): d9, d12, d18, d19, d20, d21, d22, d24
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather
#> # A tibble: 22 × 35
#> id year month element d1 d2 d3 d4 d5 d6 d7 d8
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 MX17004 2010 1 tmax NA NA NA NA NA NA NA NA
#> 2 MX17004 2010 1 tmin NA NA NA NA NA NA NA NA
#> 3 MX17004 2010 2 tmax NA 27.3 24.1 NA NA NA NA NA
#> 4 MX17004 2010 2 tmin NA 14.4 14.4 NA NA NA NA NA
#> 5 MX17004 2010 3 tmax NA NA NA NA 32.1 NA NA NA
#> 6 MX17004 2010 3 tmin NA NA NA NA 14.2 NA NA NA
#> 7 MX17004 2010 4 tmax NA NA NA NA NA NA NA NA
#> 8 MX17004 2010 4 tmin NA NA NA NA NA NA NA NA
#> 9 MX17004 2010 5 tmax NA NA NA NA NA NA NA NA
#> 10 MX17004 2010 5 tmin NA NA NA NA NA NA NA NA
#> # ℹ 12 more rows
#> # ℹ 23 more variables: d9 <lgl>, d10 <dbl>, d11 <dbl>, d12 <lgl>, d13 <dbl>,
#> # d14 <dbl>, d15 <dbl>, d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>,
#> # d20 <lgl>, d21 <lgl>, d22 <lgl>, d23 <dbl>, d24 <lgl>, d25 <dbl>,
#> # d26 <dbl>, d27 <dbl>, d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>
To deal with this dual nature of the data we need to pivot the day variables longer to get the temperature readings in a column and then pivot the element column wider to split the temperature readings between min and max.
# Pivot-longer: day columns
weather_longer <- weather |>
pivot_longer(
d1:d31,
names_to = "day",
values_to = "value",
values_drop_na = TRUE
) |>
mutate(
date = ymd(paste(year, month, gsub("d", "", day), sep = "-"))
) |>
select(id, date, element, value) |>
arrange(date)
weather_longer
#> # A tibble: 66 × 4
#> id date element value
#> <chr> <date> <chr> <dbl>
#> 1 MX17004 2010-01-30 tmax 27.8
#> 2 MX17004 2010-01-30 tmin 14.5
#> 3 MX17004 2010-02-02 tmax 27.3
#> 4 MX17004 2010-02-02 tmin 14.4
#> 5 MX17004 2010-02-03 tmax 24.1
#> 6 MX17004 2010-02-03 tmin 14.4
#> 7 MX17004 2010-02-11 tmax 29.7
#> 8 MX17004 2010-02-11 tmin 13.4
#> 9 MX17004 2010-02-23 tmax 29.9
#> 10 MX17004 2010-02-23 tmin 10.7
#> # ℹ 56 more rows
# Pivot-wider: element column
weather_longer |>
pivot_wider(names_from = element, values_from = value)
#> # A tibble: 33 × 4
#> id date tmax tmin
#> <chr> <date> <dbl> <dbl>
#> 1 MX17004 2010-01-30 27.8 14.5
#> 2 MX17004 2010-02-02 27.3 14.4
#> 3 MX17004 2010-02-03 24.1 14.4
#> 4 MX17004 2010-02-11 29.7 13.4
#> 5 MX17004 2010-02-23 29.9 10.7
#> 6 MX17004 2010-03-05 32.1 14.2
#> 7 MX17004 2010-03-10 34.5 16.8
#> 8 MX17004 2010-03-16 31.1 17.6
#> 9 MX17004 2010-04-27 36.3 16.7
#> 10 MX17004 2010-05-27 33.2 18.2
#> # ℹ 23 more rows
9.3.4 Multiple types in one table
This occurs when multiple observational units are stored on one table instead of in multiple tables. One simple example of this is in the billboard
data where there is information about both the song and its ranking on the billboard list. This leads to duplication of information about the artist
in this case. It would be better to divide this information into two tables linked through a song id.
# Song table
song <- billboard_long |>
distinct(artist, track) |>
mutate(song_id = row_number())
song
#> # A tibble: 317 × 3
#> artist track song_id
#> <chr> <chr> <int>
#> 1 2 Pac Baby Don't Cry (Keep... 1
#> 2 2Ge+her The Hardest Part Of ... 2
#> 3 3 Doors Down Kryptonite 3
#> 4 3 Doors Down Loser 4
#> 5 504 Boyz Wobble Wobble 5
#> 6 98^0 Give Me Just One Nig... 6
#> 7 A*Teens Dancing Queen 7
#> 8 Aaliyah I Don't Wanna 8
#> 9 Aaliyah Try Again 9
#> 10 Adams, Yolanda Open My Heart 10
#> # ℹ 307 more rows
# Rank table
rank <- billboard_long |>
left_join(song, c("artist", "track")) |>
select(song_id, date, week, rank)
rank
#> # A tibble: 5,307 × 4
#> song_id date week rank
#> <int> <date> <int> <dbl>
#> 1 1 2000-02-26 1 87
#> 2 1 2000-03-04 2 82
#> 3 1 2000-03-11 3 72
#> 4 1 2000-03-18 4 77
#> 5 1 2000-03-25 5 87
#> 6 1 2000-04-01 6 94
#> 7 1 2000-04-08 7 99
#> 8 2 2000-09-02 1 91
#> 9 2 2000-09-09 2 87
#> 10 2 2000-09-16 3 92
#> # ℹ 5,297 more rows
9.3.5 One type in multiple tables
Sometimes an observational unit is spread among many tables that might be split by a variable such as person, year, location, etc. The fix for this is to bring the tables together and tidy:
- Read the files into a list of tables.
- For each table, add a new column that records the original file name (the file name is often the value of an important variable).
- Combine all tables into a single table.
This can be done with purrr