11 tidyr: Rectandling
This vignette discusses the process of rectangling data: taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns. It uses three main functions:
-
unnest_longer()
: takes each element of a list-column and makes a new row. -
unnest_wider()
: takes each element of a list-column and makes a new column. -
hoist()
is similar tounnest_wider()
but only plucks out selected components and can reach down multiple levels.
A very large number of data rectangling problems can be solved by combining jsonlite::read_json()
with these functions and a splash of dplyr
.
The vignette uses the repurrrsive package, which provides a number of deeply nested lists originally mostly captured from web APIs.
11.1 GitHub users
Data: gh_users
Each user is a named list, where each element represents a column.
users <- tibble(user = gh_users)
users
#> # A tibble: 6 × 1
#> user
#> <list>
#> 1 <named list [30]>
#> 2 <named list [30]>
#> 3 <named list [30]>
#> 4 <named list [30]>
#> 5 <named list [30]>
#> 6 <named list [30]>
# names in list
names(users$user[[1]])
#> [1] "login" "id" "avatar_url"
#> [4] "gravatar_id" "url" "html_url"
#> [7] "followers_url" "following_url" "gists_url"
#> [10] "starred_url" "subscriptions_url" "organizations_url"
#> [13] "repos_url" "events_url" "received_events_url"
#> [16] "type" "site_admin" "name"
#> [19] "company" "blog" "location"
#> [22] "email" "hireable" "bio"
#> [25] "public_repos" "public_gists" "followers"
#> [28] "following" "created_at" "updated_at"
unnest_wider()
: takes every component and makes a new column.
users |> unnest_wider(user)
#> # A tibble: 6 × 30
#> login id avatar_url gravatar_id url html_url followers_url following_url
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 gabo… 6.60e5 https://a… "" http… https:/… https://api.… https://api.…
#> 2 jenn… 5.99e5 https://a… "" http… https:/… https://api.… https://api.…
#> 3 jtle… 1.57e6 https://a… "" http… https:/… https://api.… https://api.…
#> 4 juli… 1.25e7 https://a… "" http… https:/… https://api.… https://api.…
#> 5 leep… 3.51e6 https://a… "" http… https:/… https://api.… https://api.…
#> 6 masa… 8.36e6 https://a… "" http… https:/… https://api.… https://api.…
#> # ℹ 22 more variables: gists_url <chr>, starred_url <chr>,
#> # subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> # events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> # name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> # hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
#> # followers <int>, following <int>, created_at <chr>, updated_at <chr>
hoist()
: pulls out selected components using the same syntax as purrr::pluck()
. It moves components out of the inner list into the top-level data frame as columns.
users |> hoist(user,
followers = "followers",
login = "login",
url = "html_url"
)
#> # A tibble: 6 × 4
#> followers login url user
#> <int> <chr> <chr> <list>
#> 1 303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2 780 jennybc https://github.com/jennybc <named list [27]>
#> 3 3958 jtleek https://github.com/jtleek <named list [27]>
#> 4 115 juliasilge https://github.com/juliasilge <named list [27]>
#> 5 213 leeper https://github.com/leeper <named list [27]>
#> 6 34 masalmon https://github.com/masalmon <named list [27]>
11.2 GitHub repos
Data: gh_repos
gh_repos
is a list of repositories that belong to six different users.
repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 × 1
#> repo
#> <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>
First need to use unnest_longer()
because the first level of the list is a list of observations.
repos <- repos |> unnest_longer(repo)
repos
#> # A tibble: 176 × 1
#> repo
#> <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> 7 <named list [68]>
#> 8 <named list [68]>
#> 9 <named list [68]>
#> 10 <named list [68]>
#> # ℹ 166 more rows
Then we can use unnest_wider()
or hoist()
. This uses login = c("owner", "login")
to reach two levels deep inside of a list.
repos |> hoist(repo,
login = c("owner", "login"),
name = "name",
homepage = "homepage",
watchers = "watchers_count"
)
#> # A tibble: 176 × 5
#> login name homepage watchers repo
#> <chr> <chr> <chr> <int> <list>
#> 1 gaborcsardi after <NA> 5 <named list [65]>
#> 2 gaborcsardi argufy <NA> 19 <named list [65]>
#> 3 gaborcsardi ask <NA> 5 <named list [65]>
#> 4 gaborcsardi baseimports <NA> 0 <named list [65]>
#> 5 gaborcsardi citest <NA> 0 <named list [65]>
#> 6 gaborcsardi clisymbols "" 18 <named list [65]>
#> 7 gaborcsardi cmaker <NA> 0 <named list [65]>
#> 8 gaborcsardi cmark <NA> 0 <named list [65]>
#> 9 gaborcsardi conditions <NA> 0 <named list [65]>
#> 10 gaborcsardi crayon <NA> 52 <named list [65]>
#> # ℹ 166 more rows
11.3 Game of Thrones characters
Data: got_chars
got_chars
has a similar structure to gh_users
: it is a list of named lists, where each element of the inner list describes some attribute of a GoT character.
chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 × 1
#> char
#> <list>
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> 7 <named list [18]>
#> 8 <named list [18]>
#> 9 <named list [18]>
#> 10 <named list [18]>
#> # ℹ 20 more rows
chars <- chars |> unnest_wider(char)
chars
#> # A tibble: 30 × 18
#> url id name gender culture born died alive titles aliases father
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
#> 1 https://w… 1022 Theo… Male "Ironb… "In … "" TRUE <chr> <chr> ""
#> 2 https://w… 1052 Tyri… Male "" "In … "" TRUE <chr> <chr> ""
#> 3 https://w… 1074 Vict… Male "Ironb… "In … "" TRUE <chr> <chr> ""
#> 4 https://w… 1109 Will Male "" "" "In … FALSE <chr> <chr> ""
#> 5 https://w… 1166 Areo… Male "Norvo… "In … "" TRUE <chr> <chr> ""
#> 6 https://w… 1267 Chett Male "" "At … "In … FALSE <chr> <chr> ""
#> 7 https://w… 1295 Cres… Male "" "In … "In … FALSE <chr> <chr> ""
#> 8 https://w… 130 Aria… Female "Dorni… "In … "" TRUE <chr> <chr> ""
#> 9 https://w… 1303 Daen… Female "Valyr… "In … "" TRUE <chr> <chr> ""
#> 10 https://w… 1319 Davo… Male "Weste… "In … "" TRUE <chr> <chr> ""
#> # ℹ 20 more rows
#> # ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
#> # books <list>, povBooks <list>, tvSeries <list>, playedBy <list>
The difference with gh_users
is that many of the resulting columns are themselves list columns. What you do next is an issue of what you want to do in your analysis.
Maybe you want a row for every book and TV series that the character appears in:
chars |>
select(name, books, tvSeries) |>
pivot_longer(c(books, tvSeries),
names_to = "media",
values_to = "value") |>
unnest_longer(value)
#> # A tibble: 179 × 3
#> name media value
#> <chr> <chr> <chr>
#> 1 Theon Greyjoy books A Game of Thrones
#> 2 Theon Greyjoy books A Storm of Swords
#> 3 Theon Greyjoy books A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1
#> 5 Theon Greyjoy tvSeries Season 2
#> 6 Theon Greyjoy tvSeries Season 3
#> 7 Theon Greyjoy tvSeries Season 4
#> 8 Theon Greyjoy tvSeries Season 5
#> 9 Theon Greyjoy tvSeries Season 6
#> 10 Tyrion Lannister books A Feast for Crows
#> # ℹ 169 more rows
Maybe you want to build a table that lets you match title to name:
chars |>
select(name, title = titles) |>
unnest_longer(title)
#> # A tibble: 59 × 2
#> name title
#> <chr> <chr>
#> 1 Theon Greyjoy "Prince of Winterfell"
#> 2 Theon Greyjoy "Lord of the Iron Islands (by law of the green lands)"
#> 3 Tyrion Lannister "Acting Hand of the King (former)"
#> 4 Tyrion Lannister "Master of Coin (former)"
#> 5 Victarion Greyjoy "Lord Captain of the Iron Fleet"
#> 6 Victarion Greyjoy "Master of the Iron Victory"
#> 7 Will ""
#> 8 Areo Hotah "Captain of the Guard at Sunspear"
#> 9 Chett ""
#> 10 Cressen "Maester"
#> # ℹ 49 more rows
11.4 Geocoding with Google
Data: gmaps_cities
gmaps_cities
is a tibble with a list column of lists of various depths. To deal with this, you need to alternate between various unnest_longer()
and unnest_wider()
or hoist()
to get at different types of data.
gmaps_cities
#> # A tibble: 5 × 2
#> city json
#> <chr> <list>
#> 1 Houston <named list [2]>
#> 2 Washington <named list [2]>
#> 3 New York <named list [2]>
#> 4 Chicago <named list [2]>
#> 5 Arlington <named list [2]>
json is a list-column of named lists, so it makes sense to start with unnest_wider()
:
gmaps_cities |>
unnest_wider(json)
#> # A tibble: 5 × 3
#> city results status
#> <chr> <list> <chr>
#> 1 Houston <list [1]> OK
#> 2 Washington <list [2]> OK
#> 3 New York <list [1]> OK
#> 4 Chicago <list [1]> OK
#> 5 Arlington <list [2]> OK
The results
column is a list of lists. Most are length 1, but some are length 2, so we need to use unnest_longer()
to add these observations. These, can then be unnested wider to get at various columns.
repurrrsive::gmaps_cities |>
unnest_wider(json) |>
unnest_longer(results) |>
unnest_wider(results)
#> # A tibble: 7 × 7
#> city address_components formatted_address geometry place_id types status
#> <chr> <list> <chr> <list> <chr> <list> <chr>
#> 1 Hous… <list [4]> Houston, TX, USA <named list> ChIJAYW… <list> OK
#> 2 Wash… <list [2]> Washington, USA <named list> ChIJ-bD… <list> OK
#> 3 Wash… <list [4]> Washington, DC, … <named list> ChIJW-T… <list> OK
#> 4 New … <list [3]> New York, NY, USA <named list> ChIJOwg… <list> OK
#> 5 Chic… <list [4]> Chicago, IL, USA <named list> ChIJ7cv… <list> OK
#> 6 Arli… <list [4]> Arlington, TX, U… <named list> ChIJ05g… <list> OK
#> 7 Arli… <list [4]> Arlington, VA, U… <named list> ChIJD6e… <list> OK
To get at lat/lng data we need to go through location:
gmaps_cities |>
unnest_wider(json) |>
unnest_longer(results) |>
unnest_wider(results) |>
unnest_wider(geometry) |>
unnest_wider(location)
#> # A tibble: 7 × 11
#> city address_components formatted_address bounds lat lng
#> <chr> <list> <chr> <list> <dbl> <dbl>
#> 1 Houston <list [4]> Houston, TX, USA <named list> 29.8 -95.4
#> 2 Washington <list [2]> Washington, USA <named list> 47.8 -121.
#> 3 Washington <list [4]> Washington, DC, USA <named list> 38.9 -77.0
#> 4 New York <list [3]> New York, NY, USA <named list> 40.7 -74.0
#> 5 Chicago <list [4]> Chicago, IL, USA <named list> 41.9 -87.6
#> 6 Arlington <list [4]> Arlington, TX, USA <named list> 32.7 -97.1
#> 7 Arlington <list [4]> Arlington, VA, USA <named list> 38.9 -77.1
#> # ℹ 5 more variables: location_type <chr>, viewport <list>, place_id <chr>,
#> # types <list>, status <chr>
Or use hoist()
to dive deeply to get directly to lat
and lng
:
gmaps_cities |>
hoist(json,
lat = list("results", 1, "geometry", "location", "lat"),
lng = list("results", 1, "geometry", "location", "lng")
)
#> # A tibble: 5 × 4
#> city lat lng json
#> <chr> <dbl> <dbl> <list>
#> 1 Houston 29.8 -95.4 <named list [2]>
#> 2 Washington 47.8 -121. <named list [2]>
#> 3 New York 40.7 -74.0 <named list [2]>
#> 4 Chicago 41.9 -87.6 <named list [2]>
#> 5 Arlington 32.7 -97.1 <named list [2]>
11.5 Sharla Gelfand’s discography
Data: discog
discs <- tibble(disc = discog) |>
unnest_wider(disc) |>
mutate(date_added = as.POSIXct(strptime(date_added, "%Y-%m-%dT%H:%M:%S")))
discs
#> # A tibble: 155 × 5
#> instance_id date_added basic_information id rating
#> <int> <dttm> <list> <int> <int>
#> 1 354823933 2019-02-16 17:48:59 <named list [11]> 7496378 0
#> 2 354092601 2019-02-13 14:13:11 <named list [11]> 4490852 0
#> 3 354091476 2019-02-13 14:07:23 <named list [11]> 9827276 0
#> 4 351244906 2019-02-02 11:39:58 <named list [11]> 9769203 0
#> 5 351244801 2019-02-02 11:39:37 <named list [11]> 7237138 0
#> 6 351052065 2019-02-01 20:40:53 <named list [11]> 13117042 0
#> 7 350315345 2019-01-29 15:48:37 <named list [11]> 7113575 0
#> 8 350315103 2019-01-29 15:47:22 <named list [11]> 10540713 0
#> 9 350314507 2019-01-29 15:44:08 <named list [11]> 11260950 0
#> 10 350314047 2019-01-29 15:41:35 <named list [11]> 11726853 0
#> # ℹ 145 more rows
Need to drop id
column because this information is repeated in the basic_information
list. Can either get rid of id
or use hoist()
.
discs |>
select(!id) |>
unnest_wider(basic_information)
#> # A tibble: 155 × 14
#> instance_id date_added labels year master_url artists id thumb
#> <int> <dttm> <list> <int> <chr> <list> <int> <chr>
#> 1 354823933 2019-02-16 17:48:59 <list> 2015 <NA> <list> 7.50e6 "htt…
#> 2 354092601 2019-02-13 14:13:11 <list> 2013 https://ap… <list> 4.49e6 "htt…
#> 3 354091476 2019-02-13 14:07:23 <list> 2017 https://ap… <list> 9.83e6 "htt…
#> 4 351244906 2019-02-02 11:39:58 <list> 2017 https://ap… <list> 9.77e6 "htt…
#> 5 351244801 2019-02-02 11:39:37 <list> 2015 https://ap… <list> 7.24e6 "htt…
#> 6 351052065 2019-02-01 20:40:53 <list> 2019 https://ap… <list> 1.31e7 "htt…
#> 7 350315345 2019-01-29 15:48:37 <list> 2014 https://ap… <list> 7.11e6 "htt…
#> 8 350315103 2019-01-29 15:47:22 <list> 2015 https://ap… <list> 1.05e7 "htt…
#> 9 350314507 2019-01-29 15:44:08 <list> 2017 https://ap… <list> 1.13e7 ""
#> 10 350314047 2019-01-29 15:41:35 <list> 2017 <NA> <list> 1.17e7 "htt…
#> # ℹ 145 more rows
#> # ℹ 6 more variables: title <chr>, formats <list>, cover_image <chr>,
#> # resource_url <chr>, master_id <int>, rating <int>
# Or hoist
discs |>
hoist(basic_information,
title = "title",
year = "year",
label = list("labels", 1, "name"),
artist = list("artists", 1, "name")
)
#> # A tibble: 155 × 9
#> instance_id date_added title year label artist basic_information
#> <int> <dttm> <chr> <int> <chr> <chr> <list>
#> 1 354823933 2019-02-16 17:48:59 Demo 2015 Tobi… Mollot <named list [9]>
#> 2 354092601 2019-02-13 14:13:11 Observa… 2013 La V… Una B… <named list [9]>
#> 3 354091476 2019-02-13 14:07:23 I 2017 La V… S.H.I… <named list [9]>
#> 4 351244906 2019-02-02 11:39:58 Oído Ab… 2017 La V… Rata … <named list [9]>
#> 5 351244801 2019-02-02 11:39:37 A Cat's… 2015 Kato… Ivy (… <named list [9]>
#> 6 351052065 2019-02-01 20:40:53 Tashme 2019 High… Tashme <named list [9]>
#> 7 350315345 2019-01-29 15:48:37 Demo 2014 Mind… Desgr… <named list [9]>
#> 8 350315103 2019-01-29 15:47:22 Let The… 2015 Not … Phant… <named list [9]>
#> 9 350314507 2019-01-29 15:44:08 Sub Spa… 2017 Not … Sub S… <named list [9]>
#> 10 350314047 2019-01-29 15:41:35 Demo 2017 Pres… Small… <named list [9]>
#> # ℹ 145 more rows
#> # ℹ 2 more variables: id <int>, rating <int>