11  tidyr: Rectandling

Published

April 1, 2023

Modified

January 8, 2024

This vignette discusses the process of rectangling data: taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns. It uses three main functions:

A very large number of data rectangling problems can be solved by combining jsonlite::read_json() with these functions and a splash of dplyr.

The vignette uses the repurrrsive package, which provides a number of deeply nested lists originally mostly captured from web APIs.

11.1 GitHub users

Data: gh_users

Each user is a named list, where each element represents a column.

users <- tibble(user = gh_users)
users
#> # A tibble: 6 × 1
#>   user             
#>   <list>           
#> 1 <named list [30]>
#> 2 <named list [30]>
#> 3 <named list [30]>
#> 4 <named list [30]>
#> 5 <named list [30]>
#> 6 <named list [30]>

# names in list
names(users$user[[1]])
#>  [1] "login"               "id"                  "avatar_url"         
#>  [4] "gravatar_id"         "url"                 "html_url"           
#>  [7] "followers_url"       "following_url"       "gists_url"          
#> [10] "starred_url"         "subscriptions_url"   "organizations_url"  
#> [13] "repos_url"           "events_url"          "received_events_url"
#> [16] "type"                "site_admin"          "name"               
#> [19] "company"             "blog"                "location"           
#> [22] "email"               "hireable"            "bio"                
#> [25] "public_repos"        "public_gists"        "followers"          
#> [28] "following"           "created_at"          "updated_at"

unnest_wider(): takes every component and makes a new column.

users |> unnest_wider(user)
#> # A tibble: 6 × 30
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
#> 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
#> # ℹ 22 more variables: gists_url <chr>, starred_url <chr>,
#> #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> #   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> #   hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
#> #   followers <int>, following <int>, created_at <chr>, updated_at <chr>

hoist(): pulls out selected components using the same syntax as purrr::pluck(). It moves components out of the inner list into the top-level data frame as columns.

users |> hoist(user, 
  followers = "followers", 
  login = "login", 
  url = "html_url"
)
#> # A tibble: 6 × 4
#>   followers login       url                            user             
#>       <int> <chr>       <chr>                          <list>           
#> 1       303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2       780 jennybc     https://github.com/jennybc     <named list [27]>
#> 3      3958 jtleek      https://github.com/jtleek      <named list [27]>
#> 4       115 juliasilge  https://github.com/juliasilge  <named list [27]>
#> 5       213 leeper      https://github.com/leeper      <named list [27]>
#> 6        34 masalmon    https://github.com/masalmon    <named list [27]>

11.2 GitHub repos

Data: gh_repos

gh_repos is a list of repositories that belong to six different users.

repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 × 1
#>   repo       
#>   <list>     
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>

First need to use unnest_longer() because the first level of the list is a list of observations.

repos <- repos |> unnest_longer(repo)
repos
#> # A tibble: 176 × 1
#>    repo             
#>    <list>           
#>  1 <named list [68]>
#>  2 <named list [68]>
#>  3 <named list [68]>
#>  4 <named list [68]>
#>  5 <named list [68]>
#>  6 <named list [68]>
#>  7 <named list [68]>
#>  8 <named list [68]>
#>  9 <named list [68]>
#> 10 <named list [68]>
#> # ℹ 166 more rows

Then we can use unnest_wider() or hoist(). This uses login = c("owner", "login") to reach two levels deep inside of a list.

repos |> hoist(repo, 
  login = c("owner", "login"), 
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
#> # A tibble: 176 × 5
#>    login       name        homepage watchers repo             
#>    <chr>       <chr>       <chr>       <int> <list>           
#>  1 gaborcsardi after       <NA>            5 <named list [65]>
#>  2 gaborcsardi argufy      <NA>           19 <named list [65]>
#>  3 gaborcsardi ask         <NA>            5 <named list [65]>
#>  4 gaborcsardi baseimports <NA>            0 <named list [65]>
#>  5 gaborcsardi citest      <NA>            0 <named list [65]>
#>  6 gaborcsardi clisymbols  ""             18 <named list [65]>
#>  7 gaborcsardi cmaker      <NA>            0 <named list [65]>
#>  8 gaborcsardi cmark       <NA>            0 <named list [65]>
#>  9 gaborcsardi conditions  <NA>            0 <named list [65]>
#> 10 gaborcsardi crayon      <NA>           52 <named list [65]>
#> # ℹ 166 more rows

11.3 Game of Thrones characters

Data: got_chars

got_chars has a similar structure to gh_users: it is a list of named lists, where each element of the inner list describes some attribute of a GoT character.

chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 × 1
#>    char             
#>    <list>           
#>  1 <named list [18]>
#>  2 <named list [18]>
#>  3 <named list [18]>
#>  4 <named list [18]>
#>  5 <named list [18]>
#>  6 <named list [18]>
#>  7 <named list [18]>
#>  8 <named list [18]>
#>  9 <named list [18]>
#> 10 <named list [18]>
#> # ℹ 20 more rows

chars <- chars |> unnest_wider(char)
chars
#> # A tibble: 30 × 18
#>    url           id name  gender culture born  died  alive titles aliases father
#>    <chr>      <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
#>  1 https://w…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
#>  2 https://w…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
#>  3 https://w…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
#>  4 https://w…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
#>  5 https://w…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
#>  6 https://w…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
#>  7 https://w…  1295 Cres… Male   ""      "In … "In … FALSE <chr>  <chr>   ""    
#>  8 https://w…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr>  <chr>   ""    
#>  9 https://w…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr>  <chr>   ""    
#> 10 https://w…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr>  <chr>   ""    
#> # ℹ 20 more rows
#> # ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
#> #   books <list>, povBooks <list>, tvSeries <list>, playedBy <list>

The difference with gh_users is that many of the resulting columns are themselves list columns. What you do next is an issue of what you want to do in your analysis.

Maybe you want a row for every book and TV series that the character appears in:

chars |> 
  select(name, books, tvSeries) |> 
  pivot_longer(c(books, tvSeries),
               names_to = "media",
               values_to = "value") |> 
  unnest_longer(value)
#> # A tibble: 179 × 3
#>    name             media    value            
#>    <chr>            <chr>    <chr>            
#>  1 Theon Greyjoy    books    A Game of Thrones
#>  2 Theon Greyjoy    books    A Storm of Swords
#>  3 Theon Greyjoy    books    A Feast for Crows
#>  4 Theon Greyjoy    tvSeries Season 1         
#>  5 Theon Greyjoy    tvSeries Season 2         
#>  6 Theon Greyjoy    tvSeries Season 3         
#>  7 Theon Greyjoy    tvSeries Season 4         
#>  8 Theon Greyjoy    tvSeries Season 5         
#>  9 Theon Greyjoy    tvSeries Season 6         
#> 10 Tyrion Lannister books    A Feast for Crows
#> # ℹ 169 more rows

Maybe you want to build a table that lets you match title to name:

chars |> 
  select(name, title = titles) |> 
  unnest_longer(title)
#> # A tibble: 59 × 2
#>    name              title                                                 
#>    <chr>             <chr>                                                 
#>  1 Theon Greyjoy     "Prince of Winterfell"                                
#>  2 Theon Greyjoy     "Lord of the Iron Islands (by law of the green lands)"
#>  3 Tyrion Lannister  "Acting Hand of the King (former)"                    
#>  4 Tyrion Lannister  "Master of Coin (former)"                             
#>  5 Victarion Greyjoy "Lord Captain of the Iron Fleet"                      
#>  6 Victarion Greyjoy "Master of the Iron Victory"                          
#>  7 Will              ""                                                    
#>  8 Areo Hotah        "Captain of the Guard at Sunspear"                    
#>  9 Chett             ""                                                    
#> 10 Cressen           "Maester"                                             
#> # ℹ 49 more rows

11.4 Geocoding with Google

Data: gmaps_cities

gmaps_cities is a tibble with a list column of lists of various depths. To deal with this, you need to alternate between various unnest_longer() and unnest_wider() or hoist() to get at different types of data.

gmaps_cities
#> # A tibble: 5 × 2
#>   city       json            
#>   <chr>      <list>          
#> 1 Houston    <named list [2]>
#> 2 Washington <named list [2]>
#> 3 New York   <named list [2]>
#> 4 Chicago    <named list [2]>
#> 5 Arlington  <named list [2]>

json is a list-column of named lists, so it makes sense to start with unnest_wider():

gmaps_cities |>
  unnest_wider(json)
#> # A tibble: 5 × 3
#>   city       results    status
#>   <chr>      <list>     <chr> 
#> 1 Houston    <list [1]> OK    
#> 2 Washington <list [2]> OK    
#> 3 New York   <list [1]> OK    
#> 4 Chicago    <list [1]> OK    
#> 5 Arlington  <list [2]> OK

The results column is a list of lists. Most are length 1, but some are length 2, so we need to use unnest_longer() to add these observations. These, can then be unnested wider to get at various columns.

repurrrsive::gmaps_cities |>
  unnest_wider(json) |> 
  unnest_longer(results) |> 
  unnest_wider(results)
#> # A tibble: 7 × 7
#>   city  address_components formatted_address geometry     place_id types  status
#>   <chr> <list>             <chr>             <list>       <chr>    <list> <chr> 
#> 1 Hous… <list [4]>         Houston, TX, USA  <named list> ChIJAYW… <list> OK    
#> 2 Wash… <list [2]>         Washington, USA   <named list> ChIJ-bD… <list> OK    
#> 3 Wash… <list [4]>         Washington, DC, … <named list> ChIJW-T… <list> OK    
#> 4 New … <list [3]>         New York, NY, USA <named list> ChIJOwg… <list> OK    
#> 5 Chic… <list [4]>         Chicago, IL, USA  <named list> ChIJ7cv… <list> OK    
#> 6 Arli… <list [4]>         Arlington, TX, U… <named list> ChIJ05g… <list> OK    
#> 7 Arli… <list [4]>         Arlington, VA, U… <named list> ChIJD6e… <list> OK

To get at lat/lng data we need to go through location:

gmaps_cities |>
  unnest_wider(json) |>
  unnest_longer(results) |>
  unnest_wider(results) |>
  unnest_wider(geometry) |>
  unnest_wider(location)
#> # A tibble: 7 × 11
#>   city       address_components formatted_address   bounds         lat    lng
#>   <chr>      <list>             <chr>               <list>       <dbl>  <dbl>
#> 1 Houston    <list [4]>         Houston, TX, USA    <named list>  29.8  -95.4
#> 2 Washington <list [2]>         Washington, USA     <named list>  47.8 -121. 
#> 3 Washington <list [4]>         Washington, DC, USA <named list>  38.9  -77.0
#> 4 New York   <list [3]>         New York, NY, USA   <named list>  40.7  -74.0
#> 5 Chicago    <list [4]>         Chicago, IL, USA    <named list>  41.9  -87.6
#> 6 Arlington  <list [4]>         Arlington, TX, USA  <named list>  32.7  -97.1
#> 7 Arlington  <list [4]>         Arlington, VA, USA  <named list>  38.9  -77.1
#> # ℹ 5 more variables: location_type <chr>, viewport <list>, place_id <chr>,
#> #   types <list>, status <chr>

Or use hoist() to dive deeply to get directly to lat and lng:

gmaps_cities |>
  hoist(json,
    lat = list("results", 1, "geometry", "location", "lat"),
    lng = list("results", 1, "geometry", "location", "lng")
  )
#> # A tibble: 5 × 4
#>   city         lat    lng json            
#>   <chr>      <dbl>  <dbl> <list>          
#> 1 Houston     29.8  -95.4 <named list [2]>
#> 2 Washington  47.8 -121.  <named list [2]>
#> 3 New York    40.7  -74.0 <named list [2]>
#> 4 Chicago     41.9  -87.6 <named list [2]>
#> 5 Arlington   32.7  -97.1 <named list [2]>

11.5 Sharla Gelfand’s discography

Data: discog

discs <- tibble(disc = discog) |> 
  unnest_wider(disc) |> 
  mutate(date_added = as.POSIXct(strptime(date_added, "%Y-%m-%dT%H:%M:%S"))) 
discs
#> # A tibble: 155 × 5
#>    instance_id date_added          basic_information       id rating
#>          <int> <dttm>              <list>               <int>  <int>
#>  1   354823933 2019-02-16 17:48:59 <named list [11]>  7496378      0
#>  2   354092601 2019-02-13 14:13:11 <named list [11]>  4490852      0
#>  3   354091476 2019-02-13 14:07:23 <named list [11]>  9827276      0
#>  4   351244906 2019-02-02 11:39:58 <named list [11]>  9769203      0
#>  5   351244801 2019-02-02 11:39:37 <named list [11]>  7237138      0
#>  6   351052065 2019-02-01 20:40:53 <named list [11]> 13117042      0
#>  7   350315345 2019-01-29 15:48:37 <named list [11]>  7113575      0
#>  8   350315103 2019-01-29 15:47:22 <named list [11]> 10540713      0
#>  9   350314507 2019-01-29 15:44:08 <named list [11]> 11260950      0
#> 10   350314047 2019-01-29 15:41:35 <named list [11]> 11726853      0
#> # ℹ 145 more rows

Need to drop id column because this information is repeated in the basic_information list. Can either get rid of id or use hoist().

discs |> 
  select(!id) |> 
  unnest_wider(basic_information)
#> # A tibble: 155 × 14
#>    instance_id date_added          labels  year master_url  artists     id thumb
#>          <int> <dttm>              <list> <int> <chr>       <list>   <int> <chr>
#>  1   354823933 2019-02-16 17:48:59 <list>  2015 <NA>        <list>  7.50e6 "htt…
#>  2   354092601 2019-02-13 14:13:11 <list>  2013 https://ap… <list>  4.49e6 "htt…
#>  3   354091476 2019-02-13 14:07:23 <list>  2017 https://ap… <list>  9.83e6 "htt…
#>  4   351244906 2019-02-02 11:39:58 <list>  2017 https://ap… <list>  9.77e6 "htt…
#>  5   351244801 2019-02-02 11:39:37 <list>  2015 https://ap… <list>  7.24e6 "htt…
#>  6   351052065 2019-02-01 20:40:53 <list>  2019 https://ap… <list>  1.31e7 "htt…
#>  7   350315345 2019-01-29 15:48:37 <list>  2014 https://ap… <list>  7.11e6 "htt…
#>  8   350315103 2019-01-29 15:47:22 <list>  2015 https://ap… <list>  1.05e7 "htt…
#>  9   350314507 2019-01-29 15:44:08 <list>  2017 https://ap… <list>  1.13e7 ""   
#> 10   350314047 2019-01-29 15:41:35 <list>  2017 <NA>        <list>  1.17e7 "htt…
#> # ℹ 145 more rows
#> # ℹ 6 more variables: title <chr>, formats <list>, cover_image <chr>,
#> #   resource_url <chr>, master_id <int>, rating <int>

# Or hoist
discs |> 
  hoist(basic_information,
    title = "title",
    year = "year",
    label = list("labels", 1, "name"),
    artist = list("artists", 1, "name")
  )
#> # A tibble: 155 × 9
#>    instance_id date_added          title     year label artist basic_information
#>          <int> <dttm>              <chr>    <int> <chr> <chr>  <list>           
#>  1   354823933 2019-02-16 17:48:59 Demo      2015 Tobi… Mollot <named list [9]> 
#>  2   354092601 2019-02-13 14:13:11 Observa…  2013 La V… Una B… <named list [9]> 
#>  3   354091476 2019-02-13 14:07:23 I         2017 La V… S.H.I… <named list [9]> 
#>  4   351244906 2019-02-02 11:39:58 Oído Ab…  2017 La V… Rata … <named list [9]> 
#>  5   351244801 2019-02-02 11:39:37 A Cat's…  2015 Kato… Ivy (… <named list [9]> 
#>  6   351052065 2019-02-01 20:40:53 Tashme    2019 High… Tashme <named list [9]> 
#>  7   350315345 2019-01-29 15:48:37 Demo      2014 Mind… Desgr… <named list [9]> 
#>  8   350315103 2019-01-29 15:47:22 Let The…  2015 Not … Phant… <named list [9]> 
#>  9   350314507 2019-01-29 15:44:08 Sub Spa…  2017 Not … Sub S… <named list [9]> 
#> 10   350314047 2019-01-29 15:41:35 Demo      2017 Pres… Small… <named list [9]> 
#> # ℹ 145 more rows
#> # ℹ 2 more variables: id <int>, rating <int>