10  tidyr: Pivoting

Published

March 31, 2023

Modified

January 8, 2024

This vignette describes the usage of pivot_longer() and pivot_wider() as tidying methods that supersede gather() and spread(). The two new functions are meant to be more intuitive than gather() and spread(). They can work with multiple value variables that may have different types and can use specs to model the shape of the outcome.

10.1 Longer

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

pivot_longer() arguments:

  • data
  • cols: Columns to pivot into longer format.
  • names_to: The new column or columns to create from the information stored in the column names of data specified by cols.
  • values_to: The name of the column to create from the data stored in cell values.
  • names_prefix: A regular expression used to remove matching text from the start of each variable name.
  • names_transform: Either a single function applied to all columns or a list of column name-function pairs.
  • names_sep, names_pattern: Control how the column name is broken up when multiple names are passed to names_to.
    • names_sep: Either a numeric vector of where to split or regular expression.
    • names_pattern: A regular expression containing matching groups.

10.1.1 String data in column names

Data: relig_income

relig_income
#> # A tibble: 18 × 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#>  1 Agnostic      27        34        60        81        76       137        122
#>  2 Atheist       12        27        37        52        35        70         73
#>  3 Buddhist      27        21        30        34        33        58         62
#>  4 Catholic     418       617       732       670       638      1116        949
#>  5 Don’t k…      15        14        15        11        10        35         21
#>  6 Evangel…     575       869      1064       982       881      1486        949
#>  7 Hindu          1         9         7         9        11        34         47
#>  8 Histori…     228       244       236       238       197       223        131
#>  9 Jehovah…      20        27        24        24        21        30         15
#> 10 Jewish        19        19        25        25        30        95         69
#> 11 Mainlin…     289       495       619       655       651      1107        939
#> 12 Mormon        29        40        48        51        56       112         85
#> 13 Muslim         6         7         9        10         9        23         16
#> 14 Orthodox      13        17        23        32        32        47         38
#> 15 Other C…       9         7        11        13        13        14         18
#> 16 Other F…      20        33        40        46        49        63         46
#> 17 Other W…       5         2         3         4         2         7          3
#> 18 Unaffil…     217       299       374       365       341       528        407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> #   `Don't know/refused` <dbl>

# Pivot all columns except religion
relig_income |> 
  pivot_longer(
    cols = !religion, 
    names_to = "income", 
    values_to = "count"
  )
#> # A tibble: 180 × 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # ℹ 170 more rows

10.1.2 Numeric data in column names

Data: billboard

Use of arguments names_prefix to strip away prefix in column names of wk and names_transform to convert character to integer.

billboard
#> # A tibble: 317 × 79
#>    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
#>    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
#>  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
#>  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
#>  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
#>  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
#>  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
#>  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
#>  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
#>  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
#> 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
#> # ℹ 307 more rows
#> # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#> #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#> #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#> #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#> #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#> #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

billboard |>
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    names_prefix = "wk",
    names_transform = as.integer,
    values_to = "rank",
    values_drop_na = TRUE,
  )
#> # A tibble: 5,307 × 5
#>    artist  track                   date.entered  week  rank
#>    <chr>   <chr>                   <date>       <int> <dbl>
#>  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
#>  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
#>  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
#>  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
#>  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
#>  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
#>  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
#>  8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
#>  9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
#> 10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92
#> # ℹ 5,297 more rows

10.1.3 Many variables in column names

Data who

columns from new_sp_m014 to newrel_f65 encode four variables in their names:

  • The new_/new prefix indicates these are counts of new cases.
  • sp/rel/ep describe how the case was diagnosed.
  • m/f gives the gender.
  • 014/1524/2535/3544/4554/65 supplies the age range.

To deal with this:

  1. Specifying multiple column names in names_to.
  2. Separate column values with either names_sep or names_pattern.
  3. Use of readr functions to convert the gender and age to factors. This is faster than using mutate() afterwards.
who
#> # A tibble: 7,240 × 60
#>    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
#>    <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
#>  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
#>  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
#>  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
#>  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
#>  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
#>  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
#>  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
#>  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
#>  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
#> 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
#> # ℹ 7,230 more rows
#> # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#> #   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#> #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#> #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#> #   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#> #   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …

who |>
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"), 
    names_pattern = "new_?(.*)_(.)(.*)",
    names_transform = list(
      gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
      age = ~ readr::parse_factor(
        .x,
        levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"), 
        ordered = TRUE
      )
    ),
    values_to = "count",
)
#> # A tibble: 405,440 × 8
#>    country     iso2  iso3   year diagnosis gender age   count
#>    <chr>       <chr> <chr> <dbl> <chr>     <fct>  <ord> <dbl>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # ℹ 405,430 more rows

10.1.4 Multiple observations per row

Data: household

Use of ".value" in names_to to indicate that part of the column name specifies the “value” being measured and so should be used as a variable name.

household
#> # A tibble: 5 × 5
#>   family dob_child1 dob_child2 name_child1 name_child2
#>    <int> <date>     <date>     <chr>       <chr>      
#> 1      1 1998-11-26 2000-01-29 Susan       Jose       
#> 2      2 1996-06-22 NA         Mark        <NA>       
#> 3      3 2002-07-11 2004-04-05 Sam         Seth       
#> 4      4 2004-10-10 2009-08-27 Craig       Khai       
#> 5      5 2000-12-05 2005-02-28 Parker      Gracie

household |>
  pivot_longer(
    cols = !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = TRUE
  )
#> # A tibble: 9 × 4
#>   family child  dob        name  
#>    <int> <chr>  <date>     <chr> 
#> 1      1 child1 1998-11-26 Susan 
#> 2      1 child2 2000-01-29 Jose  
#> 3      2 child1 1996-06-22 Mark  
#> 4      3 child1 2002-07-11 Sam   
#> 5      3 child2 2004-04-05 Seth  
#> 6      4 child1 2004-10-10 Craig 
#> 7      4 child2 2009-08-27 Khai  
#> 8      5 child1 2000-12-05 Parker
#> 9      5 child2 2005-02-28 Gracie

Data: anscombe

Use of cols_vary = "slowest" to group values from columns x1 and y1 together,.

anscombe
#>    x1 x2 x3 x4    y1   y2    y3    y4
#> 1  10 10 10  8  8.04 9.14  7.46  6.58
#> 2   8  8  8  8  6.95 8.14  6.77  5.76
#> 3  13 13 13  8  7.58 8.74 12.74  7.71
#> 4   9  9  9  8  8.81 8.77  7.11  8.84
#> 5  11 11 11  8  8.33 9.26  7.81  8.47
#> 6  14 14 14  8  9.96 8.10  8.84  7.04
#> 7   6  6  6  8  7.24 6.13  6.08  5.25
#> 8   4  4  4 19  4.26 3.10  5.39 12.50
#> 9  12 12 12  8 10.84 9.13  8.15  5.56
#> 10  7  7  7  8  4.82 7.26  6.42  7.91
#> 11  5  5  5  8  5.68 4.74  5.73  6.89

anscombe |>
  pivot_longer(
    cols = everything(), 
    cols_vary = "slowest",
    names_to = c(".value", "set"), 
    names_pattern = "(.)(.)"
  )
#> # A tibble: 44 × 3
#>    set       x     y
#>    <chr> <dbl> <dbl>
#>  1 1        10  8.04
#>  2 1         8  6.95
#>  3 1        13  7.58
#>  4 1         9  8.81
#>  5 1        11  8.33
#>  6 1        14  9.96
#>  7 1         6  7.24
#>  8 1         4  4.26
#>  9 1        12 10.8 
#> 10 1         7  4.82
#> # ℹ 34 more rows

10.2 Wider

pivot_wider() makes a dataset wider by increasing the number of columns and decreasing the number of rows.

pivot_wider() arguments:

  • data
  • names_from: Column used to create column names.
  • values_from: Column used to create cell values.
  • values_fill: Default value instead of NA.
  • values_fn: A function applied to the value in each cell in the output.
    • You will typically use this when the combination of id_cols and names_from columns does not uniquely identify an observation.
  • names_prefix: String added to the start of every variable name.
  • names_sep: If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.
  • names_glue: Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.
  • names_expand: Whether to make implicit factor levels explicit.
  • id_cols: A set of columns that uniquely identify each observation.
  • id_expand: Whether to make implicit factor levels explicit in id_cols.
  • unused_fn: A function applied to summarize the values from the unused columns (i.e. columns not identified by id_cols, names_from, or values_from).

10.2.1 Capture-recapture data

Data: fish_encounters

fish_encounters
#> # A tibble: 114 × 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # ℹ 104 more rows

fish_encounters |>
  pivot_wider(
    names_from = station, 
    values_from = seen,
    values_fill = 0
  )
#> # A tibble: 19 × 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1     1
#>  4 4845        1     1      1     1       1     0     0     0     0     0     0
#>  5 4847        1     1      1     0       0     0     0     0     0     0     0
#>  6 4848        1     1      1     1       0     0     0     0     0     0     0
#>  7 4849        1     1      0     0       0     0     0     0     0     0     0
#>  8 4850        1     1      0     1       1     1     1     0     0     0     0
#>  9 4851        1     1      0     0       0     0     0     0     0     0     0
#> 10 4854        1     1      0     0       0     0     0     0     0     0     0
#> 11 4855        1     1      1     1       1     0     0     0     0     0     0
#> 12 4857        1     1      1     1       1     1     1     1     1     0     0
#> 13 4858        1     1      1     1       1     1     1     1     1     1     1
#> 14 4859        1     1      1     1       1     0     0     0     0     0     0
#> 15 4861        1     1      1     1       1     1     1     1     1     1     1
#> 16 4862        1     1      1     1       1     1     1     1     1     0     0
#> 17 4863        1     1      0     0       0     0     0     0     0     0     0
#> 18 4864        1     1      0     0       0     0     0     0     0     0     0
#> 19 4865        1     1      1     0       0     0     0     0     0     0     0

10.2.2 Aggregation

Data: warpbreaks

Use of values_fn to create aggregate when pivot_wider() matches multiple cells.

warpbreaks <- warpbreaks |>
  as_tibble() |>
  select(wool, tension, breaks)
warpbreaks
#> # A tibble: 54 × 3
#>    wool  tension breaks
#>    <fct> <fct>    <dbl>
#>  1 A     L           26
#>  2 A     L           30
#>  3 A     L           54
#>  4 A     L           25
#>  5 A     L           70
#>  6 A     L           52
#>  7 A     L           51
#>  8 A     L           26
#>  9 A     L           67
#> 10 A     M           18
#> # ℹ 44 more rows

# Multiple matches creates list columns
warpbreaks |>
  pivot_wider(
    names_from = wool, 
    values_from = breaks
  )
#> Warning: Values from `breaks` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#>   {data} %>%
#>   dplyr::group_by(tension, wool) %>%
#>   dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
#>   dplyr::filter(n > 1L)
#> # A tibble: 3 × 3
#>   tension A         B        
#>   <fct>   <list>    <list>   
#> 1 L       <dbl [9]> <dbl [9]>
#> 2 M       <dbl [9]> <dbl [9]>
#> 3 H       <dbl [9]> <dbl [9]>

# values_fn to summarise values
warpbreaks |>
  pivot_wider(
    names_from = wool, 
    values_from = breaks,
    values_fn = mean
  )
#> # A tibble: 3 × 3
#>   tension     A     B
#>   <fct>   <dbl> <dbl>
#> 1 L        44.6  28.2
#> 2 M        24    28.8
#> 3 H        24.6  18.8

10.2.3 Generate column name from multiple variables

Data: production

Take tidy data and widen to have one column for each combination of two columns. Do this by specifying multiple variables for names_from.

Control how the column names are constructed with names_sep and names_prefix, or the workhorse names_glue

# Create tidy data
production <- 
  expand_grid(
    product = c("A", "B"), 
    country = c("AI", "EI"), 
    year = 2000:2014
  ) %>%
  filter((product == "A" & country == "AI") | product == "B") %>% 
  mutate(production = rnorm(nrow(.)))
production
#> # A tibble: 45 × 4
#>    product country  year production
#>    <chr>   <chr>   <int>      <dbl>
#>  1 A       AI       2000      0.961
#>  2 A       AI       2001      0.608
#>  3 A       AI       2002      0.991
#>  4 A       AI       2003      0.309
#>  5 A       AI       2004     -0.197
#>  6 A       AI       2005      0.526
#>  7 A       AI       2006     -1.33 
#>  8 A       AI       2007     -0.838
#>  9 A       AI       2008     -1.66 
#> 10 A       AI       2009      1.11 
#> # ℹ 35 more rows

# With names_sep and names_prefix
production |> 
  pivot_wider(
    names_from = c(product, country), 
    values_from = production,
    names_sep = ".",
    names_prefix = "prod."
  )
#> # A tibble: 15 × 4
#>     year prod.A.AI prod.B.AI prod.B.EI
#>    <int>     <dbl>     <dbl>     <dbl>
#>  1  2000     0.961   -0.397     0.778 
#>  2  2001     0.608    1.77     -0.116 
#>  3  2002     0.991    0.222     1.34  
#>  4  2003     0.309   -1.16      0.292 
#>  5  2004    -0.197    0.135    -0.699 
#>  6  2005     0.526   -0.360    -0.126 
#>  7  2006    -1.33     0.217     1.48  
#>  8  2007    -0.838   -0.0331    0.0260
#>  9  2008    -1.66    -1.03      0.806 
#> 10  2009     1.11    -0.195     1.77  
#> 11  2010    -0.659   -1.38     -0.361 
#> 12  2011     1.08    -0.485    -0.591 
#> 13  2012     1.08    -0.680     0.648 
#> 14  2013    -0.674    0.535     0.202 
#> 15  2014    -0.214   -1.20     -1.09

# With names_glue
production |> 
  pivot_wider(
    names_from = c(product, country), 
    values_from = production,
    names_glue = "prod_{product}_{country}"
  )
#> # A tibble: 15 × 4
#>     year prod_A_AI prod_B_AI prod_B_EI
#>    <int>     <dbl>     <dbl>     <dbl>
#>  1  2000     0.961   -0.397     0.778 
#>  2  2001     0.608    1.77     -0.116 
#>  3  2002     0.991    0.222     1.34  
#>  4  2003     0.309   -1.16      0.292 
#>  5  2004    -0.197    0.135    -0.699 
#>  6  2005     0.526   -0.360    -0.126 
#>  7  2006    -1.33     0.217     1.48  
#>  8  2007    -0.838   -0.0331    0.0260
#>  9  2008    -1.66    -1.03      0.806 
#> 10  2009     1.11    -0.195     1.77  
#> 11  2010    -0.659   -1.38     -0.361 
#> 12  2011     1.08    -0.485    -0.591 
#> 13  2012     1.08    -0.680     0.648 
#> 14  2013    -0.674    0.535     0.202 
#> 15  2014    -0.214   -1.20     -1.09

10.2.4 Tidy census

Data: us_rent_income

us_rent_income
#> # A tibble: 104 × 5
#>    GEOID NAME       variable estimate   moe
#>    <chr> <chr>      <chr>       <dbl> <dbl>
#>  1 01    Alabama    income      24476   136
#>  2 01    Alabama    rent          747     3
#>  3 02    Alaska     income      32940   508
#>  4 02    Alaska     rent         1200    13
#>  5 04    Arizona    income      27517   148
#>  6 04    Arizona    rent          972     4
#>  7 05    Arkansas   income      23789   165
#>  8 05    Arkansas   rent          709     5
#>  9 06    California income      29454   109
#> 10 06    California rent         1358     3
#> # ℹ 94 more rows

us_rent_income |> 
  pivot_wider(
    names_from = variable, 
    values_from = c(estimate, moe)
  )
#> # A tibble: 52 × 6
#>    GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
#>    <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
#>  1 01    Alabama                        24476           747        136        3
#>  2 02    Alaska                         32940          1200        508       13
#>  3 04    Arizona                        27517           972        148        4
#>  4 05    Arkansas                       23789           709        165        5
#>  5 06    California                     29454          1358        109        3
#>  6 08    Colorado                       32401          1125        109        5
#>  7 09    Connecticut                    35326          1123        195        5
#>  8 10    Delaware                       31560          1076        247       10
#>  9 11    District of Columbia           43198          1424        681       17
#> 10 12    Florida                        25952          1077         70        3
#> # ℹ 42 more rows

See Section 10.4.2 for a different way to do this with a spec.

10.2.5 Implicit missing values

Data: Factor of weekdays

When you want to turn a column of factor data into column names. pivot_wider() defaults to generating columns from the data that exists, but you might want to include columns from all possible factor levels. Use names_expand to make implicit factor levels explicit.

# Data
weekdays <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
daily <- tibble(
  day = factor(c("Tue", "Thu", "Fri", "Mon"), levels = weekdays),
  value = c(2, 3, 1, 5)
)
daily
#> # A tibble: 4 × 2
#>   day   value
#>   <fct> <dbl>
#> 1 Tue       2
#> 2 Thu       3
#> 3 Fri       1
#> 4 Mon       5

daily |> 
  pivot_wider(
    names_from = day, 
    values_from = value, 
    names_expand = TRUE
  )
#> # A tibble: 1 × 7
#>     Mon   Tue   Wed   Thu   Fri   Sat   Sun
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     5     2    NA     3     1    NA    NA

Data: Factor and year data

If multiple names_from columns are provided, names_expand will generate a Cartesian product of all possible combinations of the names_from values.

# Data
percentages <- tibble(
  year = c(2018, 2019, 2020, 2020),
  type = factor(c("A", "B", "A", "B"), levels = c("A", "B")),
  percentage = c(100, 100, 40, 60)
)
percentages
#> # A tibble: 4 × 3
#>    year type  percentage
#>   <dbl> <fct>      <dbl>
#> 1  2018 A            100
#> 2  2019 B            100
#> 3  2020 A             40
#> 4  2020 B             60

percentages |> 
  pivot_wider(
    names_from = c(type, year),
    values_from = percentage,
    names_expand = TRUE,
    values_fill = 0
  )
#> # A tibble: 1 × 6
#>   A_2018 A_2019 A_2020 B_2018 B_2019 B_2020
#>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1    100      0     40      0    100     60

Data: Missing factor levels in id_cols

Missing rows in factor instead of missing columns as above. Need to use id_expand = TRUE.

daily <- mutate(daily, type = factor(c("A", "B", "B", "A")))
daily
#> # A tibble: 4 × 3
#>   day   value type 
#>   <fct> <dbl> <fct>
#> 1 Tue       2 A    
#> 2 Thu       3 B    
#> 3 Fri       1 B    
#> 4 Mon       5 A

daily |> 
  pivot_wider(
    names_from = type, 
    values_from = value,
    values_fill = 0,
    id_expand = TRUE
  )
#> # A tibble: 7 × 3
#>   day       A     B
#>   <fct> <dbl> <dbl>
#> 1 Mon       5     0
#> 2 Tue       2     0
#> 3 Wed       0     0
#> 4 Thu       0     3
#> 5 Fri       0     1
#> 6 Sat       0     0
#> 7 Sun       0     0

10.2.6 Unused columns

Issue of retaining some data in a column that is not related to the pivoting process. Use of unused_fn to apply a function to unused columns instead of dropping them. You can retain the information by creating a list column.

# Data
updates <- tibble(
  county = c("Wake", "Wake", "Wake", "Guilford", "Guilford"),
  date = c(as.Date("2023-01-01") + 0:2, as.Date("2023-01-03") + 0:1),
  system = c("A", "B", "C", "A", "C"),
  value = c(3.2, 4, 5.5, 2, 1.2)
)
updates
#> # A tibble: 5 × 4
#>   county   date       system value
#>   <chr>    <date>     <chr>  <dbl>
#> 1 Wake     2023-01-01 A        3.2
#> 2 Wake     2023-01-02 B        4  
#> 3 Wake     2023-01-03 C        5.5
#> 4 Guilford 2023-01-03 A        2  
#> 5 Guilford 2023-01-04 C        1.2

# max
updates |> 
  pivot_wider(
    id_cols = county, 
    names_from = system, 
    values_from = value,
    unused_fn = list(date = max)
  )
#> # A tibble: 2 × 5
#>   county       A     B     C date      
#>   <chr>    <dbl> <dbl> <dbl> <date>    
#> 1 Wake       3.2     4   5.5 2023-01-03
#> 2 Guilford   2      NA   1.2 2023-01-04

# list column
updates |> 
  pivot_wider(
    id_cols = county, 
    names_from = system, 
    values_from = value,
    unused_fn = list(date = list)
  )
#> # A tibble: 2 × 5
#>   county       A     B     C date      
#>   <chr>    <dbl> <dbl> <dbl> <list>    
#> 1 Wake       3.2     4   5.5 <date [3]>
#> 2 Guilford   2      NA   1.2 <date [2]>

10.2.7 Contact list

Contact list with variable number of fields per contact. Can use cumsum() to create unique id based on first value for each person (name).

# Data
contacts <- tribble(
  ~field, ~value,
  "name", "Jiena McLellan",
  "company", "Toyota", 
  "name", "John Smith", 
  "company", "google", 
  "email", "john@google.com",
  "name", "Huxley Ratcliffe"
)
# Create unique id
contacts <- contacts |> 
  mutate(person_id = cumsum(field == "name"))
contacts
#> # A tibble: 6 × 3
#>   field   value            person_id
#>   <chr>   <chr>                <int>
#> 1 name    Jiena McLellan           1
#> 2 company Toyota                   1
#> 3 name    John Smith               2
#> 4 company google                   2
#> 5 email   john@google.com          2
#> 6 name    Huxley Ratcliffe         3

contacts |> 
  pivot_wider(
    names_from = field, 
    values_from = value
  )
#> # A tibble: 3 × 4
#>   person_id name             company email          
#>       <int> <chr>            <chr>   <chr>          
#> 1         1 Jiena McLellan   Toyota  <NA>           
#> 2         2 John Smith       google  john@google.com
#> 3         3 Huxley Ratcliffe <NA>    <NA>

10.3 Longer, then wider

10.3.1 World bank

Data: world_bank_pop

Problems:

  1. Year variables are column names: pivot_longer().
  2. indicator column contains multiple variables.
  3. This creates variable column with two variables that should be widened: pivot_wider()
  4. Make new column names lowercase: dplyr::rename_with()
world_bank_pop
#> # A tibble: 1,064 × 20
#>    country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
#>    <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
#>  2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
#>  3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
#>  4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
#>  5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
#>  6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
#>  7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
#>  8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
#>  9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
#> 10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
#> # ℹ 1,054 more rows
#> # ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#> #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#> #   `2016` <dbl>, `2017` <dbl>

# 1. Pivot longer
world_bank_pop |> 
  pivot_longer(
    cols = `2000`:`2017`, 
    names_to = "year", 
    values_to = "value"
  ) |> 
# 2. separate indicator column
  separate(indicator, c(NA, "area", "variable")) |> 
# 3. pivot variable column wider
  pivot_wider(
    names_from = variable, 
    values_from = value,
  ) |> 
# 4. Make new column names lowercase
  rename_with(tolower)
#> # A tibble: 9,576 × 5
#>    country area  year   totl   grow
#>    <chr>   <chr> <chr> <dbl>  <dbl>
#>  1 ABW     URB   2000  41625 1.66  
#>  2 ABW     URB   2001  42025 0.956 
#>  3 ABW     URB   2002  42194 0.401 
#>  4 ABW     URB   2003  42277 0.197 
#>  5 ABW     URB   2004  42317 0.0946
#>  6 ABW     URB   2005  42399 0.194 
#>  7 ABW     URB   2006  42555 0.367 
#>  8 ABW     URB   2007  42729 0.408 
#>  9 ABW     URB   2008  42906 0.413 
#> 10 ABW     URB   2009  43079 0.402 
#> # ℹ 9,566 more rows

10.3.2 Multi-choice

Going from data that has choice number as columns to whether an individual selected that choice or not.

  1. Make data longer, while eliminating NAs to only have choices that were made.
  2. Add logical column with TRUEs.
  3. Pivot data wider and fill in missing observations with FALSE.
# Data
multi <- tribble(
  ~id, ~choice1, ~choice2, ~choice3,
  1, "A", "B", "C",
  2, "C", "B",  NA,
  3, "D",  NA,  NA,
  4, "B", "D",  NA
)
multi
#> # A tibble: 4 × 4
#>      id choice1 choice2 choice3
#>   <dbl> <chr>   <chr>   <chr>  
#> 1     1 A       B       C      
#> 2     2 C       B       <NA>   
#> 3     3 D       <NA>    <NA>   
#> 4     4 B       D       <NA>

# 1. Pivot longer
multi |> 
  pivot_longer(
    cols = !id, 
    values_drop_na = TRUE
  ) |>
# 2. Add col whether choice was made
  mutate(checked = TRUE) |> 
# 3. Pivot wider
  pivot_wider(
    id_cols = id,
    names_from = value, 
    values_from = checked, 
    values_fill = FALSE
  )
#> # A tibble: 4 × 5
#>      id A     B     C     D    
#>   <dbl> <lgl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  TRUE  TRUE  FALSE
#> 2     2 FALSE TRUE  TRUE  FALSE
#> 3     3 FALSE FALSE FALSE TRUE 
#> 4     4 FALSE TRUE  FALSE TRUE

10.4 Manual specs

To gain more control over pivoting, you can instead create a “spec” data frame that describes exactly how data stored in the column names becomes variables (and vice versa). This section focuses on the use of build_longer_spec(), pivot_longer_spec(), build_wider_spec(), and pivot_wider_spec().

A spec data frame created by build_longer_spec() or build_wider_spec() with special columns called .name for the name of the columns and .value for the name of the column that the values in the cells will go into. There is also one column in spec for each column present in the long format of the data that is not present in the wide format of the data. This corresponds to the names_to argument in pivot_longer() and the names_from argument in pivot_wider().

relig_income |> 
  build_longer_spec(
    cols = !religion, 
    names_to = "income",
    values_to = "count"
  )
#> # A tibble: 10 × 3
#>    .name              .value income            
#>    <chr>              <chr>  <chr>             
#>  1 <$10k              count  <$10k             
#>  2 $10-20k            count  $10-20k           
#>  3 $20-30k            count  $20-30k           
#>  4 $30-40k            count  $30-40k           
#>  5 $40-50k            count  $40-50k           
#>  6 $50-75k            count  $50-75k           
#>  7 $75-100k           count  $75-100k          
#>  8 $100-150k          count  $100-150k         
#>  9 >150k              count  >150k             
#> 10 Don't know/refused count  Don't know/refused

10.4.1 Longer

Data: relig_income

With a spec the process for pivoting is two steps. Creating a spec and then pivoting the spec. This does the same as Section 10.1.1.

# Build spec
spec <- relig_income |> 
  build_longer_spec(
    cols = !religion, 
    names_to = "income",
    values_to = "count"
  )
spec
#> # A tibble: 10 × 3
#>    .name              .value income            
#>    <chr>              <chr>  <chr>             
#>  1 <$10k              count  <$10k             
#>  2 $10-20k            count  $10-20k           
#>  3 $20-30k            count  $20-30k           
#>  4 $30-40k            count  $30-40k           
#>  5 $40-50k            count  $40-50k           
#>  6 $50-75k            count  $50-75k           
#>  7 $75-100k           count  $75-100k          
#>  8 $100-150k          count  $100-150k         
#>  9 >150k              count  >150k             
#> 10 Don't know/refused count  Don't know/refused

# Pivot with spec
pivot_longer_spec(relig_income, spec)
#> # A tibble: 180 × 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # ℹ 170 more rows

10.4.2 Wider

Data: us_rent_income

Ability to improve the widening of the data from Section 10.2.4 through the use of a spec. Change the names of the columns from the default of estimate_income, estimate_rent, moe_income, and moe_rent to income, rent, income_moe, and rent_moe.

# Default spec
spec1 <- us_rent_income |> 
  build_wider_spec(
    names_from = variable, 
    values_from = c(estimate, moe)
  )
spec1
#> # A tibble: 4 × 3
#>   .name           .value   variable
#>   <chr>           <chr>    <chr>   
#> 1 estimate_income estimate income  
#> 2 estimate_rent   estimate rent    
#> 3 moe_income      moe      income  
#> 4 moe_rent        moe      rent

# Modify spec with mutate on .names
spec2 <- spec1 |>
  mutate(
    .name = paste0(variable, ifelse(.value == "moe", "_moe", ""))
  )
spec2
#> # A tibble: 4 × 3
#>   .name      .value   variable
#>   <chr>      <chr>    <chr>   
#> 1 income     estimate income  
#> 2 rent       estimate rent    
#> 3 income_moe moe      income  
#> 4 rent_moe   moe      rent

# Pivot with spec
us_rent_income |> 
  pivot_wider_spec(spec2)
#> # A tibble: 52 × 6
#>    GEOID NAME                 income  rent income_moe rent_moe
#>    <chr> <chr>                 <dbl> <dbl>      <dbl>    <dbl>
#>  1 01    Alabama               24476   747        136        3
#>  2 02    Alaska                32940  1200        508       13
#>  3 04    Arizona               27517   972        148        4
#>  4 05    Arkansas              23789   709        165        5
#>  5 06    California            29454  1358        109        3
#>  6 08    Colorado              32401  1125        109        5
#>  7 09    Connecticut           35326  1123        195        5
#>  8 10    Delaware              31560  1076        247       10
#>  9 11    District of Columbia  43198  1424        681       17
#> 10 12    Florida               25952  1077         70        3
#> # ℹ 42 more rows

10.4.3 By hand

Data: construction

The column names in the data are two different kinds of variables: units and region. This can be described by creating a spec by hand.

spec <- tribble(
  ~.name,            ~.value, ~units,  ~region,     
  "1 unit",          "n",     "1",     NA,          
  "2 to 4 units",    "n",     "2-4",   NA,          
  "5 units or more", "n",     "5+",    NA,          
  "Northeast",       "n",     NA,      "Northeast", 
  "Midwest",         "n",     NA,      "Midwest",   
  "South",           "n",     NA,      "South",     
  "West",            "n",     NA,      "West",      
)

construction |> pivot_longer_spec(spec)
#> # A tibble: 63 × 5
#>     Year Month    units region        n
#>    <dbl> <chr>    <chr> <chr>     <dbl>
#>  1  2018 January  1     <NA>        859
#>  2  2018 January  2-4   <NA>         NA
#>  3  2018 January  5+    <NA>        348
#>  4  2018 January  <NA>  Northeast   114
#>  5  2018 January  <NA>  Midwest     169
#>  6  2018 January  <NA>  South       596
#>  7  2018 January  <NA>  West        339
#>  8  2018 February 1     <NA>        882
#>  9  2018 February 2-4   <NA>         NA
#> 10  2018 February 5+    <NA>        400
#> # ℹ 53 more rows

Since there is no overlap between the units and region data, it would actually be better to place them in two separate tables.

10.4.4 Theory

The same spec is used for pivot_longer() and pivot_wider(). You can see this through the symmetry of the two functions.

construction |> 
  pivot_longer_spec(spec) |> 
  pivot_wider_spec(spec)
#> # A tibble: 9 × 9
#>    Year Month  `1 unit` `2 to 4 units` `5 units or more` Northeast Midwest South
#>   <dbl> <chr>     <dbl>          <dbl>             <dbl>     <dbl>   <dbl> <dbl>
#> 1  2018 Janua…      859             NA               348       114     169   596
#> 2  2018 Febru…      882             NA               400       138     160   655
#> 3  2018 March       862             NA               356       150     154   595
#> 4  2018 April       797             NA               447       144     196   613
#> 5  2018 May         875             NA               364        90     169   673
#> 6  2018 June        867             NA               342        76     170   610
#> 7  2018 July        829             NA               360       108     183   594
#> 8  2018 August      939             NA               286        90     205   649
#> 9  2018 Septe…      835             NA               304       117     175   560
#> # ℹ 1 more variable: West <dbl>