10 tidyr: Pivoting
This vignette describes the usage of pivot_longer()
and pivot_wider()
as tidying methods that supersede gather()
and spread()
. The two new functions are meant to be more intuitive than gather()
and spread()
. They can work with multiple value variables that may have different types and can use specs to model the shape of the outcome.
10.1 Longer
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.
pivot_longer()
arguments:
data
-
cols
: Columns to pivot into longer format. -
names_to
: The new column or columns to create from the information stored in the column names ofdata
specified bycols
. -
values_to
: The name of the column to create from the data stored in cell values. -
names_prefix
: A regular expression used to remove matching text from the start of each variable name. -
names_transform
: Either a single function applied to all columns or a list of column name-function pairs. -
names_sep
,names_pattern
: Control how the column name is broken up when multiple names are passed tonames_to
.-
names_sep
: Either a numeric vector of where to split or regular expression. -
names_pattern
: A regular expression containing matching groups.
-
10.1.1 String data in column names
Data: relig_income
relig_income
#> # A tibble: 18 × 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122
#> 2 Atheist 12 27 37 52 35 70 73
#> 3 Buddhist 27 21 30 34 33 58 62
#> 4 Catholic 418 617 732 670 638 1116 949
#> 5 Don’t k… 15 14 15 11 10 35 21
#> 6 Evangel… 575 869 1064 982 881 1486 949
#> 7 Hindu 1 9 7 9 11 34 47
#> 8 Histori… 228 244 236 238 197 223 131
#> 9 Jehovah… 20 27 24 24 21 30 15
#> 10 Jewish 19 19 25 25 30 95 69
#> 11 Mainlin… 289 495 619 655 651 1107 939
#> 12 Mormon 29 40 48 51 56 112 85
#> 13 Muslim 6 7 9 10 9 23 16
#> 14 Orthodox 13 17 23 32 32 47 38
#> 15 Other C… 9 7 11 13 13 14 18
#> 16 Other F… 20 33 40 46 49 63 46
#> 17 Other W… 5 2 3 4 2 7 3
#> 18 Unaffil… 217 299 374 365 341 528 407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> # `Don't know/refused` <dbl>
# Pivot all columns except religion
relig_income |>
pivot_longer(
cols = !religion,
names_to = "income",
values_to = "count"
)
#> # A tibble: 180 × 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # ℹ 170 more rows
10.1.2 Numeric data in column names
Data: billboard
Use of arguments names_prefix
to strip away prefix in column names of wk
and names_transform
to convert character to integer.
billboard
#> # A tibble: 317 × 79
#> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#> 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#> 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
#> 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#> 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
#> 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
#> 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
#> 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
#> # ℹ 307 more rows
#> # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#> # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#> # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#> # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#> # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#> # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
names_transform = as.integer,
values_to = "rank",
values_drop_na = TRUE,
)
#> # A tibble: 5,307 × 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <int> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#> 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
#> 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
#> 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
#> 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
#> # ℹ 5,297 more rows
10.1.3 Many variables in column names
Data who
columns from new_sp_m014
to newrel_f65
encode four variables in their names:
- The
new_
/new
prefix indicates these are counts of new cases. -
sp
/rel
/ep
describe how the case was diagnosed. -
m
/f
gives the gender. -
014
/1524
/2535
/3544
/4554
/65
supplies the age range.
To deal with this:
- Specifying multiple column names in
names_to
. - Separate column values with either
names_sep
ornames_pattern
. - Use of
readr
functions to convert the gender and age to factors. This is faster than usingmutate()
afterwards.
who
#> # A tibble: 7,240 × 60
#> country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghani… AF AFG 1980 NA NA NA NA
#> 2 Afghani… AF AFG 1981 NA NA NA NA
#> 3 Afghani… AF AFG 1982 NA NA NA NA
#> 4 Afghani… AF AFG 1983 NA NA NA NA
#> 5 Afghani… AF AFG 1984 NA NA NA NA
#> 6 Afghani… AF AFG 1985 NA NA NA NA
#> 7 Afghani… AF AFG 1986 NA NA NA NA
#> 8 Afghani… AF AFG 1987 NA NA NA NA
#> 9 Afghani… AF AFG 1988 NA NA NA NA
#> 10 Afghani… AF AFG 1989 NA NA NA NA
#> # ℹ 7,230 more rows
#> # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#> # new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#> # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#> # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#> # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#> # new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …
who |>
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_transform = list(
gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
age = ~ readr::parse_factor(
.x,
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)
#> # A tibble: 405,440 × 8
#> country iso2 iso3 year diagnosis gender age count
#> <chr> <chr> <chr> <dbl> <chr> <fct> <ord> <dbl>
#> 1 Afghanistan AF AFG 1980 sp m 014 NA
#> 2 Afghanistan AF AFG 1980 sp m 1524 NA
#> 3 Afghanistan AF AFG 1980 sp m 2534 NA
#> 4 Afghanistan AF AFG 1980 sp m 3544 NA
#> 5 Afghanistan AF AFG 1980 sp m 4554 NA
#> 6 Afghanistan AF AFG 1980 sp m 5564 NA
#> 7 Afghanistan AF AFG 1980 sp m 65 NA
#> 8 Afghanistan AF AFG 1980 sp f 014 NA
#> 9 Afghanistan AF AFG 1980 sp f 1524 NA
#> 10 Afghanistan AF AFG 1980 sp f 2534 NA
#> # ℹ 405,430 more rows
10.1.4 Multiple observations per row
Data: household
Use of ".value"
in names_to
to indicate that part of the column name specifies the “value” being measured and so should be used as a variable name.
household
#> # A tibble: 5 × 5
#> family dob_child1 dob_child2 name_child1 name_child2
#> <int> <date> <date> <chr> <chr>
#> 1 1 1998-11-26 2000-01-29 Susan Jose
#> 2 2 1996-06-22 NA Mark <NA>
#> 3 3 2002-07-11 2004-04-05 Sam Seth
#> 4 4 2004-10-10 2009-08-27 Craig Khai
#> 5 5 2000-12-05 2005-02-28 Parker Gracie
household |>
pivot_longer(
cols = !family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
#> # A tibble: 9 × 4
#> family child dob name
#> <int> <chr> <date> <chr>
#> 1 1 child1 1998-11-26 Susan
#> 2 1 child2 2000-01-29 Jose
#> 3 2 child1 1996-06-22 Mark
#> 4 3 child1 2002-07-11 Sam
#> 5 3 child2 2004-04-05 Seth
#> 6 4 child1 2004-10-10 Craig
#> 7 4 child2 2009-08-27 Khai
#> 8 5 child1 2000-12-05 Parker
#> 9 5 child2 2005-02-28 Gracie
Data: anscombe
Use of cols_vary = "slowest"
to group values from columns x1
and y1
together,.
anscombe
#> x1 x2 x3 x4 y1 y2 y3 y4
#> 1 10 10 10 8 8.04 9.14 7.46 6.58
#> 2 8 8 8 8 6.95 8.14 6.77 5.76
#> 3 13 13 13 8 7.58 8.74 12.74 7.71
#> 4 9 9 9 8 8.81 8.77 7.11 8.84
#> 5 11 11 11 8 8.33 9.26 7.81 8.47
#> 6 14 14 14 8 9.96 8.10 8.84 7.04
#> 7 6 6 6 8 7.24 6.13 6.08 5.25
#> 8 4 4 4 19 4.26 3.10 5.39 12.50
#> 9 12 12 12 8 10.84 9.13 8.15 5.56
#> 10 7 7 7 8 4.82 7.26 6.42 7.91
#> 11 5 5 5 8 5.68 4.74 5.73 6.89
anscombe |>
pivot_longer(
cols = everything(),
cols_vary = "slowest",
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
)
#> # A tibble: 44 × 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 1 8 6.95
#> 3 1 13 7.58
#> 4 1 9 8.81
#> 5 1 11 8.33
#> 6 1 14 9.96
#> 7 1 6 7.24
#> 8 1 4 4.26
#> 9 1 12 10.8
#> 10 1 7 4.82
#> # ℹ 34 more rows
10.2 Wider
pivot_wider()
makes a dataset wider by increasing the number of columns and decreasing the number of rows.
pivot_wider()
arguments:
data
-
names_from
: Column used to create column names. -
values_from
: Column used to create cell values. -
values_fill
: Default value instead ofNA
. -
values_fn
: A function applied to the value in each cell in the output.- You will typically use this when the combination of
id_cols
andnames_from
columns does not uniquely identify an observation.
- You will typically use this when the combination of
-
names_prefix
: String added to the start of every variable name. -
names_sep
: Ifnames_from
orvalues_from
contains multiple variables, this will be used to join their values together into a single string to use as a column name. -
names_glue
: Instead ofnames_sep
andnames_prefix
, you can supply a glue specification that uses thenames_from
columns (and special.value
) to create custom column names. -
names_expand
: Whether to make implicit factor levels explicit. -
id_cols
: A set of columns that uniquely identify each observation. -
id_expand
: Whether to make implicit factor levels explicit inid_cols
. -
unused_fn
: A function applied to summarize the values from the unused columns (i.e. columns not identified byid_cols
,names_from
, orvalues_from
).
10.2.1 Capture-recapture data
Data: fish_encounters
fish_encounters
#> # A tibble: 114 × 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> 7 4842 BCW 1
#> 8 4842 BCE2 1
#> 9 4842 BCW2 1
#> 10 4842 MAE 1
#> # ℹ 104 more rows
fish_encounters |>
pivot_wider(
names_from = station,
values_from = seen,
values_fill = 0
)
#> # A tibble: 19 × 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 0 0 0 0 0 0
#> 5 4847 1 1 1 0 0 0 0 0 0 0 0
#> 6 4848 1 1 1 1 0 0 0 0 0 0 0
#> 7 4849 1 1 0 0 0 0 0 0 0 0 0
#> 8 4850 1 1 0 1 1 1 1 0 0 0 0
#> 9 4851 1 1 0 0 0 0 0 0 0 0 0
#> 10 4854 1 1 0 0 0 0 0 0 0 0 0
#> 11 4855 1 1 1 1 1 0 0 0 0 0 0
#> 12 4857 1 1 1 1 1 1 1 1 1 0 0
#> 13 4858 1 1 1 1 1 1 1 1 1 1 1
#> 14 4859 1 1 1 1 1 0 0 0 0 0 0
#> 15 4861 1 1 1 1 1 1 1 1 1 1 1
#> 16 4862 1 1 1 1 1 1 1 1 1 0 0
#> 17 4863 1 1 0 0 0 0 0 0 0 0 0
#> 18 4864 1 1 0 0 0 0 0 0 0 0 0
#> 19 4865 1 1 1 0 0 0 0 0 0 0 0
10.2.2 Aggregation
Data: warpbreaks
Use of values_fn
to create aggregate when pivot_wider()
matches multiple cells.
warpbreaks <- warpbreaks |>
as_tibble() |>
select(wool, tension, breaks)
warpbreaks
#> # A tibble: 54 × 3
#> wool tension breaks
#> <fct> <fct> <dbl>
#> 1 A L 26
#> 2 A L 30
#> 3 A L 54
#> 4 A L 25
#> 5 A L 70
#> 6 A L 52
#> 7 A L 51
#> 8 A L 26
#> 9 A L 67
#> 10 A M 18
#> # ℹ 44 more rows
# Multiple matches creates list columns
warpbreaks |>
pivot_wider(
names_from = wool,
values_from = breaks
)
#> Warning: Values from `breaks` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} %>%
#> dplyr::group_by(tension, wool) %>%
#> dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
#> dplyr::filter(n > 1L)
#> # A tibble: 3 × 3
#> tension A B
#> <fct> <list> <list>
#> 1 L <dbl [9]> <dbl [9]>
#> 2 M <dbl [9]> <dbl [9]>
#> 3 H <dbl [9]> <dbl [9]>
# values_fn to summarise values
warpbreaks |>
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = mean
)
#> # A tibble: 3 × 3
#> tension A B
#> <fct> <dbl> <dbl>
#> 1 L 44.6 28.2
#> 2 M 24 28.8
#> 3 H 24.6 18.8
10.2.3 Generate column name from multiple variables
Data: production
Take tidy data and widen to have one column for each combination of two columns. Do this by specifying multiple variables for names_from
.
Control how the column names are constructed with names_sep
and names_prefix
, or the workhorse names_glue
# Create tidy data
production <-
expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
filter((product == "A" & country == "AI") | product == "B") %>%
mutate(production = rnorm(nrow(.)))
production
#> # A tibble: 45 × 4
#> product country year production
#> <chr> <chr> <int> <dbl>
#> 1 A AI 2000 0.961
#> 2 A AI 2001 0.608
#> 3 A AI 2002 0.991
#> 4 A AI 2003 0.309
#> 5 A AI 2004 -0.197
#> 6 A AI 2005 0.526
#> 7 A AI 2006 -1.33
#> 8 A AI 2007 -0.838
#> 9 A AI 2008 -1.66
#> 10 A AI 2009 1.11
#> # ℹ 35 more rows
# With names_sep and names_prefix
production |>
pivot_wider(
names_from = c(product, country),
values_from = production,
names_sep = ".",
names_prefix = "prod."
)
#> # A tibble: 15 × 4
#> year prod.A.AI prod.B.AI prod.B.EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 0.961 -0.397 0.778
#> 2 2001 0.608 1.77 -0.116
#> 3 2002 0.991 0.222 1.34
#> 4 2003 0.309 -1.16 0.292
#> 5 2004 -0.197 0.135 -0.699
#> 6 2005 0.526 -0.360 -0.126
#> 7 2006 -1.33 0.217 1.48
#> 8 2007 -0.838 -0.0331 0.0260
#> 9 2008 -1.66 -1.03 0.806
#> 10 2009 1.11 -0.195 1.77
#> 11 2010 -0.659 -1.38 -0.361
#> 12 2011 1.08 -0.485 -0.591
#> 13 2012 1.08 -0.680 0.648
#> 14 2013 -0.674 0.535 0.202
#> 15 2014 -0.214 -1.20 -1.09
# With names_glue
production |>
pivot_wider(
names_from = c(product, country),
values_from = production,
names_glue = "prod_{product}_{country}"
)
#> # A tibble: 15 × 4
#> year prod_A_AI prod_B_AI prod_B_EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 0.961 -0.397 0.778
#> 2 2001 0.608 1.77 -0.116
#> 3 2002 0.991 0.222 1.34
#> 4 2003 0.309 -1.16 0.292
#> 5 2004 -0.197 0.135 -0.699
#> 6 2005 0.526 -0.360 -0.126
#> 7 2006 -1.33 0.217 1.48
#> 8 2007 -0.838 -0.0331 0.0260
#> 9 2008 -1.66 -1.03 0.806
#> 10 2009 1.11 -0.195 1.77
#> 11 2010 -0.659 -1.38 -0.361
#> 12 2011 1.08 -0.485 -0.591
#> 13 2012 1.08 -0.680 0.648
#> 14 2013 -0.674 0.535 0.202
#> 15 2014 -0.214 -1.20 -1.09
10.2.4 Tidy census
Data: us_rent_income
us_rent_income
#> # A tibble: 104 × 5
#> GEOID NAME variable estimate moe
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 01 Alabama income 24476 136
#> 2 01 Alabama rent 747 3
#> 3 02 Alaska income 32940 508
#> 4 02 Alaska rent 1200 13
#> 5 04 Arizona income 27517 148
#> 6 04 Arizona rent 972 4
#> 7 05 Arkansas income 23789 165
#> 8 05 Arkansas rent 709 5
#> 9 06 California income 29454 109
#> 10 06 California rent 1358 3
#> # ℹ 94 more rows
us_rent_income |>
pivot_wider(
names_from = variable,
values_from = c(estimate, moe)
)
#> # A tibble: 52 × 6
#> GEOID NAME estimate_income estimate_rent moe_income moe_rent
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> 7 09 Connecticut 35326 1123 195 5
#> 8 10 Delaware 31560 1076 247 10
#> 9 11 District of Columbia 43198 1424 681 17
#> 10 12 Florida 25952 1077 70 3
#> # ℹ 42 more rows
See Section 10.4.2 for a different way to do this with a spec
.
10.2.5 Implicit missing values
Data: Factor of weekdays
When you want to turn a column of factor data into column names. pivot_wider()
defaults to generating columns from the data that exists, but you might want to include columns from all possible factor levels. Use names_expand
to make implicit factor levels explicit.
# Data
weekdays <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
daily <- tibble(
day = factor(c("Tue", "Thu", "Fri", "Mon"), levels = weekdays),
value = c(2, 3, 1, 5)
)
daily
#> # A tibble: 4 × 2
#> day value
#> <fct> <dbl>
#> 1 Tue 2
#> 2 Thu 3
#> 3 Fri 1
#> 4 Mon 5
daily |>
pivot_wider(
names_from = day,
values_from = value,
names_expand = TRUE
)
#> # A tibble: 1 × 7
#> Mon Tue Wed Thu Fri Sat Sun
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 2 NA 3 1 NA NA
Data: Factor and year data
If multiple names_from
columns are provided, names_expand
will generate a Cartesian product of all possible combinations of the names_from values.
# Data
percentages <- tibble(
year = c(2018, 2019, 2020, 2020),
type = factor(c("A", "B", "A", "B"), levels = c("A", "B")),
percentage = c(100, 100, 40, 60)
)
percentages
#> # A tibble: 4 × 3
#> year type percentage
#> <dbl> <fct> <dbl>
#> 1 2018 A 100
#> 2 2019 B 100
#> 3 2020 A 40
#> 4 2020 B 60
percentages |>
pivot_wider(
names_from = c(type, year),
values_from = percentage,
names_expand = TRUE,
values_fill = 0
)
#> # A tibble: 1 × 6
#> A_2018 A_2019 A_2020 B_2018 B_2019 B_2020
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 100 0 40 0 100 60
Data: Missing factor levels in id_cols
Missing rows in factor instead of missing columns as above. Need to use id_expand = TRUE
.
daily <- mutate(daily, type = factor(c("A", "B", "B", "A")))
daily
#> # A tibble: 4 × 3
#> day value type
#> <fct> <dbl> <fct>
#> 1 Tue 2 A
#> 2 Thu 3 B
#> 3 Fri 1 B
#> 4 Mon 5 A
daily |>
pivot_wider(
names_from = type,
values_from = value,
values_fill = 0,
id_expand = TRUE
)
#> # A tibble: 7 × 3
#> day A B
#> <fct> <dbl> <dbl>
#> 1 Mon 5 0
#> 2 Tue 2 0
#> 3 Wed 0 0
#> 4 Thu 0 3
#> 5 Fri 0 1
#> 6 Sat 0 0
#> 7 Sun 0 0
10.2.6 Unused columns
Issue of retaining some data in a column that is not related to the pivoting process. Use of unused_fn
to apply a function to unused columns instead of dropping them. You can retain the information by creating a list column.
# Data
updates <- tibble(
county = c("Wake", "Wake", "Wake", "Guilford", "Guilford"),
date = c(as.Date("2023-01-01") + 0:2, as.Date("2023-01-03") + 0:1),
system = c("A", "B", "C", "A", "C"),
value = c(3.2, 4, 5.5, 2, 1.2)
)
updates
#> # A tibble: 5 × 4
#> county date system value
#> <chr> <date> <chr> <dbl>
#> 1 Wake 2023-01-01 A 3.2
#> 2 Wake 2023-01-02 B 4
#> 3 Wake 2023-01-03 C 5.5
#> 4 Guilford 2023-01-03 A 2
#> 5 Guilford 2023-01-04 C 1.2
# max
updates |>
pivot_wider(
id_cols = county,
names_from = system,
values_from = value,
unused_fn = list(date = max)
)
#> # A tibble: 2 × 5
#> county A B C date
#> <chr> <dbl> <dbl> <dbl> <date>
#> 1 Wake 3.2 4 5.5 2023-01-03
#> 2 Guilford 2 NA 1.2 2023-01-04
# list column
updates |>
pivot_wider(
id_cols = county,
names_from = system,
values_from = value,
unused_fn = list(date = list)
)
#> # A tibble: 2 × 5
#> county A B C date
#> <chr> <dbl> <dbl> <dbl> <list>
#> 1 Wake 3.2 4 5.5 <date [3]>
#> 2 Guilford 2 NA 1.2 <date [2]>
10.2.7 Contact list
Contact list with variable number of fields per contact. Can use cumsum()
to create unique id based on first value for each person (name
).
# Data
contacts <- tribble(
~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "john@google.com",
"name", "Huxley Ratcliffe"
)
# Create unique id
contacts <- contacts |>
mutate(person_id = cumsum(field == "name"))
contacts
#> # A tibble: 6 × 3
#> field value person_id
#> <chr> <chr> <int>
#> 1 name Jiena McLellan 1
#> 2 company Toyota 1
#> 3 name John Smith 2
#> 4 company google 2
#> 5 email john@google.com 2
#> 6 name Huxley Ratcliffe 3
contacts |>
pivot_wider(
names_from = field,
values_from = value
)
#> # A tibble: 3 × 4
#> person_id name company email
#> <int> <chr> <chr> <chr>
#> 1 1 Jiena McLellan Toyota <NA>
#> 2 2 John Smith google john@google.com
#> 3 3 Huxley Ratcliffe <NA> <NA>
10.3 Longer, then wider
10.3.1 World bank
Data: world_bank_pop
Problems:
- Year variables are column names:
pivot_longer()
. -
indicator
column contains multiple variables. - This creates variable column with two variables that should be widened:
pivot_wider()
- Make new column names lowercase:
dplyr::rename_with()
world_bank_pop
#> # A tibble: 1,064 × 20
#> country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ABW SP.URB.TOTL 4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
#> 2 ABW SP.URB.GROW 1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
#> 3 ABW SP.POP.TOTL 8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
#> 4 ABW SP.POP.GROW 2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
#> 5 AFE SP.URB.TOTL 1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
#> 6 AFE SP.URB.GROW 3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
#> 7 AFE SP.POP.TOTL 4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
#> 8 AFE SP.POP.GROW 2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
#> 9 AFG SP.URB.TOTL 4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
#> 10 AFG SP.URB.GROW 1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
#> # ℹ 1,054 more rows
#> # ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#> # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#> # `2016` <dbl>, `2017` <dbl>
# 1. Pivot longer
world_bank_pop |>
pivot_longer(
cols = `2000`:`2017`,
names_to = "year",
values_to = "value"
) |>
# 2. separate indicator column
separate(indicator, c(NA, "area", "variable")) |>
# 3. pivot variable column wider
pivot_wider(
names_from = variable,
values_from = value,
) |>
# 4. Make new column names lowercase
rename_with(tolower)
#> # A tibble: 9,576 × 5
#> country area year totl grow
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 ABW URB 2000 41625 1.66
#> 2 ABW URB 2001 42025 0.956
#> 3 ABW URB 2002 42194 0.401
#> 4 ABW URB 2003 42277 0.197
#> 5 ABW URB 2004 42317 0.0946
#> 6 ABW URB 2005 42399 0.194
#> 7 ABW URB 2006 42555 0.367
#> 8 ABW URB 2007 42729 0.408
#> 9 ABW URB 2008 42906 0.413
#> 10 ABW URB 2009 43079 0.402
#> # ℹ 9,566 more rows
10.3.2 Multi-choice
Going from data that has choice number as columns to whether an individual selected that choice or not.
- Make data longer, while eliminating
NA
s to only have choices that were made. - Add logical column with
TRUE
s. - Pivot data wider and fill in missing observations with
FALSE
.
# Data
multi <- tribble(
~id, ~choice1, ~choice2, ~choice3,
1, "A", "B", "C",
2, "C", "B", NA,
3, "D", NA, NA,
4, "B", "D", NA
)
multi
#> # A tibble: 4 × 4
#> id choice1 choice2 choice3
#> <dbl> <chr> <chr> <chr>
#> 1 1 A B C
#> 2 2 C B <NA>
#> 3 3 D <NA> <NA>
#> 4 4 B D <NA>
# 1. Pivot longer
multi |>
pivot_longer(
cols = !id,
values_drop_na = TRUE
) |>
# 2. Add col whether choice was made
mutate(checked = TRUE) |>
# 3. Pivot wider
pivot_wider(
id_cols = id,
names_from = value,
values_from = checked,
values_fill = FALSE
)
#> # A tibble: 4 × 5
#> id A B C D
#> <dbl> <lgl> <lgl> <lgl> <lgl>
#> 1 1 TRUE TRUE TRUE FALSE
#> 2 2 FALSE TRUE TRUE FALSE
#> 3 3 FALSE FALSE FALSE TRUE
#> 4 4 FALSE TRUE FALSE TRUE
10.4 Manual specs
To gain more control over pivoting, you can instead create a “spec” data frame that describes exactly how data stored in the column names becomes variables (and vice versa). This section focuses on the use of build_longer_spec()
, pivot_longer_spec()
, build_wider_spec()
, and pivot_wider_spec()
.
A spec
data frame created by build_longer_spec()
or build_wider_spec()
with special columns called .name
for the name of the columns and .value
for the name of the column that the values in the cells will go into. There is also one column in spec for each column present in the long format of the data that is not present in the wide format of the data. This corresponds to the names_to
argument in pivot_longer()
and the names_from
argument in pivot_wider()
.
relig_income |>
build_longer_spec(
cols = !religion,
names_to = "income",
values_to = "count"
)
#> # A tibble: 10 × 3
#> .name .value income
#> <chr> <chr> <chr>
#> 1 <$10k count <$10k
#> 2 $10-20k count $10-20k
#> 3 $20-30k count $20-30k
#> 4 $30-40k count $30-40k
#> 5 $40-50k count $40-50k
#> 6 $50-75k count $50-75k
#> 7 $75-100k count $75-100k
#> 8 $100-150k count $100-150k
#> 9 >150k count >150k
#> 10 Don't know/refused count Don't know/refused
10.4.1 Longer
Data: relig_income
With a spec the process for pivoting is two steps. Creating a spec and then pivoting the spec. This does the same as Section 10.1.1.
# Build spec
spec <- relig_income |>
build_longer_spec(
cols = !religion,
names_to = "income",
values_to = "count"
)
spec
#> # A tibble: 10 × 3
#> .name .value income
#> <chr> <chr> <chr>
#> 1 <$10k count <$10k
#> 2 $10-20k count $10-20k
#> 3 $20-30k count $20-30k
#> 4 $30-40k count $30-40k
#> 5 $40-50k count $40-50k
#> 6 $50-75k count $50-75k
#> 7 $75-100k count $75-100k
#> 8 $100-150k count $100-150k
#> 9 >150k count >150k
#> 10 Don't know/refused count Don't know/refused
# Pivot with spec
pivot_longer_spec(relig_income, spec)
#> # A tibble: 180 × 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # ℹ 170 more rows
10.4.2 Wider
Data: us_rent_income
Ability to improve the widening of the data from Section 10.2.4 through the use of a spec
. Change the names of the columns from the default of estimate_income
, estimate_rent
, moe_income
, and moe_rent
to income
, rent
, income_moe
, and rent_moe
.
# Default spec
spec1 <- us_rent_income |>
build_wider_spec(
names_from = variable,
values_from = c(estimate, moe)
)
spec1
#> # A tibble: 4 × 3
#> .name .value variable
#> <chr> <chr> <chr>
#> 1 estimate_income estimate income
#> 2 estimate_rent estimate rent
#> 3 moe_income moe income
#> 4 moe_rent moe rent
# Modify spec with mutate on .names
spec2 <- spec1 |>
mutate(
.name = paste0(variable, ifelse(.value == "moe", "_moe", ""))
)
spec2
#> # A tibble: 4 × 3
#> .name .value variable
#> <chr> <chr> <chr>
#> 1 income estimate income
#> 2 rent estimate rent
#> 3 income_moe moe income
#> 4 rent_moe moe rent
# Pivot with spec
us_rent_income |>
pivot_wider_spec(spec2)
#> # A tibble: 52 × 6
#> GEOID NAME income rent income_moe rent_moe
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> 7 09 Connecticut 35326 1123 195 5
#> 8 10 Delaware 31560 1076 247 10
#> 9 11 District of Columbia 43198 1424 681 17
#> 10 12 Florida 25952 1077 70 3
#> # ℹ 42 more rows
10.4.3 By hand
Data: construction
The column names in the data are two different kinds of variables: units
and region
. This can be described by creating a spec
by hand.
spec <- tribble(
~.name, ~.value, ~units, ~region,
"1 unit", "n", "1", NA,
"2 to 4 units", "n", "2-4", NA,
"5 units or more", "n", "5+", NA,
"Northeast", "n", NA, "Northeast",
"Midwest", "n", NA, "Midwest",
"South", "n", NA, "South",
"West", "n", NA, "West",
)
construction |> pivot_longer_spec(spec)
#> # A tibble: 63 × 5
#> Year Month units region n
#> <dbl> <chr> <chr> <chr> <dbl>
#> 1 2018 January 1 <NA> 859
#> 2 2018 January 2-4 <NA> NA
#> 3 2018 January 5+ <NA> 348
#> 4 2018 January <NA> Northeast 114
#> 5 2018 January <NA> Midwest 169
#> 6 2018 January <NA> South 596
#> 7 2018 January <NA> West 339
#> 8 2018 February 1 <NA> 882
#> 9 2018 February 2-4 <NA> NA
#> 10 2018 February 5+ <NA> 400
#> # ℹ 53 more rows
Since there is no overlap between the units
and region
data, it would actually be better to place them in two separate tables.
10.4.4 Theory
The same spec
is used for pivot_longer()
and pivot_wider()
. You can see this through the symmetry of the two functions.
construction |>
pivot_longer_spec(spec) |>
pivot_wider_spec(spec)
#> # A tibble: 9 × 9
#> Year Month `1 unit` `2 to 4 units` `5 units or more` Northeast Midwest South
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018 Janua… 859 NA 348 114 169 596
#> 2 2018 Febru… 882 NA 400 138 160 655
#> 3 2018 March 862 NA 356 150 154 595
#> 4 2018 April 797 NA 447 144 196 613
#> 5 2018 May 875 NA 364 90 169 673
#> 6 2018 June 867 NA 342 76 170 610
#> 7 2018 July 829 NA 360 108 183 594
#> 8 2018 August 939 NA 286 90 205 649
#> 9 2018 Septe… 835 NA 304 117 175 560
#> # ℹ 1 more variable: West <dbl>