7  forcats

Published

March 16, 2023

Modified

January 8, 2024

7.1 forcats resources

These notes use both the forcats vignette and the chapter from R for Data Science, but they focus more on the latter, which is more robust. The notes use the gss_cat data from forcats from the General Social Survey.

7.2 Create factors

You can create a factor from a character vector with base factor(), forcats::fct(), or base as.factor(), forcats::as_factor(). fct() is stricter than factor(); it errors if your specification of levels is inconsistent with the values in the character vector. as_factor() creates levels in the order in which they appear, not in alphabetical order by locale as in base R.

fct(c("Dec", "Apr", "Jan", "Mar"), levels = month.abb)
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

You can also create a factor when reading your data with readr::col_factor(). See the readr notes.

library(readr)
csv <- "
month,value
Jan,12
Feb,56
Mar,12"

read_csv(csv, col_types = list(month = col_factor(month.abb)))
#> # A tibble: 3 × 2
#>   month value
#>   <fct> <dbl>
#> 1 Jan      12
#> 2 Feb      56
#> 3 Mar      12

When working with factors, the two most common operations are changing the order of the levels and changing the values of the levels.

7.3 Modifying factor order

Modifying factor order is particularly useful for purposes of visualization so that the geoms are ordered by frequency or a specified value in the data.

Reordering functions:

7.3.1 fct_infreq()

fct_infreq() is particularly useful for bar plots. It orders factor levels by the number of observations of each level, placing the largest first. It can be used with fct_rev() to reverse the order of the levels if you want smallest first.

# fct_infreq
ggplot(gss_cat) +
  geom_bar(aes(y = fct_infreq(marital)))

# fct_rev
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot() +
  geom_bar(aes(y = marital))

7.3.2 fct_reorder()

fct_reorder() reorders factor levels by sorting along another variable. This is only useful when factor levels have an arbitrary order such as alphabetical or first appearance.

relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

# Without reordering
ggplot(relig_summary) +
  geom_point(aes(x = tvhours, y = relig))

# Reorder relig levels by tvhours
ggplot(relig_summary) +
  geom_point(aes(x = tvhours, y = fct_reorder(relig, tvhours)))

7.3.3 fct_reorder2()

fct_reorder2() is useful for when a factor is mapped to a non-position aesthetic such as color. fct_reorder2(f, x, y) reorders the factor f by the y values associated with the largest x values. This makes the colors of the line at the far right of the plot line up with the legend.

by_age <- gss_cat |>
  filter(!is.na(age)) |> 
  count(age, marital) |>
  group_by(age) |>
  mutate(prop = n / sum(n))

ggplot(by_age) +
  geom_line(aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)),
            linewidth = 1) +
  labs(color = "marital") 

7.3.4 fct_relevel()

fct_relevel() manually moves levels. It can be used as a shortcut to move individual levels to the beginning of the order of levels. For instance, we can move “Not applicable” from the rincome variable to the front to be with the other non-answer types.

rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary) +
  geom_point(aes(x = age, y = fct_relevel(rincome, "Not applicable")))

7.4 Modifying factor level

Modifying functions:

7.4.1 fct_recode()

fct_recode() changes factor levels by hand. This is useful when the factor levels are abbreviations or shorthand that you want to change for presentation. fct_recode() will leave the levels that are not explicitly mentioned as is, and will warn you if you accidentally refer to a level that does not exist.

# partyid levels are awkward
gss_cat |> count(partyid)
#> # A tibble: 10 × 2
#>    partyid                n
#>    <fct>              <int>
#>  1 No answer            154
#>  2 Don't know             1
#>  3 Other party          393
#>  4 Strong republican   2314
#>  5 Not str republican  3032
#>  6 Ind,near rep        1791
#>  7 Independent         4119
#>  8 Ind,near dem        2499
#>  9 Not str democrat    3690
#> 10 Strong democrat     3490

# recode partyid levels
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |>
  count(partyid)
#> # A tibble: 10 × 2
#>    partyid                   n
#>    <fct>                 <int>
#>  1 No answer               154
#>  2 Don't know                1
#>  3 Other party             393
#>  4 Republican, strong     2314
#>  5 Republican, weak       3032
#>  6 Independent, near rep  1791
#>  7 Independent            4119
#>  8 Independent, near dem  2499
#>  9 Democrat, weak         3690
#> 10 Democrat, strong       3490

7.4.2 fct_collapse()

You can use fct_recode() to lump levels together, but you need to retype the new level multiple times. If you want to collapse multiple levels into a smaller number of levels, it is best to use fct_collapse().

gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180

7.4.3 fct_lump_*()

Use the fct_lump_*() family of functions to quickly lump together small groups of factor levels to make a plot.

fct_lump_lowfreq() progressively lumps the smallest groups categories into “Other”, always keeping “Other” as the smallest category. It is not overly useful in the case below, but it shows how it works.

gss_cat |>
  mutate(relig = fct_lump_lowfreq(relig)) |>
  count(relig)
#> # A tibble: 2 × 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Other      10637

fct_lump_n() is more exact, allowing you to control the number of levels to end up with.

gss_cat |>
  mutate(relig = fct_lump_n(relig, 10)) |>
  count(relig, sort = TRUE)
#> # A tibble: 10 × 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Protestant              10846
#>  2 Catholic                 5124
#>  3 None                     3523
#>  4 Christian                 689
#>  5 Other                     458
#>  6 Jewish                    388
#>  7 Buddhism                  147
#>  8 Inter-nondenominational   109
#>  9 Moslem/islam              104
#> 10 Orthodox-christian         95

fct_lump_min() lumps levels that appear fewer than min times.

gss_cat |>
  mutate(relig = fct_lump_min(relig, 100)) |>
  count(relig, sort = TRUE)
#> # A tibble: 9 × 2
#>   relig                       n
#>   <fct>                   <int>
#> 1 Protestant              10846
#> 2 Catholic                 5124
#> 3 None                     3523
#> 4 Christian                 689
#> 5 Other                     553
#> 6 Jewish                    388
#> 7 Buddhism                  147
#> 8 Inter-nondenominational   109
#> 9 Moslem/islam              104

fct_lump_prop() lumps levels that appear in fewer than (or equal to) prop * n times.

gss_cat |>
  mutate(relig = fct_lump_prop(relig, 0.01)) |>
  count(relig, sort = TRUE)
#> # A tibble: 6 × 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Catholic    5124
#> 3 None        3523
#> 4 Other        913
#> 5 Christian    689
#> 6 Jewish       388

fct_other() is a convenience function that manually recodes levels with “Other” if you want to do it by hand. You can either list the levels to keep or those to drop to convert to other.

gss_cat |>
  mutate(relig = fct_other(relig, drop = c("Other eastern"))) |>
  count(relig, sort = TRUE)
#> # A tibble: 14 × 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Protestant              10846
#>  2 Catholic                 5124
#>  3 None                     3523
#>  4 Christian                 689
#>  5 Jewish                    388
#>  6 Other                     256
#>  7 Buddhism                  147
#>  8 Inter-nondenominational   109
#>  9 Moslem/islam              104
#> 10 Orthodox-christian         95
#> 11 No answer                  93
#> 12 Hinduism                   71
#> 13 Native american            23
#> 14 Don't know                 15

  1. (McNamara and Horton 2017)↩︎