6 dplyr 1.1.0
dplyr 1.1.0 was released on 29 January 2023.
6.1 dplyr 1.1.0 Blog posts
These notes follow the dplyr 1.1.0 blog post series on the tidyverse blog: joins; per-operation grouping; the power of vctrs; and pick(), reframe(), and arrange().
6.2 Resources
- Mine Çetinkaya-Rundel, What's new in the tidyverse in 2023
  - Good overview of the changes in dplyr and the tidyverse discussed here.
6.3 Joins
6.3.1 Data
Data with two tables: transactions, with a company id, and companies, with information about each company.
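The table definitions are not included in these notes; the sketch below is reconstructed from the join outputs shown and assumes dplyr is attached.
library(dplyr)

# Reconstructed from the outputs below; not the original definitions
transactions <- tibble(
  company = c("A", "A", "B", "B"),
  year = c(2019, 2020, 2021, 2023),
  revenue = c(50, 4, 10, 12)
)

companies <- tibble(
  id = c("A", "B"),
  name = c("Patagonia", "RStudio")
)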
6.3.2 join_by()
Instead of the syntax by = c("company" = "id"), use the function join_by(): by = join_by(company == id).
# Previously
transactions |>
inner_join(companies, by = c("company" = "id"))
#> # A tibble: 4 × 4
#> company year revenue name
#> <chr> <dbl> <dbl> <chr>
#> 1 A 2019 50 Patagonia
#> 2 A 2020 4 Patagonia
#> 3 B 2021 10 RStudio
#> 4 B 2023 12 RStudio
# With join_by()
transactions |>
inner_join(companies, by = join_by(company == id))
#> # A tibble: 4 × 4
#> company year revenue name
#> <chr> <dbl> <dbl> <chr>
#> 1 A 2019 50 Patagonia
#> 2 A 2020 4 Patagonia
#> 3 B 2021 10 RStudio
#> 4 B 2023 12 RStudio
6.3.3 Multiple matches
Add a since column to companies with information about the founding date. A warning appears when multiple matches are found and extra rows are added.
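The updated companies table is not shown in the notes; a sketch consistent with the output below, where both RStudio and Posit share id "B":
# Reconstructed from the output below; not the original definition
companies <- tibble(
  id = c("A", "B", "B"),
  since = c(1973, 2009, 2022),
  name = c("Patagonia", "RStudio", "Posit")
)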
transactions |>
inner_join(companies, by = join_by(company == id))
#> Warning in inner_join(transactions, companies, by = join_by(company == id)): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 3 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#> "many-to-many"` to silence this warning.
#> # A tibble: 6 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2021 10 2022 Posit
#> 5 B 2023 12 2009 RStudio
#> 6 B 2023 12 2022 Posit
Silence the warning with relationship = "many-to-many" (in dplyr 1.1.0 this was multiple = "all"), or turn it into an error with multiple = "error", which dplyr 1.1.1 deprecated in favour of relationship = "many-to-one" (see the warning below).
transactions |>
inner_join(companies, by = join_by(company == id),
multiple = "error")
#> Warning: Specifying `multiple = "error"` was deprecated in dplyr 1.1.1.
#> ℹ Please use `relationship = "many-to-one"` instead.
#> Error in `inner_join()`:
#> ! Each row in `x` must match at most 1 row in `y`.
#> ℹ Row 3 of `x` matches multiple rows in `y`.
6.3.4 Inequality joins
Inequality joins can help fix the issue of multiple matches, using an inequality expression in join_by(). In this case, we only want transactions to be linked to companies after they have been founded: year >= since.
transactions |>
inner_join(companies, join_by(company == id, year >= since))
#> # A tibble: 5 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2009 RStudio
#> 5 B 2023 12 2022 Posit
This eliminates the 2021 match to Posit, but inequality joins are still likely to return multiple matches because they are only bounded on one side.
6.3.5 Rolling joins
Rolling joins are meant to address the one-sided nature of inequality joins. Use the closest() helper to filter the matches to the one where year is closest to since.
transactions |>
inner_join(companies, join_by(company == id, closest(year >= since)))
#> # A tibble: 4 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2022 Posit
6.3.6 Unmatched rows
Unmatched rows can be caught with the unmatched argument. Set unmatched = "error" to catch rows that would otherwise be silently dropped by inner_join().
# Add row that will not be matched
transactions <- transactions |>
tibble::add_row(company = "C", year = 2023, revenue = 15)
# Unmatched rows silently dropped
transactions |>
inner_join(
companies,
join_by(company == id, closest(year >= since))
)
#> # A tibble: 4 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2022 Posit
# Error with unmatched rows
transactions |>
inner_join(
companies,
join_by(company == id, closest(year >= since)),
unmatched = "error"
)
#> Error in `inner_join()`:
#> ! Each row of `x` must have a match in `y`.
#> ℹ Row 5 of `x` does not have a match.
6.4 Per-operation grouping
6.4.1 Persistent grouping with group_by()
Groups are maintained through other function calls, such as to mutate() or summarise().
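For example, a brief sketch using the transactions data from the joins section (illustrative, not from the original notes):
transactions |>
  group_by(company) |>
  mutate(total = sum(revenue))
# The result is still grouped by company: later verbs keep operating
# within each company group until ungroup() is called.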
6.4.2 Per-operation grouping with .by/by
dplyr 1.1.0 adds per-operation grouping within summarise(), mutate(), filter(), and some other dplyr verbs with the new .by argument.
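By contrast, a sketch of the per-operation form (again illustrative): the grouping applies to this one call only and the result comes back ungrouped.
transactions |>
  mutate(total = sum(revenue), .by = company)
# Grouped by company for this mutate() only; the result is ungrouped.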
6.4.3 group_by() vs .by
- With .by the results are always ungrouped, and tidyselect is used to group by multiple columns.
- You cannot create variables on the fly in .by; you must create them earlier in your pipeline, unlike, e.g., df |> group_by(month = floor_date(date, "month")).
- The two approaches differ in how they deal with rows ordered by arrange(), as the example below shows.
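The transactions table used below has six rows rather than the four from the joins section; its definition is not in the notes, so the following sketch is reconstructed from the output shown:
# Reconstructed from the printed transactions2; not the original definition
transactions <- tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year = c(2019, 2019, 2020, 2021, 2023, 2023),
  revenue = c(20, 50, 4, 10, 12, 18)
)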
# Ordered tibble
transactions2 <- transactions |>
arrange(company, desc(year))
transactions2
#> # A tibble: 6 × 3
#> company year revenue
#> <chr> <dbl> <dbl>
#> 1 A 2020 4
#> 2 A 2019 20
#> 3 A 2019 50
#> 4 B 2023 12
#> 5 B 2023 18
#> 6 B 2021 10
# `group_by()` re-sorts by grouping keys
transactions2 |>
group_by(company, year) |>
summarise(total = sum(revenue), .groups = "drop")
#> # A tibble: 4 × 3
#> company year total
#> <chr> <dbl> <dbl>
#> 1 A 2019 70
#> 2 A 2020 4
#> 3 B 2021 10
#> 4 B 2023 30
# `.by` keeps the previous ordering
transactions2 |>
summarise(total = sum(revenue), .by = c(company, year))
#> # A tibble: 4 × 3
#> company year total
#> <chr> <dbl> <dbl>
#> 1 A 2020 4
#> 2 A 2019 70
#> 3 B 2023 30
#> 4 B 2021 10
6.5 The power of vctrs
6.5.1 case_when()
case_when() now accepts a regular NA instead of requiring a type-specific NA (such as NA_character_). The following now just works.
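The example code is missing from the notes; a minimal sketch with an illustrative x vector:
x <- c(1, 12, -5, 6, -2, NA, 0)  # illustrative values, not from the notes

# A bare NA now works; previously this required NA_character_
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  TRUE ~ NA
)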
There is a new .default argument for when none of the cases are met. The .default is always processed last, so the recommendation is to place it at the end of the list of cases. With it, the negative values are labeled "other".
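A sketch with the same illustrative x:
# Values matching no case (the negatives, and NA) fall through to .default
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  .default = "other"
)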
6.5.2 case_match()
Remapping values with case_when() has been possible, but it is a bit verbose. case_match() is a new function that removes the repetition involved with x %in% on the left-hand side.
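The x vector for this example is not defined in the notes; a sketch reconstructed from the outputs below, together with the more verbose case_when() equivalent:
x <- c("USA", "Canada", "Wales", "UK", "China", NA, "Mexico", "Russia")

# The case_when() version repeats `x %in%` for every case
case_when(
  x %in% c("USA", "Canada", "Mexico") ~ "North America",
  x %in% c("France", "UK") ~ "Europe",
  x %in% "China" ~ "Asia"
)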
case_match(
x,
c("USA", "Canada", "Mexico") ~ "North America",
c("France", "UK") ~ "Europe",
"China" ~ "Asia"
)
#> [1] "North America" "North America" NA "Europe"
#> [5] "Asia" NA "North America" NA
case_match() is particularly helpful in a replacement helper function where you might want to change only a couple of values but leave everything else as is.
replace_match <- function(x, ...) {
case_match(x, ..., .default = x, .ptype = x)
}
replace_match(
x,
"USA" ~ "United States",
c("UK", "Wales") ~ "United Kingdom",
NA ~ "[Missing]"
)
#> [1] "United States" "Canada" "United Kingdom" "United Kingdom"
#> [5] "China" "[Missing]" "Mexico" "Russia"
6.5.3 consecutive_id()
consecutive_id() creates a consecutive id column linked to another column so that only consecutive rows with the same value are collapsed by group_by() and summarise(). Create it with mutate(id = consecutive_id(name)).
transcript <- tribble(
~name, ~text,
"Hadley", "I'll never learn Python.",
"Davis", "But aren't you speaking at PyCon?",
"Hadley", "So?",
"Hadley", "That doesn't influence my decision.",
"Hadley", "I'm not budging!",
"Mara", "Typical, Hadley. Stubborn as always.",
"Davis", "Fair enough!",
"Davis", "Let's move on."
)
# Create consecutive id column
transcript |>
mutate(id = consecutive_id(name))
#> # A tibble: 8 × 3
#> name text id
#> <chr> <chr> <int>
#> 1 Hadley I'll never learn Python. 1
#> 2 Davis But aren't you speaking at PyCon? 2
#> 3 Hadley So? 3
#> 4 Hadley That doesn't influence my decision. 3
#> 5 Hadley I'm not budging! 3
#> 6 Mara Typical, Hadley. Stubborn as always. 4
#> 7 Davis Fair enough! 5
#> 8 Davis Let's move on. 5
Now you can do the grouping and summarising. You could group just by id, but it is useful to group by both name and id to keep the name in the summary table.
transcript |>
mutate(id = consecutive_id(name)) |>
summarise(text = stringr::str_flatten(text, collapse = " "), .by = c(id, name))
#> # A tibble: 5 × 3
#> id name text
#> <int> <chr> <chr>
#> 1 1 Hadley I'll never learn Python.
#> 2 2 Davis But aren't you speaking at PyCon?
#> 3 3 Hadley So? That doesn't influence my decision. I'm not budging!
#> 4 4 Mara Typical, Hadley. Stubborn as always.
#> 5 5 Davis Fair enough! Let's move on.
6.6 pick(), reframe(), and arrange()
6.6.1 pick()
pick() is a new function that is meant to be a complement to across(). With across(), you typically apply a function to each column. With pick(), you typically apply a function to the full data frame of the columns you pick. In this way, pick() is replaceable with an equivalent call to tibble(): pick(a, c) creates a data frame that is the same as tibble(a = a, c = c).
df <- tibble(
x_1 = c(1, 3, 2, 1, 2),
x_2 = 6:10,
w_4 = 11:15,
y_2 = c(5, 2, 4, 0, 6)
)
# For instance finding the number of columns selected
# makes more semantic sense for pick() than across()
df |>
summarise(
n_x = ncol(pick(starts_with("x"))),
n_y = ncol(pick(starts_with("y")))
)
#> # A tibble: 1 × 2
#> n_x n_y
#> <int> <int>
#> 1 2 1
pick() is particularly useful in combination with ranking functions like dense_rank() that can take a data frame and rank the values.
df |>
mutate(
rank1 = dense_rank(x_1),
rank2 = dense_rank(pick(x_1, y_2)) # Using `y_2` to break ties in `x_1`
)
#> # A tibble: 5 × 6
#> x_1 x_2 w_4 y_2 rank1 rank2
#> <dbl> <int> <int> <dbl> <int> <int>
#> 1 1 6 11 5 1 2
#> 2 3 7 12 2 3 5
#> 3 2 8 13 4 2 3
#> 4 1 9 14 0 1 1
#> 5 2 10 15 6 2 4
You can also use pick() as a bridge between tidy selection and data masking in functions. For instance, it is useful in creating a function with group_by(). See Data masking patterns: Bridge patterns and the notes on the vignette.
my_group_by <- function(data, cols) {
group_by(data, pick({{ cols }}))
}
my_group_by(df, starts_with("x"))
#> # A tibble: 5 × 4
#> # Groups: x_1, x_2 [5]
#> x_1 x_2 w_4 y_2
#> <dbl> <int> <int> <dbl>
#> 1 1 6 11 5
#> 2 3 7 12 2
#> 3 2 8 13 4
#> 4 1 9 14 0
#> 5 2 10 15 6
6.6.2 reframe()
reframe() is intended as a replacement for the ability, introduced in dplyr 1.0.0, to use summarise() to return multiple results per group. Using summarise() to do this now triggers a warning. reframe() takes on the role of "doing something" to each group, with no restrictions on the number of rows returned per group.
One nice application of reframe() is computing quantiles at various probability thresholds.
df <- tibble(
g = c(1, 1, 1, 2, 2),
x = c(4, 3, 6, 2, 8),
y = c(5, 1, 2, 8, 9)
)
# Helper function
quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
tibble(
value = quantile(x, probs, na.rm = TRUE),
prob = probs
)
}
df |>
reframe(quantile_df(x), .by = g)
#> # A tibble: 6 × 3
#> g value prob
#> <dbl> <dbl> <dbl>
#> 1 1 3.5 0.25
#> 2 1 4 0.5
#> 3 1 5 0.75
#> 4 2 3.5 0.25
#> 5 2 5 0.5
#> 6 2 6.5 0.75
You can apply such a function to multiple columns using across(), which returns a packed data frame. You can unpack this using the new .unpack argument of across().
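A sketch of that combination, using the df and quantile_df() defined above (output omitted):
df |>
  reframe(across(x:y, quantile_df, .unpack = TRUE), .by = g)
# .unpack = TRUE unpacks the packed data-frame columns into
# x_value, x_prob, y_value, y_prob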