25 Programming with dplyr (pre 1.0.0)
These notes refer to the vignette prior to dplyr 1.0.0, which fundamentally changed how the issue of non-standard evaluation or tidy evaluation is presented. For notes on the updated vignette see Programming with dplyr.
25.1 Introduction
dplyr uses non-standard evaluation (NSE).
- Positives
  - You state the name of the data frame once and can then perform multiple operations on it.
  - dplyr is better able to integrate with SQL.
- Negatives
  - Arguments are not referentially transparent: you cannot replace a value with a seemingly equivalent object that has been defined elsewhere. This makes it hard to write functions whose arguments change how dplyr verbs are computed.
  - Because of its terseness, dplyr code can be ambiguous, which makes functions less predictable.
- Tools that help solve these problems when writing functions
  - Pronouns
  - Quasiquotation
- Goals of tutorial
  - Teach quosures: the data structure that stores both an expression and an environment.
  - Teach tidyeval, the underlying toolkit through which this is implemented.
  - Programming recipes
- dplyr verbs in functions can fail silently if one of the variables is not present in the data frame but is present in the global environment.
- Writing a function is hard if you want one of the arguments to be a variable name (like x) or an expression (like x + y). That is because dplyr automatically “quotes” those inputs, so they are not referentially transparent.
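As a rough illustration of the silent-failure problem, the sketch below uses a small made-up data frame (the names df_demo, x, y and z are assumptions for this example, not taken from the vignette). The first mutate() quietly picks up y from the global environment because df_demo has no such column. The .data pronoun makes the same call fail loudly instead.

```r
library(dplyr)

# Hypothetical data: df_demo has a column x but no column y.
df_demo <- tibble(x = 1:3)
y <- 10  # a global variable that happens to share the name

# Silent failure: y is not a column, so the lookup falls back to the
# global y and no error is raised.
silent <- mutate(df_demo, z = x + y)

# The .data pronoun restricts the lookup to the data frame, so the
# missing column becomes an error (caught here and turned into a flag).
loud <- tryCatch(
  mutate(df_demo, z = x + .data$y),
  error = function(e) "error"
)
```

If a function relies on a column being present, referring to it through .data is one way to avoid depending on whatever happens to exist in the calling environment.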
25.2 Summarise example
Start with a function that does not work.
The problem is that group_by() works by quoting its input rather than evaluating it.
To fix this, we can quote the input manually with quo() so that the function can take a bare variable name, just as group_by() does. We then use !! to unquote the input inside group_by() so that it is evaluated rather than quoted.
To call the function without writing quo() in the call itself, you need a function that captures what the user typed for an argument instead of evaluating it. This is done by enquo(): it looks at the argument, sees what the user typed, and returns that value as a quosure.
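The three steps above can be sketched as follows. The df here is reconstructed from the outputs printed elsewhere in these notes, and the names my_summarise_broken and my_summarise_quo are made up for this illustration.

```r
library(dplyr)

# Data reconstructed from the outputs shown elsewhere in these notes.
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(4L, 1L, 2L, 5L, 3L),
  b = c(1L, 4L, 5L, 2L, 3L)
)

# Step 1, does not work: group_by() quotes the name group_var itself
# rather than the variable name the caller supplied.
my_summarise_broken <- function(df, group_var) {
  df %>%
    group_by(group_var) %>%
    summarise(a = mean(a))
}
broken <- try(my_summarise_broken(df, g1), silent = TRUE)

# Step 2, manual quoting: the caller wraps the variable in quo(),
# and !! unquotes it inside group_by().
my_summarise_quo <- function(df, group_var) {
  df %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}
res_quo <- my_summarise_quo(df, quo(g1))

# Step 3, enquo() captures what the caller typed, so a bare name works.
my_summarise <- function(df, group_var) {
  group_var <- enquo(group_var)
  df %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}
res <- my_summarise(df, g1)
```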
25.3 Different input variable
Solution for the same problem as above, but with an input that is used in several arguments of a dplyr call.
Test the approach above using quo() and !!.
You can also wrap quo() around the dplyr call to see what will happen from dplyr’s perspective. This is useful for debugging.
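A small sketch of both ideas, assuming the same df that the outputs in these notes imply (reconstructed here). The names my_var, res and dbg are illustrative only.

```r
library(dplyr)

# Data reconstructed from the outputs shown elsewhere in these notes.
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(4L, 1L, 2L, 5L, 3L),
  b = c(1L, 4L, 5L, 2L, 3L)
)

# Quote the variable once, then unquote it wherever it is needed.
my_var <- quo(a)
res <- summarise(df,
  mean = mean(!!my_var),
  sum = sum(!!my_var),
  n = n()
)

# Wrapping the whole call in quo() captures it without running it,
# so printing shows the expression after unquoting.
dbg <- quo(summarise(df,
  mean = mean(!!my_var),
  sum = sum(!!my_var),
  n = n()
))
print(dbg)
```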
Fully fixed function
my_summarise2 <- function(df, expr) {
  expr <- enquo(expr)
  summarise(df,
    mean = mean(!!expr),
    sum = sum(!!expr),
    n = n()
  )
}
my_summarise2(df, a)
#> # A tibble: 1 × 3
#> mean sum n
#> <dbl> <int> <int>
#> 1 3 15 5
my_summarise2(df, a * b)
#> # A tibble: 1 × 3
#> mean sum n
#> <dbl> <int> <int>
#> 1 7.4 37 5
25.4 Different input and output variable
mutate(df, mean_a = mean(a), sum_a = sum(a))
#> # A tibble: 5 × 6
#> g1 g2 a b mean_a sum_a
#> <dbl> <dbl> <int> <int> <dbl> <int>
#> 1 1 1 4 1 3 15
#> 2 1 2 1 4 3 15
#> 3 2 1 2 5 3 15
#> 4 2 2 5 2 3 15
#> 5 2 1 3 3 3 15
mutate(df, mean_b = mean(b), sum_b = sum(b))
#> # A tibble: 5 × 6
#> g1 g2 a b mean_b sum_b
#> <dbl> <dbl> <int> <int> <dbl> <int>
#> 1 1 1 4 1 3 15
#> 2 1 2 1 4 3 15
#> 3 2 1 2 5 3 15
#> 4 2 2 5 2 3 15
#> 5 2 1 3 3 3 15
This case is different: we want a function that not only computes the mean and sum but also names the output columns accordingly. The new names are created by pasting strings together, so quo_name() is used to convert the captured expression to a string. Because !!mean_name = mean(!!expr) is not valid R code, the := helper is needed instead: !!mean_name := mean(!!expr).
my_mutate <- function(df, expr) {
  expr <- enquo(expr)
  mean_name <- paste0("mean_", quo_name(expr))
  sum_name <- paste0("sum_", quo_name(expr))
  mutate(df,
    !!mean_name := mean(!!expr),
    !!sum_name := sum(!!expr)
  )
}
my_mutate(df, a)
#> # A tibble: 5 × 6
#> g1 g2 a b mean_a sum_a
#> <dbl> <dbl> <int> <int> <dbl> <int>
#> 1 1 1 4 1 3 15
#> 2 1 2 1 4 3 15
#> 3 2 1 2 5 3 15
#> 4 2 2 5 2 3 15
#> 5 2 1 3 3 3 15
25.5 Capturing multiple variables
To make the my_summarise() function accept any number of grouping variables, we need to make three changes:
- Use ... in the function definition so our function can accept any number of arguments.
- Use quos() to capture all the ... as a list of quosures.
- Use !!! instead of !! to splice the arguments into group_by().
my_summarise <- function(df, ...) {
  group_var <- quos(...)
  df %>%
    group_by(!!!group_var) %>%
    summarise(a = mean(a))
}
my_summarise(df, g1, g2)
#> `summarise()` has grouped output by 'g1'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 × 3
#> # Groups: g1 [2]
#> g1 g2 a
#> <dbl> <dbl> <dbl>
#> 1 1 1 4
#> 2 1 2 1
#> 3 2 1 2.5
#> 4 2 2 5
25.6 Theory
25.6.1 Quoting
See also: http://rlang.tidyverse.org/reference/quosure.html
- Defining quotation in R: “Quoting is the action of capturing an expression instead of evaluating it. All expression-based functions quote their arguments and get the R code as an expression rather than the result of evaluating that code.”
  - Note that "" is not a quoting operation, because it returns a string rather than an expression.
- A common quoting construct is the formula used in statistical modelling, such as disp ~ cyl + drat.
- Care is needed when creating formulas, because the same expression can yield different values depending on its environment.
- The ability for one name to refer to different values in different environments is an important part of R and dplyr.
- When an object keeps track of an environment, it is said to have an enclosure.
- quosures: one-sided formulas; one-sided formulas are quotes (they carry an expression) with an environment.
  - Example: var <- ~toupper(letters[1:5])
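To make the point about environments concrete, here is a minimal sketch using rlang. It builds the quosure with quo() rather than the formula notation above, and make_quo() is a made-up helper for this example. The quosure remembers the environment in which it was created, so it evaluates with that environment's x rather than the x visible where it is later evaluated.

```r
library(rlang)

# make_quo() is a hypothetical helper: it creates a quosure inside a
# function, so the quosure records that function's environment.
make_quo <- function() {
  x <- 10
  quo(x * 2)
}

x <- 1           # a different x in the calling environment
q <- make_quo()

quo_get_expr(q)  # the captured expression
quo_get_env(q)   # the environment where it was created

# Evaluated with x = 10 from make_quo(), not x = 1 from here.
value <- eval_tidy(q)
```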
25.6.2 Quasiquotation
Automatic quoting makes dplyr very convenient for interactive use. But if you want to program with dplyr, you need some way to refer to variables indirectly. The solution to this problem is quasiquotation, which allows you to evaluate directly inside an expression that is otherwise quoted.
Three types of unquoting in the tidyeval framework:
- Basic unquoting with either UQ() or !!
# Here we capture `letters[1:5]` as an expression:
quo(toupper(letters[1:5]))
#> <quosure>
#> expr: ^toupper(letters[1:5])
#> env: global
# Here we capture the value of `letters[1:5]`
quo(toupper(!!letters[1:5]))
#> <quosure>
#> expr: ^toupper(<chr: "a", "b", "c", "d", "e">)
#> env: global
quo(toupper(UQ(letters[1:5])))
#> <quosure>
#> expr: ^toupper(<chr: "a", "b", "c", "d", "e">)
#> env: global
- Unquote-splicing
Unquote-splicing’s functional form is UQS() and the syntactic shortcut is !!!. It takes a vector and inserts each element of the vector in the surrounding function call.
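A short sketch of splicing with !!!, using rlang directly. The object names spliced, named_args and spliced_named are illustrative. When the spliced object has names, those names become argument names in the call.

```r
library(rlang)

# Splice each element of a character vector into the call to list().
spliced <- quo(list(!!!letters[1:3]))

# Names of the spliced object become argument names in the call.
named_args <- list(foo = 1L, bar = 2L)
spliced_named <- quo(list(!!!named_args))

# Evaluating the quosures runs the calls that were built.
eval_tidy(spliced)
eval_tidy(spliced_named)
```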
- Unquoting names
Setting argument names with :=
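A minimal sketch of unquoting names, assuming the built-in mtcars data set. The variables mean_nm and count_nm are illustrative and hold the output column names as strings.

```r
library(dplyr)

# Column names stored as strings.
mean_nm <- "mean"
count_nm <- "count"

# := allows an unquoted string on the left-hand side, which a plain =
# would not accept.
res <- mtcars %>%
  group_by(am) %>%
  summarise(
    !!mean_nm := mean(cyl),
    !!count_nm := n()
  )
```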