dplyr 0.5 new functions: part I
dplyr version 0.5 introduced several new functions:
 coalesce()
 case_when()
 if_else()
 na_if()
 near()
 recode()
 union_all()
 summarise_all() and mutate_all()
 summarise_at() and mutate_at()
 summarise_if() and mutate_if()
 select_if()
Let’s take a look at the first five.
coalesce()
library(dplyr)
The coalesce() function takes two or more vectors as arguments and finds the first non-missing value at each position. It serves a similar purpose to the SQL COALESCE function.
It is easy to illustrate what the function does with a simple example:
y <- c(NA, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, NA)
w <- c(10, 20, 30, NA, NA)
coalesce(y, z, w)
## [1] 10 2 3 4 5
All vectors must be of the same type. Mixing different types results in an error:
z <- c(NA, NA, "3", "4", NA)
coalesce(y, z)
## Error: Vector 1 has type 'character' not 'double'
The function also recycles later arguments of length 1 to the length of the first vector, which makes it easy to replace every missing value with a single default. For instance:
coalesce(y, 0)
## [1] 0 2 0 0 5
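To make the semantics concrete, here is a rough base R sketch of what coalesce() computes. The helper coalesce_sketch() is a hypothetical name introduced for illustration; it is not dplyr's implementation and it skips the type checks described above.

```r
# Hypothetical helper: a base R sketch of coalesce() semantics.
# Not dplyr's implementation; it performs no type checking.
coalesce_sketch <- function(x, ...) {
  for (candidate in list(...)) {
    # Recycle the candidate to the length of x
    # (dplyr itself only accepts length 1 or length(x))
    candidate <- rep_len(candidate, length(x))
    # Fill only the positions that are still missing
    still_missing <- is.na(x)
    x[still_missing] <- candidate[still_missing]
  }
  x
}

y <- c(NA, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, NA)
w <- c(10, 20, 30, NA, NA)

res_vectors <- coalesce_sketch(y, z, w)
res_vectors
## [1] 10  2  3  4  5

res_default <- coalesce_sketch(y, 0)
res_default
## [1] 0 2 0 0 5
```

Each later vector only fills positions that every earlier vector left missing, which is why the order of the arguments matters.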
if_else()
The if_else() function is very similar to the ifelse() function from base R, with a few differences. The first is that it takes an optional fourth argument, missing, that supplies the value to use where the condition is NA.
x <- c(1:5, NA)
ifelse(x < 3, "a", "b")
## [1] "a" "a" "b" "b" "b" NA
if_else(x < 3, "a", "b", "c")
## [1] "a" "a" "b" "b" "b" "c"
The new function also preserves types: the type of the result is defined by the true vector, and the false and missing values must have that same type. This means that if_else() is type-safe, and incompatible types raise an error:
ifelse(x < 3, x, "b")
## [1] "1" "2" "b" "b" "b" NA
if_else(x < 3, x, "b")
## Error: `false` has type 'character' not 'integer'
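One way to resolve this kind of error is to convert the branches to a common type yourself. The sketch below assumes character output is what you want and uses as.character() on the true branch; the value "unknown" passed to missing is made up for illustration.

```r
library(dplyr)

x <- c(1:5, NA)

# Convert the `true` values to character so both branches share a type
labels <- if_else(x < 3, as.character(x), "b")
labels
## [1] "1" "2" "b" "b" "b" NA

# The same call, with an explicit value where the condition is NA
labels_filled <- if_else(x < 3, as.character(x), "b", missing = "unknown")
labels_filled
## [1] "1"       "2"       "b"       "b"       "b"       "unknown"
```

Making the conversion explicit keeps the choice of output type visible in the code, instead of leaving it to silent coercion as ifelse() does.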
Finally, if_else() is faster than the base R function. The following benchmark on a vector of 10,000 values illustrates this, with a median time roughly four times lower:
library(microbenchmark)
set.seed(7867)
x <- runif(10000)
microbenchmark(
  ifelse(x < 0.5, "a", "b"),
  if_else(x < 0.5, "a", "b")
)
## Unit: microseconds
##                        expr      min       lq      mean   median       uq       max neval
##   ifelse(x < 0.5, "a", "b") 1990.063 2041.839 2727.0666 2108.643 2945.422 33813.663   100
##  if_else(x < 0.5, "a", "b")  457.695  478.701  668.0394  491.841  563.120  1609.154   100
case_when()
The case_when() function is a vectorised generalisation of chained if and else if statements.
It takes a sequence of two-sided formulas. The left-hand side determines which values match, and the right-hand side gives the replacement value. It is very useful for replacing nested ifelse() calls.
Unfortunately, the function does not yet work inside mutate(), but according to Hadley Wickham this will be fixed in a future version of dplyr.
In the meantime, it can be used inside mutate() with a simple workaround, referring to the columns through the pipe's . placeholder (for example .$x):
df <- data_frame(x = 1:10, y = 10:1)
df %>% mutate(z = case_when(.$x > 5 & .$y < 5 ~ 1,
                            .$x < 4 & .$y > 7 ~ 2,
                            TRUE ~ 0))
## # A tibble: 10 × 3
##        x     y     z
##    <int> <int> <dbl>
## 1      1    10     2
## 2      2     9     2
## 3      3     8     2
## 4      4     7     0
## 5      5     6     0
## 6      6     5     0
## 7      7     4     1
## 8      8     3     1
## 9      9     2     1
## 10    10     1     1
The arguments are evaluated in the order in which they are supplied. So in the previous example z is equal to zero only if the earlier conditions are false.
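The effect of this ordering is easiest to see on a plain vector with overlapping conditions. The following sketch uses made-up values and labels; only the order of the first two formulas differs between the two calls.

```r
library(dplyr)

x <- c(1, 5, 10, 15)

# Most specific condition first: each value gets the first label that matches
size_ordered <- case_when(
  x > 8 ~ "large",
  x > 3 ~ "medium",
  TRUE  ~ "small"
)
size_ordered
## [1] "small"  "medium" "large"  "large"

# Broader condition first: x > 3 also matches 10 and 15,
# so the "large" formula is never reached
size_reversed <- case_when(
  x > 3 ~ "medium",
  x > 8 ~ "large",
  TRUE  ~ "small"
)
size_reversed
## [1] "small"  "medium" "medium" "medium"
```

As a rule of thumb, list the most specific conditions first and keep TRUE as the final catch-all.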
na_if()
na_if() replaces a given value with NA, which is useful when missing values are encoded with a placeholder such as 99. Take the following data frame:
df <- data_frame(x = c(99, rnorm(3), 99, 99))
df
## # A tibble: 6 × 1
## x
## <dbl>
## 1 99.0000000
## 2 1.1762076
## 3 1.1065201
## 4 0.5262338
## 5 99.0000000
## 6 99.0000000
To replace every value of 99 with NA, we can do:
df %>%
  mutate(x = na_if(x, 99))
## # A tibble: 6 × 1
## x
## <dbl>
## 1 NA
## 2 1.1762076
## 3 1.1065201
## 4 0.5262338
## 5 NA
## 6 NA
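na_if() also works directly on a vector, outside of mutate(). The sketch below uses a made-up character vector in which empty strings stand in for missing answers.

```r
library(dplyr)

# Made-up data: empty strings mark missing answers
answers <- c("yes", "", "no", "")

cleaned <- na_if(answers, "")
cleaned
## [1] "yes" NA    "no"  NA
```

In this sense na_if() is the counterpart of coalesce(): one turns a specific value into NA, the other turns NA into a value.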
near()
near() allows the comparison of two floating-point vectors within a given tolerance.
Let’s take the following vector:
x <- c(sqrt(2) ^ 2, sqrt(3) ^ 2)
x[1] == 2
## [1] FALSE
x[2] == 3
## [1] FALSE
We can compare x with the vector c(2, 3) using near() [1]:
near(x, c(2,3))
## [1] TRUE TRUE
The function takes a third argument, tol, that defines the tolerance (accuracy) of the comparison. The default value is .Machine$double.eps^0.5, approximately 1.490116e-08.
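The comparison itself amounts to checking whether the absolute difference is below the tolerance. The base R sketch below illustrates the idea and the effect of changing the tolerance; near_sketch() is a hypothetical stand-in written for this example, and the values compared are made up.

```r
# Hypothetical stand-in for near(): TRUE where |x - y| is below the tolerance
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

x <- c(sqrt(2) ^ 2, sqrt(3) ^ 2)

default_tol <- near_sketch(x, c(2, 3))
default_tol
## [1] TRUE TRUE

# A difference of 0.001 is far larger than the default tolerance...
strict <- near_sketch(1, 1.001)
strict
## [1] FALSE

# ...but it passes once the tolerance is loosened to 0.01
loose <- near_sketch(1, 1.001, tol = 0.01)
loose
## [1] TRUE
```

A looser tolerance can be appropriate for values that have been rounded or measured with limited precision.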

[1] Confused about the results? If so, take a look at "What Every Computer Scientist Should Know About Floating-Point Arithmetic".