dplyr 0.5 - new functions: part I

Sat 11 March 2017 | tags: R

dplyr version 0.5 introduced several new functions:

  • coalesce()
  • case_when()
  • if_else()
  • na_if()
  • near()
  • recode()
  • union_all()
  • summarise_all() and mutate_all()
  • summarise_at() and mutate_at()
  • summarise_if() and mutate_if()
  • select_if()

Let’s take a look at the first five.

coalesce()

library(dplyr)

The coalesce() function takes two or more vectors as arguments and returns the first non-missing value at each position. It serves a similar purpose to the SQL COALESCE function.

It is easy to illustrate what the function does with a simple example:

y <- c(NA, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, NA)
w <- c(10, 20, 30, NA, NA)
coalesce(y, z, w)
## [1] 10  2  3  4  5

All vectors must be of the same type. Mixing different types results in an error:

z <- c(NA, NA, "3", "4", NA)
coalesce(y, z)
## Error: Vector 1 has type 'character' not 'double'

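The replacement vectors (the second and later arguments) must either have the same length as the first vector or have length 1, in which case they are recycled. This makes it easy to replace every missing value with a constant. For instance: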

coalesce(y, 0)
## [1] 0 2 0 0 5
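The two behaviours above can be combined: fall back to a second vector first, and then to a constant. The following is only a minimal sketch; the names primary, backup and first_found are made up for the illustration.

```r
library(dplyr)

# Hypothetical vectors: a preferred value and a fallback for each position
primary <- c(NA, "b", NA)
backup  <- c("x", "y", NA)

# Take primary where available, then backup, then the constant "unknown"
# (the length-1 constant is recycled to the length of primary)
first_found <- coalesce(primary, backup, "unknown")
# first_found now holds "x", "b", "unknown"
```

Note that all three arguments are character vectors, so the type restriction described above is respected.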

if_else()

The if_else() function is very similar to the ifelse() function from base R, but there are a few differences. The first one is that it takes an optional fourth argument, missing, that replaces missing values.

x <- c(1:5, NA)
ifelse(x < 3, "a", "b")
## [1] "a" "a" "b" "b" "b" NA
if_else(x < 3, "a", "b", "c")
## [1] "a" "a" "b" "b" "b" "c"

The new function is also stricter about types: the true and false vectors must have the same type, and the result has that type. This means that if_else() is type-safe, and any incompatible types will raise an error:

ifelse(x < 3, x, "b")
## [1] "1" "2" "b" "b" "b" NA
if_else(x < 3, x, "b")
## Error: `false` has type 'character' not 'integer'
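One way around the error, shown here only as a sketch, is to convert one branch explicitly so that both branches have the same type. The variable names labels and labels_filled are made up for the illustration.

```r
library(dplyr)

x <- c(1:5, NA)

# Convert the integer branch to character so both branches share one type
labels <- if_else(x < 3, as.character(x), "b")
# labels now holds "1", "2", "b", "b", "b", NA

# The optional `missing` argument fills positions where the condition is NA
labels_filled <- if_else(x < 3, as.character(x), "b", missing = "unknown")
# labels_filled now holds "1", "2", "b", "b", "b", "unknown"
```

The explicit as.character() call makes the intended result type visible in the code, instead of relying on the silent coercion that ifelse() performs.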

Finally, if_else() is faster than ifelse(). The following benchmark illustrates this:

library(microbenchmark)

set.seed(7867)
x <- runif(10000)
microbenchmark(
  ifelse(x < 0.5, "a", "b"),
  if_else(x < 0.5, "a", "b")
)
## Unit: microseconds
##                        expr      min       lq      mean   median       uq
##   ifelse(x < 0.5, "a", "b") 1990.063 2041.839 2727.0666 2108.643 2945.422
##  if_else(x < 0.5, "a", "b")  457.695  478.701  668.0394  491.841  563.120
##        max neval
##  33813.663   100
##   1609.154   100

case_when()

The case_when() function is a vectorised generalisation of a chain of if and else if statements.

It takes a sequence of two-sided formulas. The left-hand side determines the match, and the right-hand side the value. It is very useful for replacing nested if-else statements.

Unfortunately, the function does not yet work inside mutate(), but according to Hadley Wickham this will be fixed in a future version of dplyr.

In the meantime, it can be used inside mutate() with a simple workaround: refer to the columns through the pipe placeholder (.$x instead of x):

df <- data_frame(x = 1:10, y = 10:1)

df %>% mutate(z = case_when(.$x > 5 & .$y < 5 ~ 1,
                            .$x < 4 & .$y > 7 ~ 2,
                            TRUE ~ 0))
## # A tibble: 10 × 3
##        x     y     z
##    <int> <int> <dbl>
## 1      1    10     2
## 2      2     9     2
## 3      3     8     2
## 4      4     7     0
## 5      5     6     0
## 6      6     5     0
## 7      7     4     1
## 8      8     3     1
## 9      9     2     1
## 10    10     1     1

The formulas are evaluated in the order in which they are given. So in the previous example z is equal to zero only if none of the earlier conditions is true.
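A small sketch makes the effect of the ordering visible. It calls case_when() directly on a vector, so no mutate() workaround is needed; the names specific_first and general_first are made up for the illustration.

```r
library(dplyr)

x <- c(1, 6, 12)

# Most specific condition first: 12 matches x > 10 before x > 5 is tried
specific_first <- case_when(
  x > 10 ~ "large",
  x > 5  ~ "medium",
  TRUE   ~ "small"
)
# specific_first now holds "small", "medium", "large"

# General condition first: 12 already matches x > 5,
# so the x > 10 formula is never reached
general_first <- case_when(
  x > 5  ~ "medium",
  x > 10 ~ "large",
  TRUE   ~ "small"
)
# general_first now holds "small", "medium", "medium"
```

For this reason it is usually safer to write the most specific conditions first and leave TRUE as the final catch-all.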

na_if()

This is a useful function that replaces a given value with NA. Take the following data frame:

df <- data_frame(x = c(-99, rnorm(3), -99, -99))
df
## # A tibble: 6 × 1
##             x
##         <dbl>
## 1 -99.0000000
## 2   1.1762076
## 3   1.1065201
## 4  -0.5262338
## 5 -99.0000000
## 6 -99.0000000

If we want to replace every value of -99 with NA, we can do:

df %>% 
  mutate(x = na_if(x, -99))
## # A tibble: 6 × 1
##            x
##        <dbl>
## 1         NA
## 2  1.1762076
## 3  1.1065201
## 4 -0.5262338
## 5         NA
## 6         NA
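na_if() also works on plain vectors, and it can be seen as the counterpart of coalesce(): one turns a value into NA, the other turns NA into a value. A minimal sketch, with made-up variable names:

```r
library(dplyr)

# A sentinel value of -99 marks missing measurements
x <- c(-99, 1.5, -99)

cleaned <- na_if(x, -99)
# cleaned now holds NA, 1.5, NA

# coalesce() goes the other way and replaces NA with a value
refilled <- coalesce(cleaned, 0)
# refilled now holds 0, 1.5, 0

# The same idea works for character vectors, e.g. empty strings
blanks <- na_if(c("a", "", "b"), "")
# blanks now holds "a", NA, "b"
```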

near()

near() allows the comparison of two floating point vectors within a given tolerance.

Let’s take the following vector:

x <- c(sqrt(2) ^ 2, sqrt(3) ^ 2)
x[1] == 2
## [1] FALSE
x[2] == 3
## [1] FALSE

We can compare x with the vector c(2, 3) using near() [1]:

near(x, c(2,3))
## [1] TRUE TRUE

The function takes a third argument, tol, that defines the tolerance of the comparison. The default value is .Machine$double.eps^0.5, which is approximately 1.490116e-08 on a typical machine.
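The effect of the tolerance can be seen in a short sketch. The variable names are made up for the illustration, and the value 0.01 is an arbitrary choice.

```r
library(dplyr)

# The classic floating-point surprise: exact comparison fails
exact <- (0.1 + 0.2) == 0.3
# exact is FALSE

# near() treats values within the tolerance as equal
approx <- near(0.1 + 0.2, 0.3)
# approx is TRUE

# A difference of 0.001 is far larger than the default tolerance...
strict <- near(1, 1.001)
# strict is FALSE

# ...but it passes with a looser, user-supplied tolerance
loose <- near(1, 1.001, tol = 0.01)
# loose is TRUE
```

A looser tolerance is a trade-off: it hides rounding noise, but it also hides real differences smaller than tol.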


  1. Confused about the results? If so, take a look at "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
