 dplyr 0.5 - new functions: part - I

Sat 11 March 2017 | tags: R

dplyr version 0.5 introduced several new functions:

• coalesce()
• case_when()
• if_else()
• na_if()
• near()
• recode()
• union_all()
• summarise_all(), mutate_all()
• summarise_at() and mutate_at()
• summarise_if() and mutate_if()
• select_if()

Let’s take a look at the first five.

## `coalesce()`

```r
library(dplyr)
```

The `coalesce()` function takes two or more vectors as arguments and returns the first non-missing value at each position. It serves a similar purpose to the SQL `COALESCE` function.

It is easy to illustrate what the function does with a simple example:

```r
y <- c(NA, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, NA)
w <- c(10, 20, 30, NA, NA)
coalesce(y, z, w)
```
```
## [1] 10  2  3  4  5
```
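To make the behaviour concrete, here is a minimal base R sketch of the same idea. `coalesce_sketch()` is a hypothetical helper written for illustration only; it is not how dplyr implements `coalesce()`, and it does none of dplyr's type checking:

```r
# Hypothetical helper, for illustration only: fold over the vectors,
# filling the positions that are still NA from the next vector.
coalesce_sketch <- function(...) {
  Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), list(...))
}

y <- c(NA, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, NA)
w <- c(10, 20, 30, NA, NA)

# Position 1 comes from w, positions 3 and 4 from z, the rest from y
coalesce_sketch(y, z, w)
```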

All vectors must be of the same type. Mixing different types results in an error:

```r
z <- c(NA, NA, "3", "4", NA)
coalesce(y, z)
```
```
## Error: Vector 1 has type 'character' not 'double'
```
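One way around this error, sketched here on the assumption that the character values really are numbers stored as text, is to convert the offending vector before coalescing. The name `z_chr` is only illustrative:

```r
library(dplyr)

y <- c(NA, 2, NA, NA, 5)
z_chr <- c(NA, NA, "3", "4", NA)

# Convert the character vector to double so both inputs share a type
coalesce(y, as.numeric(z_chr))
```

Position 1 stays missing because neither vector has a value there.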

The function also recycles vectors of length one, so a single value can serve as the default for every position that is still missing. For instance:

```r
coalesce(y, 0)
```
```
## [1] 0 2 0 0 5
```

## `if_else()`

The `if_else()` function is very similar to the `ifelse()` function from base R, but there are a few differences. The first is that it takes a fourth argument, `missing`, which supplies the value to use where the condition is `NA`.

```r
x <- c(1:5, NA)
ifelse(x < 3, "a", "b")
```
```
## [1] "a" "a" "b" "b" "b" NA
```
```r
if_else(x < 3, "a", "b", "c")
```
```
## [1] "a" "a" "b" "b" "b" "c"
```

The new function is also strict about types: the `true` and `false` vectors must be of the same type, and the result has that type. This makes `if_else()` type-safe, so incompatible types raise an error instead of being silently coerced:

```r
ifelse(x < 3, x, "b")
```
```
## [1] "1" "2" "b" "b" "b" NA
```
```r
if_else(x < 3, x, "b")
```
```
## Error: `false` has type 'character' not 'integer'
```
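The same strictness is worth keeping in mind for `NA`. A bare `NA` is logical in R, so using a typed missing value such as `NA_character_` should be the safe choice when one branch has to be missing. A small sketch, with labels made up for illustration:

```r
library(dplyr)

x <- c(1:5, NA)

# NA_character_ matches the type of "low"; `missing` is used where the
# condition itself is NA (the last element)
if_else(x < 3, "low", NA_character_, missing = "unknown")
```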

Finally, `if_else()` is faster than the base R function. In the following benchmark it is roughly four times faster by median time:

```r
library(microbenchmark)

set.seed(7867)
x <- runif(10000)
microbenchmark(
  ifelse(x < 0.5, "a", "b"),
  if_else(x < 0.5, "a", "b")
)
```
```
## Unit: microseconds
##                        expr      min       lq      mean   median       uq
##   ifelse(x < 0.5, "a", "b") 1990.063 2041.839 2727.0666 2108.643 2945.422
##  if_else(x < 0.5, "a", "b")  457.695  478.701  668.0394  491.841  563.120
##        max neval
##  33813.663   100
##   1609.154   100
```

## `case_when()`

The `case_when()` function is a vectorised generalisation of chained `if` and `else if` statements.

It takes a sequence of two-sided formulas. The left-hand side of each formula is the condition to match, and the right-hand side is the value to return. It is very useful for replacing nested if-else statements.

Unfortunately, the function does not yet work inside `mutate()`, but according to Hadley Wickham this will be solved in a future version of `dplyr`.

In the meantime, it can be used inside `mutate()` with a simple workaround: refer to the columns as `.$x` and `.$y`, where `.` is the data frame passed in by the pipe:

```r
df <- data_frame(x = 1:10, y = 10:1)

df %>% mutate(z = case_when(.$x > 5 & .$y < 5 ~ 1,
                            .$x < 4 & .$y > 7 ~ 2,
                            TRUE ~ 0))
```
```
## # A tibble: 10 × 3
##        x     y     z
##    <int> <int> <dbl>
## 1      1    10     2
## 2      2     9     2
## 3      3     8     2
## 4      4     7     0
## 5      5     6     0
## 6      6     5     0
## 7      7     4     1
## 8      8     3     1
## 9      9     2     1
## 10    10     1     1
```

The formulas are evaluated in the order in which they are supplied, and the first one that matches wins. So in the previous example `z` is equal to zero only if all the previous conditions are false.
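The effect of ordering is easiest to see with overlapping conditions. The following sketch uses a plain vector and made-up labels, so it does not need the `mutate()` workaround:

```r
library(dplyr)

x <- c(1, 5, 10)

# "huge" is never returned: every value above 7 is also above 3,
# and the x > 3 formula comes first
general_first <- case_when(
  x > 3 ~ "big",
  x > 7 ~ "huge",
  TRUE  ~ "small"
)

# Putting the more specific condition first gives it a chance to match
specific_first <- case_when(
  x > 7 ~ "huge",
  x > 3 ~ "big",
  TRUE  ~ "small"
)

general_first
specific_first
```

As a rule of thumb, list the most specific conditions first and leave `TRUE ~ ...` as the final catch-all.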

## `na_if()`

`na_if()` replaces a given value with `NA`, which is useful when missing values are encoded with a sentinel value. Take the following data frame:

```r
df <- data_frame(x = c(-99, rnorm(3), -99, -99))
df
```
```
## # A tibble: 6 × 1
##             x
##         <dbl>
## 1 -99.0000000
## 2   1.1762076
## 3   1.1065201
## 4  -0.5262338
## 5 -99.0000000
## 6 -99.0000000
```

If we want to replace all the values of -99 with `NA`, we can do:

```r
df %>%
  mutate(x = na_if(x, -99))
```
```
## # A tibble: 6 × 1
##            x
##        <dbl>
## 1         NA
## 2  1.1762076
## 3  1.1065201
## 4 -0.5262338
## 5         NA
## 6         NA
```
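`na_if()` works on other vector types as well, and it pairs naturally with `coalesce()`: the first turns a sentinel into `NA`, the second turns `NA` into a default. A minimal sketch with made-up data, where empty strings stand for missing answers:

```r
library(dplyr)

answers <- c("yes", "", "no", "")

# Step 1: treat empty strings as missing
cleaned <- na_if(answers, "")

# Step 2: give the missing entries an explicit label
labelled <- coalesce(cleaned, "unknown")

cleaned
labelled
```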

## `near()`

`near()` allows the comparison of two floating point vectors within a given tolerance.

Let’s take the following vector:

```r
x <- c(sqrt(2) ^ 2, sqrt(3) ^ 2)
x[1] == 2
```
```
## [1] FALSE
```
```r
x[2] == 3
```
```
## [1] FALSE
```

We can compare the `x` vector with the `c(2, 3)` vector using `near()` [1]:

```r
near(x, c(2, 3))
```
```
## [1] TRUE TRUE
```

The function takes a third argument, `tol`, that defines the tolerance of the comparison. The default value is `.Machine$double.eps^0.5`, which is approximately `1.490116e-08`.
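The following sketch shows the tolerance in action. The values are chosen only for illustration, and the last line spells out the comparison that `near()` performs, namely checking that the absolute difference is below `tol`:

```r
library(dplyr)

# The classic floating-point surprise
exact_equal   <- (0.1 + 0.2) == 0.3
default_near  <- near(0.1 + 0.2, 0.3)

# A difference of 0.001 is far above the default tolerance...
strict_near   <- near(1, 1.001)
# ...but within a looser, explicitly supplied one
loose_near    <- near(1, 1.001, tol = 0.01)

# The same check written by hand
by_hand       <- abs(1 - 1.001) < 0.01

c(exact_equal, default_near, strict_near, loose_near, by_hand)
```

A looser `tol` is a judgment call that depends on the scale of the data, so it is usually worth stating it explicitly rather than relying on the default.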

[1] Confused about the results? If so, take a look at "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
