dplyr 0.5 - new functions: part II

2017-04-29

In the previous post I described the first five of the functions introduced by dplyr 0.5. The full list is:

  • coalesce()
  • case_when()
  • if_else()
  • na_if()
  • near()
  • recode()
  • union_all()
  • summarise_all() and mutate_all()
  • summarise_at() and mutate_at()
  • summarise_if() and mutate_if()
  • select_if()

In this post, I’ll describe the others. Meanwhile, the next version of dplyr is just around the corner, and will also bring new features.

recode()

The recode() function, as the name states, allows the recoding of a vector of values. There is also a similar function for factors: recode_factor().

Let’s take the following data_frame:

library(dplyr)
d_f <- data_frame(x = c(1:5, NA), y = letters[1:6])
d_f
## # A tibble: 6 × 2
##       x     y
##   <int> <chr>
## 1     1     a
## 2     2     b
## 3     3     c
## 4     4     d
## 5     5     e
## 6    NA     f

We can use recode() to change numeric or character values, but all replacements must be of the same type:

d_f %>% 
  mutate(y = recode(y, b = "bananas", c="coffee"))
## # A tibble: 6 × 2
##       x       y
##   <int>   <chr>
## 1     1       a
## 2     2 bananas
## 3     3  coffee
## 4     4       d
## 5     5       e
## 6    NA       f

We can also provide default and missing values:

d_f %>% 
  mutate(x = recode(x, `1` = 10, .default = 20, .missing = 30))
## # A tibble: 6 × 2
##       x     y
##   <dbl> <chr>
## 1    10     a
## 2    20     b
## 3    20     c
## 4    20     d
## 5    20     e
## 6    30     f
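
The recode_factor() variant mentioned above can be sketched in the same way. This is a minimal illustration on a made-up character vector (the name fruit is hypothetical), assuming the documented behaviour that replacement values become the leading factor levels in the order they are written:

```r
library(dplyr)

# Hypothetical input vector, used only for illustration
fruit <- c("a", "b", "c", "b")

# recode_factor() returns a factor; unmatched values such as "a" are kept
f <- recode_factor(fruit, b = "bananas", c = "coffee")

# Inspect the resulting level order and the recoded values
levels(f)
as.character(f)
```

Because the level order follows the order of the replacements, recode_factor() can be convenient when the recoded column is later used for plotting or modelling.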

union_all()

union_all() performs the same operation as bind_rows() for local data frames, or combine() for vectors. If the inputs are SQL sources, however, it maps to the SQL UNION ALL statement. The following is a simple example using vectors:

x <- 1:5
y <- 3:7

union_all(x, y)
##  [1] 1 2 3 4 5 3 4 5 6 7
combine(x,y)
##  [1] 1 2 3 4 5 3 4 5 6 7
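
To make the difference from a plain set union concrete, here is a small sketch on two local data frames. The data frames a and b are made up for illustration, and base data.frame() is used so the example does not depend on a particular tibble version:

```r
library(dplyr)

# Two hypothetical data frames that share the row id == 2
a <- data.frame(id = c(1, 2))
b <- data.frame(id = c(2, 3))

# union_all() keeps every row, including the duplicated id
all_rows <- union_all(a, b)

# union() treats the inputs as sets and drops the duplicate
distinct_rows <- union(a, b)

nrow(all_rows)       # 4
nrow(distinct_rows)  # 3
```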

summarise_all() and mutate_all()

These functions replace summarise_each() and mutate_each(), which will be deprecated in a future release. As their names suggest, they apply the same function to every column of a data frame.
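
As a minimal sketch of both functions, assuming a small all-numeric data frame (scores is a made-up name) and passing the function directly rather than through funs():

```r
library(dplyr)

# Hypothetical data frame with only numeric columns
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# summarise_all(): one row holding the mean of every column
col_means <- scores %>%
  summarise_all(mean)

# mutate_all(): same shape as the input, every column doubled
doubled <- scores %>%
  mutate_all(function(v) v * 2)
```

Here col_means has a single row (a = 2, b = 20), while doubled keeps all three rows of the input.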

summarise_at() and mutate_at()

These functions operate on a given set of columns of the data frame. The columns can be specified by a character vector of names, a numeric vector of positions, or with vars(), which accepts the same column specifications as select().

d_f <- data_frame(x = 1:5, y = 6:10, w = letters[1:5])
d_f
## # A tibble: 5 × 3
##       x     y     w
##   <int> <int> <chr>
## 1     1     6     a
## 2     2     7     b
## 3     3     8     c
## 4     4     9     d
## 5     5    10     e

Using the column positions:

d_f %>% 
  summarize_at(1:2, mean)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     3     8

We could also have used the column names:

d_f %>% 
  summarize_at(c("x", "y"), mean)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     3     8

Or, alternatively, with vars():

d_f %>% 
  summarize_at(vars(x:y), mean)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     3     8

We can specify more than one function to be evaluated:

d_f %>% 
  summarize_at(vars(x:y), .funs=funs(min, max))
## # A tibble: 1 × 4
##   x_min y_min x_max y_max
##   <int> <int> <int> <int>
## 1     1     6     5    10
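
The examples above all use summarize_at(); mutate_at() takes the same column specifications but keeps every row. A minimal sketch, assuming the same three-column layout built with base data.frame() and an illustrative rescaling function:

```r
library(dplyr)

d <- data.frame(x = 1:5, y = 6:10, w = letters[1:5],
                stringsAsFactors = FALSE)

# Rescale only x and y by their column maximum; w is left untouched
scaled <- d %>%
  mutate_at(vars(x:y), function(v) v / max(v))

# The same columns selected by name instead of with vars()
as_double <- d %>%
  mutate_at(c("x", "y"), as.numeric)
```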

summarise_if() and mutate_if()

summarise_if() and mutate_if() operate on the subset of columns for which a predicate function returns TRUE.

d_f <- data_frame(x = 1:5, y = 6:10, w = letters[1:5])
d_f
## # A tibble: 5 × 3
##       x     y     w
##   <int> <int> <chr>
## 1     1     6     a
## 2     2     7     b
## 3     3     8     c
## 4     4     9     d
## 5     5    10     e

We can calculate the mean of all numeric columns using:

d_f %>% 
  summarise_if(is.numeric, mean)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     3     8

We can also use a user-defined function as the predicate. Suppose we want the mean of only those numeric columns whose sum is greater than 20:

mysum <- function(x){
  if(is.numeric(x)){
    sum(x) > 20
  } else {
    FALSE
  }
}

d_f %>% 
  summarize_if(mysum, mean)
## # A tibble: 1 × 1
##       y
##   <dbl>
## 1     8

The same predicate can be written inline as an anonymous function:

d_f %>% 
  summarize_if(function(x) is.numeric(x) && sum(x) > 20, mean)
## # A tibble: 1 × 1
##       y
##   <dbl>
## 1     8
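
mutate_if() follows the same pattern, transforming only the columns that satisfy the predicate. A short sketch, again assuming the same column layout built with base data.frame():

```r
library(dplyr)

d <- data.frame(x = 1:5, y = 6:10, w = letters[1:5],
                stringsAsFactors = FALSE)

# Upper-case every character column; numeric columns pass through unchanged
shouted <- d %>%
  mutate_if(is.character, toupper)

shouted$w
```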

These functions are very similar in aim to map_if() and map_at() from the purrr package.
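
For comparison, a rough sketch of the purrr counterparts applied to a plain list (cols is a made-up name). Unlike the dplyr functions, these return a list rather than a data frame:

```r
library(purrr)

# Hypothetical list mirroring the columns used in this post
cols <- list(x = 1:5, y = 6:10, w = letters[1:5])

# map_if(): apply mean() only to elements that satisfy the predicate
by_predicate <- map_if(cols, is.numeric, mean)

# map_at(): apply mean() only to the named elements
by_name <- map_at(cols, c("x", "y"), mean)
```

In both results, x and y are replaced by their means (3 and 8) and w is returned unchanged.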

select_if()

The select_if() function selects columns for which the predicate is true. It only works for local sources. It has a similar usage to summarise_if() and mutate_if().

Continuing with the same data frame:

d_f <- data_frame(x = 1:5, y = 6:10, w = letters[1:5])
d_f
## # A tibble: 5 × 3
##       x     y     w
##   <int> <int> <chr>
## 1     1     6     a
## 2     2     7     b
## 3     3     8     c
## 4     4     9     d
## 5     5    10     e

we can select all numeric columns with:

d_f %>%
  select_if(is.numeric)
## # A tibble: 5 × 2
##       x     y
##   <int> <int>
## 1     1     6
## 2     2     7
## 3     3     8
## 4     4     9
## 5     5    10
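
As with summarise_if(), the predicate can be any function that returns a single TRUE or FALSE per column. A small sketch with an illustrative threshold, assuming the same column layout built with base data.frame():

```r
library(dplyr)

d <- data.frame(x = 1:5, y = 6:10, w = letters[1:5],
                stringsAsFactors = FALSE)

# Keep only the numeric columns whose mean is greater than 5
big_numeric <- d %>%
  select_if(function(col) is.numeric(col) && mean(col) > 5)

names(big_numeric)  # "y"
```

The is.numeric() check comes first so that mean() is never evaluated on the character column w.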