Microsoft Azure’s notebooks

Fri 26 May 2017 | tags: R

Recently Mike Croucher blogged about Microsoft Azure’s free Jupyter notebooks. He showed that the computational power provided by Microsoft Azure Notebooks is considerable. This is a free cloud service with Jupyter notebooks as its interface. Within those notebooks we can use Python (2.7 and 3.5.1), F#, and R (3.3), and we can install packages if needed. [2]

Although there is a 4 GB memory limit, the notebook has access to fast processors, eight in fact. [1] I was curious to see whether the service allowed parallel computing, and to my surprise it does. The machine runs Linux, but I was not able to fork processes. However, it was possible to create a PSOCK cluster and use it. Since this is a PSOCK cluster, the nodes do not share the same environment, which adds some setup cost to the parallelization, particularly if large objects must be passed to them.
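
As a minimal sketch of that setup cost: each PSOCK worker starts as a fresh R session, so any object a function relies on has to be copied over explicitly, for example with clusterExport(). The names scale_factor and scale_it are hypothetical and only serve the illustration, and the cluster uses two workers rather than all eight.

```r
library(parallel)

# Hypothetical objects for illustration; they are not part of the notebook
scale_factor <- 10
scale_it <- function(x) x * scale_factor

# Two workers are enough to show the idea
cl <- makeCluster(2, type = "PSOCK")

# Workers are fresh R sessions, so scale_factor does not exist there yet.
# Copy it from the calling environment to every worker.
clusterExport(cl, varlist = "scale_factor", envir = environment())

res <- parLapply(cl, 1:4, scale_it)
stopCluster(cl)

unlist(res)
# [1] 10 20 30 40
```

The export is cheap for a single number, but the same copy happens for every worker, which is why large objects make a PSOCK cluster more expensive to set up than forked processes.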

I have created a toy example that computes the sum of the product of two \(1000 \times 1000\) matrices of random normal variates. The calculation is performed 100 times, first in single-threaded mode and then in parallel, taking advantage of all processors. The notebook, with the code and results, is available here.

cores <- getOption("mc.cores", parallel::detectCores())
# Get the operating system
Sys.info()[['sysname']]

# Creates a simple function that generates two matrices (1000x1000) of normal
# random variates, multiplies them, and returns the sum of that product.
# Assumed seeding scheme: each call seeds the generator with seed + x, so the
# single-threaded and parallel runs draw the same numbers.
# There are better ways of dealing with random number generators with parallel
# computations. Check the help for the clusterSetRNGStream() function
sum_product <- function(x, seed){
    set.seed(seed + x)
    sum(matrix(rnorm(1000000), ncol=1000, nrow=1000) %*% matrix(rnorm(1000000), ncol=1000, nrow=1000))
}

# Base seed; the value is arbitrary (assumed, not taken from the notebook)
s <- 1

# The computation task that I'm testing consists of
# calling the function sum_product 100 times
aux <- 1:100
lsets <- as.list(aux)

# Time required to perform the computation single threaded
system.time(results_st <- lapply(lsets, sum_product, seed = s))
       user  system elapsed 
     26.492   1.256  22.743 

# Create the cluster
cl <- parallel::makeCluster(cores, type="PSOCK")

# Time required to perform the computation in parallel
system.time(results_p <- parallel::parLapply(cl, lsets, sum_product, seed = s))
       user  system elapsed 
      0.000   0.004   4.797 

# Check that both runs produced the same results
all.equal(results_st, results_p)

# Shut down the worker processes
parallel::stopCluster(cl)
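
The code comments point to clusterSetRNGStream() as a better way of handling random numbers in parallel. A minimal sketch of that approach follows, assuming a small two-worker cluster; the iseed value and the toy task are arbitrary choices for illustration and are not what the notebook ran.

```r
library(parallel)

cl <- makeCluster(2, type = "PSOCK")

# Give each worker its own L'Ecuyer-CMRG random number stream
clusterSetRNGStream(cl, iseed = 123)
run1 <- parLapply(cl, 1:4, function(i) sum(rnorm(10)))

# Re-seeding with the same iseed restarts the same streams
clusterSetRNGStream(cl, iseed = 123)
run2 <- parLapply(cl, 1:4, function(i) sum(rnorm(10)))

stopCluster(cl)

# TRUE when both runs used the same seed and the same number of workers
identical(run1, run2)
```

The streams belong to the workers rather than to the tasks, so results seeded this way are reproducible only for a fixed number of workers, and they will not match a single-threaded run.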

It is interesting to note that the single-threaded computation runs faster here than on either my desktop or my laptop. Microsoft is providing considerable computing power for free.
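
For a rough sense of what the cluster buys, the elapsed times reported above can be turned into a speedup and a parallel efficiency. This is only arithmetic on the two timings, assuming all eight processors were used as workers.

```r
# Elapsed times (seconds) reported by system.time() above
elapsed_single   <- 22.743
elapsed_parallel <- 4.797
workers <- 8  # assumes the cluster used all eight processors

speedup    <- elapsed_single / elapsed_parallel
efficiency <- speedup / workers

round(c(speedup = speedup, efficiency = efficiency), 2)
#    speedup efficiency 
#       4.74       0.59 
```

The gap between 4.74 and the ideal factor of 8 is in line with the setup and communication costs of a PSOCK cluster mentioned earlier.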

  1. The FAQs are available here

  2. I prefer RStudio’s R Notebooks, but beggars can’t be choosers.