All Posts

Number of blog posts: 11

A tale of file sizes

15 November 2017
I recently helped a PhD student read and merge about 150 CSV files. I used R, but the student wanted to use Stata later, so I used the haven package to export to the Stata 14 native file format. There is nothing to report about the process: everything was quick and easy and worked as expected. But I noticed that the .dta file (the Stata file) was substantially larger than the original data. The original CSV files were a little above 8GB, and the consolidated R file was about 1.3GB (no surprise here, as R saves files in a compressed format), but the Stata file was 33.
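The workflow described above can be sketched in a few lines of R. This is a minimal, hedged illustration, not the post's actual script: the directory and column names are made up, and the export uses haven's write_dta (the package named in the post), guarded in case haven is not installed.

```r
# A minimal sketch of the read-merge-export workflow described above.
# The files and columns are hypothetical stand-ins for the ~150 real CSVs.

# Create two small CSV files to stand in for the real data
dir <- tempfile("csvs"); dir.create(dir)
write.csv(data.frame(id = 1:3, x = rnorm(3)), file.path(dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:6, x = rnorm(3)), file.path(dir, "b.csv"), row.names = FALSE)

# Read every CSV in the directory and stack them into one data frame
files  <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
merged <- do.call(rbind, lapply(files, read.csv))

# Save in R's compressed native format, then export to Stata 14 via haven
saveRDS(merged, file.path(dir, "merged.rds"))
if (requireNamespace("haven", quietly = TRUE)) {
  haven::write_dta(merged, file.path(dir, "merged.dta"), version = 14)
}
```

Comparing `file.size()` of the .rds and .dta outputs on real data would reproduce the size gap the post discusses.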

Best Practices for Scientific Computing

25 September 2017
Scientific computation is nowadays an integral part of research in many scientific fields. But the majority of researchers have never had any formal training on how to structure, maintain, and collaborate on such projects. I had to learn the hard way, and there are several resources nowadays that I wish I had had access to when I was starting. A good start is to read the papers Good enough practices in scientific computing and Best Practices for Scientific Computing. They provide apt recommendations on everything from data management, collaboration, project organisation, and revision control systems to the writing of manuscripts. I also recommend The Plain Person’s Guide to Plain Text Social Science by Kieran Healy.

R's JIT compiler

11 September 2017
A couple of days ago I was giving a course on R, and I used the following example (the function calculates the value of an American call option using a binomial tree):

library(fOptions)
system.time(
  CRRBinomialTreeOption(TypeFlag = "ca", S = 50, X = 50, Time = 5/12,
                        r = 0.1, b = 0.1, sigma = 0.4, n = 2000)@price
)
##    user  system elapsed
##   7.807   0.034   8.002

I was puzzled that it took about 8 seconds to compute on my laptop, while on the lab computers it took less than 2 seconds. Since my laptop has a relatively recent CPU, similar to the ones in the lab computers, I could not explain the difference in performance.
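The excerpt stops before the explanation, but the post's title points at R's byte-code compiler. As a hedged illustration of what byte-compilation does (the toy function below is made up for this sketch, not taken from the post):

```r
# Illustration of R's byte-code compiler; the "compiler" package ships with R.
# slow_sum is a deliberately loop-heavy toy function invented for this sketch.
library(compiler)

slow_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

# Byte-compile the same function; semantics are unchanged
fast_sum <- cmpfun(slow_sum)

slow_sum(100)  # 5050
fast_sum(100)  # 5050
```

On recent R versions the JIT byte-compiles functions automatically (see `compiler::enableJIT`), which is one plausible source of a large timing difference between two R installations running identical code.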

Speeding up an OLS regression in R

3 June 2017
Yesterday a colleague told me that he was surprised by the speed differences of R, Julia, and Stata when estimating a simple linear model for a large number of observations (10 million). Julia and Stata had very similar results, although Stata was faster. But both Julia and Stata were substantially faster than R. He is an expert and long-time Stata user and was sceptical about the performance difference. I was also surprised. My experience is that R is usually fast for an interpreted language, and most procedures that are computationally intensive are developed in a compiled language like C, Fortran, or C++, reducing any performance penalty.
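One common way to close such a gap, shown here as a hedged sketch rather than the post's actual benchmark, is to skip lm()'s formula machinery and call a lower-level fitting routine directly. The simulated data below is illustrative only:

```r
# Sketch: lm() versus the low-level .lm.fit() on simulated data.
# The data-generating process here is invented for illustration.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Convenient but slower: parses a formula and builds a model frame
fit_lm <- lm(y ~ x)

# Faster: pass the design matrix straight to the low-level QR fitter
X       <- cbind(1, x)
fit_low <- .lm.fit(X, y)

# Both routes give the same coefficients (up to floating-point noise)
all.equal(unname(coef(fit_lm)), unname(fit_low$coefficients))
```

Wrapping both calls in `system.time()` on 10 million observations makes the difference in overhead visible; the estimates themselves are identical because both use the same QR decomposition underneath.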

Microsoft Azure's notebooks

26 May 2017
Recently Mike Croucher blogged about Microsoft Azure’s free Jupyter notebooks. He showed that the computational power provided by Microsoft Azure Notebooks is quite considerable. This is a free cloud service, and Jupyter notebooks power the interface. Within those notebooks we can use Python (2.7 and 3.5.1), F#, and R (3.3), and we can install packages if needed. Although there is a 4GB memory limit, the notebook has access to fast processors, eight in fact. I was curious to see whether the service allowed parallel computing, and to my surprise it does. The machine runs Linux, but I was not able to fork processes.
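When forking is unavailable, R's base parallel package can still distribute work over a socket cluster, which works on any platform. A minimal sketch (the workload here is a made-up toy, not the post's experiment):

```r
# Parallel computing without fork(): a PSOCK (socket) cluster from the
# base "parallel" package. Workers are separate R processes, so this
# works even where forking is not possible.
library(parallel)

cl  <- makePSOCKcluster(2)                  # start two worker processes
res <- parLapply(cl, 1:4, function(i) i^2)  # square each element in parallel
stopCluster(cl)                             # always shut the workers down

unlist(res)  # 1 4 9 16
```

Unlike `mclapply()`, which relies on fork and silently falls back to serial execution on platforms without it, a PSOCK cluster requires explicitly exporting any data the workers need (`clusterExport`), but it runs everywhere.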