A tale of file sizes

Wed 15 November 2017 | tags: Scientific Computing, R

I recently helped a PhD student read and merge about 150 CSV files. I used R, but the student wanted to use Stata later, so I used the haven package to export the result to the Stata 14 native file format.

There is nothing to report about the process: everything was quick and easy, and it worked as expected. But I noticed that the .dta file (the Stata file) was substantially larger than the original data. The original CSV files came to a little above 8GB and the consolidated R file was about 1.3GB (no surprise there, as R saves its files in a compressed format), but the Stata file was 33.7GB.

I was surprised that the Stata file was about four times the size of the original CSV files. I suspected that this could be a problem with the function I used to create the file (write_dta from the haven package), or just a feature of this file type.

Data formats differ in what they try to optimise: some are tuned for reading speed, others for writing speed, others still for file size, and so on. How well a format performs can also depend on the hardware used (the amount of RAM, or whether you have an SSD or a hard disk) and on the type of data the file contains.

I asked a local Stata genius at my school (Miguel Portela) what was going on, and he told me that if the data contains strings, Stata is going to produce a file larger than the original data.
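A rough way to see why strings can inflate a file: Stata's classic string types (str1 to str2045) are fixed width, so every row of a string variable takes as many bytes as that variable's declared width. The sketch below is only an illustration under that assumption, using made-up data and hypothetical helper names (csv_bytes, fixed_width_bytes). It does not reproduce what write_dta actually did with this particular file.

```python
# Toy comparison of variable-width (CSV-like) and fixed-width string storage.
# The data and both helper functions are hypothetical, for illustration only.

def csv_bytes(column):
    """Bytes used when each value is stored at its own length plus a 1-byte separator."""
    return sum(len(value.encode("utf-8")) + 1 for value in column)

def fixed_width_bytes(column):
    """Bytes used when every row is padded to the width of the longest value."""
    width = max(len(value.encode("utf-8")) for value in column)
    return width * len(column)

# 999 short values and a single long one: the long value sets the width for all rows.
column = ["ok"] * 999 + ["a much longer free-text comment"]

variable = csv_bytes(column)
fixed = fixed_width_bytes(column)

print(f"variable-width: {variable} bytes")
print(f"fixed-width:    {fixed} bytes")
print(f"ratio:          {fixed / variable:.1f}x")
```

In this toy case a single long value in an otherwise short column is enough to make the fixed-width layout several times larger, which is the kind of effect that could explain a Stata file outgrowing its CSV sources.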

We ran some tests on the same hardware, and Stata read the file (in its native format) more quickly than R read the same data in its own native format (both compressed and uncompressed). So the Stata format is optimised for reading speed, while the R native format is optimised for file size.
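The size versus speed trade-off can be tried out on a small scale. The following is a minimal sketch with made-up data, using only the Python standard library and gzip as a stand-in for a compressed format. It is not the test described above. The timed_read helper is hypothetical, and timings on a small file are noisy and depend on the disk cache, so only the file sizes are reliable here.

```python
import gzip
import os
import tempfile
import time

# Made-up, repetitive CSV-like payload (real data will compress differently).
rows = [f"{i},{i % 7},category_{i % 5}\n" for i in range(200_000)]
raw = "".join(rows).encode("utf-8")

def timed_read(open_fn, path):
    """Read a whole file with the given opener; return (data, elapsed seconds)."""
    start = time.perf_counter()
    with open_fn(path, "rb") as handle:
        data = handle.read()
    return data, time.perf_counter() - start

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "data.csv")
    gz_path = os.path.join(tmp, "data.csv.gz")

    # Write the same bytes twice: once as-is, once gzip-compressed.
    with open(plain_path, "wb") as handle:
        handle.write(raw)
    with gzip.open(gz_path, "wb") as handle:
        handle.write(raw)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    plain_data, plain_secs = timed_read(open, plain_path)
    gz_data, gz_secs = timed_read(gzip.open, gz_path)

print(f"uncompressed: {plain_size} bytes, read in {plain_secs:.4f}s")
print(f"compressed:   {gz_size} bytes, read in {gz_secs:.4f}s")
```

The compressed copy is smaller on disk, but reading it back means decompressing it first. Whether that ends up slower or faster than reading the larger file depends on the data and the hardware.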

In R there are several file formats that are optimised for reading speed, such as fst or feather. On my computer, I got a substantial speed gain with the feather file format for this particular file.

In R you get to choose between file size and read/write speed, and the differences can be substantial for very large files.