Saturday, April 30, 2016

Fast File Concatenation in Python

Ed. 2016: I never published this post; it's probably incomplete, but seeing as I have no recollection of what I'd intended to add, I'm just publishing what's here now.

In order to hook into the underlying version control for history tracking and distribution, Abundant stores each issue in a separate file. This is generally advantageous, but it runs headfirst into trouble when attempting to do operations on the database as a whole, like listing or querying. With each issue in a separate file, Abundant has to read every file in turn, paying a disk seek for each one. There are many possible steps we could take to minimize query time, such as loading our issues into a database like SQLite, but if something simpler is good enough, we can avoid that sort of underlying complexity (ensuring the database stays in sync with the actual issue files, and efficiently updating it when it doesn't, is far from trivial).
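
To make the cost concrete, a listing or query over such a database amounts to a loop like the following. This is only a minimal sketch; the db directory layout and the 'status' field are hypothetical stand-ins, not Abundant's actual format.

    import glob
    import json
    import os

    def load_issues(db_dir):
        """Load every issue in the database, one file (and one disk seek) at a time."""
        issues = []
        for path in glob.glob(os.path.join(db_dir, '*.json')):
            with open(path) as f:
                issues.append(json.load(f))
        return issues

    # Any list or query operation is then just a filter over that full scan:
    open_issues = [i for i in load_issues('db') if i.get('status') == 'open']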

O(n) query time isn't too terrible, but avoiding disk seeks could potentially net dramatic improvements. So what if we concatenated all the files into one big document we could read in one seek? Sounds like a potentially huge improvement!

First I threw together a quick test just to verify my initial assumption. This Python file loads 50,000 JSON files into dicts one by one, then concatenates all the files into one, and finally loads the JSON objects back into dicts reading from the single file instead of the directory. The results are what you might expect, but still impressive.

read_files took 12.892000
concatenate took 4.496000
read_file took 0.939000
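
The script itself isn't reproduced here, but it was essentially the following. This is a rough sketch: the paths, the one-JSON-object-per-line cache format, and the helper names are placeholders rather than the exact code that produced these numbers.

    import glob
    import json
    import time

    FILES = sorted(glob.glob('db/*.json'))  # ~50,000 small JSON files (hypothetical location)
    CACHE = 'cache.json'

    def timed(fn):
        start = time.time()
        fn()
        print('%s took %f' % (fn.__name__, time.time() - start))

    def read_files():
        # Parse every issue file individually - one open/seek/read per file.
        issues = []
        for path in FILES:
            with open(path) as f:
                issues.append(json.load(f))
        return issues

    def concatenate():
        # Write all the files into one big cache file, one JSON object per line
        # (assumes each source file holds a single object with no embedded newlines).
        with open(CACHE, 'w') as out:
            for path in FILES:
                with open(path) as f:
                    out.write(f.read().strip() + '\n')

    def read_file():
        # Parse the same data back out of the single concatenated file.
        with open(CACHE) as f:
            return [json.loads(line) for line in f]

    for fn in (read_files, concatenate, read_file):
        timed(fn)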

Eliminating the disk seeks improves our load time nearly fourteen-fold (12.892s down to 0.939s), and even the concatenation pass, which reads and rewrites every file, is significantly faster than reading and parsing each file individually. So now that we've established that caching our JSON files together is hugely beneficial, there are two more questions to answer: 1) what's the best (fastest, most stable) way to create a cache file, and 2) how can we efficiently detect when the cache has been invalidated?

Fast File Concatenation

To identify a "best" concatenation method, I tested three different ways of reading a set of files in Python: file.readlines(), file.read(), and shutil.copyfileobj(). I also wanted to compare this against letting the OS do the heavy lifting with cat or another program, so I tested type and copy on Windows, and cat and xargs cat on Linux. Additionally, on Windows I tested both a native Python install via cmd.exe and a Cygwin instance, and on Linux I timed the same behavior with the file cache cleared (look for /proc/sys/vm/drop_caches).
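
For reference, the three Python approaches look roughly like this (a sketch, reusing the hypothetical file list and output name from above; not the verbatim benchmark code):

    import glob
    import shutil

    files = sorted(glob.glob('db/*.json'))  # hypothetical input set

    def concat_shutil(paths, dest='cache.json'):
        # shutil.copyfileobj streams each source file into the output in chunks.
        with open(dest, 'wb') as out:
            for path in paths:
                with open(path, 'rb') as src:
                    shutil.copyfileobj(src, out)

    def concat_readlines(paths, dest='cache.json'):
        # file.readlines() pulls each file in as a list of lines, written back out as-is.
        with open(dest, 'w') as out:
            for path in paths:
                with open(path) as src:
                    out.writelines(src.readlines())

    def concat_read(paths, dest='cache.json'):
        # file.read() slurps each file into a single string.
        with open(dest, 'w') as out:
            for path in paths:
                with open(path) as src:
                    out.write(src.read())
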
                           Windows 7 - i7 3.4GHz 16GB                          Ubuntu 11.10 - i7 3.4GHz 16GB
                           HD - cmd   HD - Cygwin   SSD - cmd   SSD - Cygwin   HD          SSD         HD - cold cache   SSD - cold cache
Load All Files             10.851     14.876851     16.696      17.167982      1.851240    1.602277    339.545958        15.793162
Concatenate (Readlines)    4.980      8.296474      4.866       7.749443       0.618610    0.586936    338.830646        14.245564
Load Concat'ed File        0.538      3.946225      0.985       7.134408       0.829064    0.832572    0.854499          0.848297
Python: ShUtil             4.795      8.552490      4.304       7.918453       0.598918    0.608400    338.379066        14.215956
Python: Readlines          5.011      8.381479      4.900       7.959456       0.597912    0.605008    339.995610        14.182779
Python: Read               8.927      10.070576     15.003      9.380537       0.802220    0.761661    339.974461        14.460762
Shell: Copy / Cat          5.679      6.775388      7.010       5.899337       FAILED      FAILED      FAILED            FAILED
Shell: Type / Xargs        3.448      8.527488      7.411       6.279359       0.343088    0.401653    332.400320        12.669442

Timing (in seconds) of methods of concatenating 50,000 JSON files

We can see immediately that cat failed completely on Linux, something I didn't expect. The exact error message is /bin/sh: /bin/cat: Argument list too long, indicating that there's a limit to the size of the argument list a single command can receive; with 50,000 file names on the command line, cat can't be invoked in one go. Using xargs gets around this issue, at the cost of some speed.
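
For what it's worth, the workaround looks something like this (a sketch of the general idea rather than the exact commands timed above; db/ and cache.json are the same placeholder names as before):

    import subprocess

    # 'cat db/*.json > cache.json' dies once the shell expands the glob past the
    # kernel's argument-length limit (ARG_MAX). Piping the file list through xargs
    # lets it batch the names into several cat invocations, each under the limit.
    with open('cache.json', 'w') as out:
        subprocess.check_call('find db -name "*.json" -print0 | xargs -0 cat',
                              shell=True, stdout=out)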