Abundant: Fast File Concatenation in Python

Ed. 2016: I never published this post; it's probably incomplete but seeing as I have no recollection of what I'd intended to add, I'm just publishing what's here now.

In order to hook into the underlying version control for history tracking and distribution, Abundant stores each issue in a separate file. This is generally advantageous, but it runs headfirst into trouble when attempting to do operations on the database as a whole, like listing or querying. With each issue in a separate file, Abundant has to iterate over each file in turn, necessitating a disk seek between each file. There's many possible steps we could take to minimize query time, potentially loading our issues up into a database like SQLite, but if something simpler is good enough, we can avoid that sort of underlying complexity (ensuring the db is up to date with the actual issue files, and efficiently updating the db when it isn't is far from trivial).

O(n) query time isn't too terrible, but avoiding disk seeks could potentially net dramatic improvements. So what if we concatenated all the files into one big document we could read in one seek? Sounds like a potentially huge improvement!

First I threw together a quick test just to verify my initial assumption. This Python file loads 50,000 JSON files into dicts one by one, then concatenates all the files into one, and finally loads the JSON objects back into dicts reading from the single file instead of the directory. The results are what you might expect, but still impressive.

read_files took 12.892000
concatenate took 4.496000
read_file took 0.939000

Eliminating the disk-seeks improves our load time by approximately twelve-fold, and even the read-write loop is significantly faster than read-load. So now that we've established that caching our JSON files together is hugely beneficial, there's two more questions to answer; 1) what's the best (fastest, most stable) way to create a cache file, and 2) how can we efficiently detect when the cache has been invalidated?

Fast File Concatenation

To identify a "best" concatenation method, I tested three different ways of reading a set of files in Python, File.readlines(), File.read()
, and ShUtil.copyfileobj(). I also wanted to compare this against letting the OS do the heavy lifting, with cat or another program, so I tested Type and Copy on Windows, and Cat and Xargs Cat on Linux. Additionally, on Windows I tested both a native Python install via cmd.exe and a Cygwin instance, and on Linux I timed the same behavior with the file cache cleared (look for /proc/sys/vm/drop_caches).

Timing methods of concatenating 50,000 JSON files
	Windows 7 - i7 3.4GHz 16GB				Ubuntu 11.10 - i7 3.4GHz 16GB
	HD - cmd	HD - Cygwin	SSD - cmd	SSD - Cygwin	HD	SSD	HD - cold cache	SSD- cold cache
Load All Files	10.851	14.876851	16.696	17.167982	1.851240	1.602277	339.545958	15.793162
Concatenate (Readlines)	4.980	8.296474	4.866	7.749443	0.618610	0.586936	338.830646	14.245564
Load Concat'ed File	0.538	3.946225	0.985	7.134408	0.829064	0.832572	0.854499	0.848297

Python: ShUtil	4.795	8.552490	4.304	7.918453	0.598918	0.608400	338.379066	14.215956
Python: Readlines	5.011	8.381479	4.900	7.959456	0.597912	0.605008	339.995610	14.182779
Python: Read	8.927	10.070576	15.003	9.380537	0.802220	0.761661	339.974461	14.460762
Shell: Copy / Cat	5.679	6.775388	7.010	5.899337	FAILED	FAILED	FAILED	FAILED
Shell: Type / Xargs	3.448	8.527488	7.411	6.279359	0.343088	0.401653	332.400320	12.669442

We can see immediately that cat failed completely on Linux, something I didn't expect. The exact error message is /bin/sh: /bin/cat: Argument list too long indicating that there's a limit to the number of files cat can handle in one go. Using Xargs gets around this issue, at the cost of some speed.

Abundant

Pages

Saturday, April 30, 2016

Fast File Concatenation in Python

Fast File Concatenation

No comments:

Post a Comment