In order to hook into the underlying version control for history tracking and distribution, Abundant stores each issue in a separate file. This is generally advantageous, but it runs headfirst into trouble when attempting operations on the database as a whole, like listing or querying. With each issue in its own file, Abundant has to iterate over every file in turn, necessitating a disk seek between each one. There are many possible steps we could take to minimize query time, such as loading our issues into a database like SQLite, but if something simpler is good enough, we can avoid that sort of underlying complexity (ensuring the db is up to date with the actual issue files, and efficiently updating it when it isn't, is far from trivial).
O(n) query time isn't too terrible, but avoiding disk seeks could potentially net dramatic improvements. So what if we concatenated all the files into one big document we could read in a single seek? Sounds like a potentially huge improvement!
First I threw together a quick test just to verify my initial assumption. This Python file loads 50,000 JSON files into dicts one by one, then concatenates all the files into one, and finally loads the JSON objects back into dicts reading from the single file instead of the directory. The results are what you might expect, but still impressive.
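I won't walk through the whole script here, but a minimal sketch of the idea looks something like this (the `issues` directory, the cache file name, and the timing helper are illustrative assumptions, not the actual test code):

```python
import json
import os
import time

ISSUE_DIR = "issues"          # directory of individual JSON issue files (assumed name)
CACHE_FILE = "issues.cache"   # concatenated output, one JSON object per line (assumed name)

def timed(label, fn):
    """Run fn and report how long it took, in seconds."""
    start = time.time()
    fn()
    print("%s took %f" % (label, time.time() - start))

def read_files():
    # Load every JSON file in the directory, one dict at a time (one seek per file).
    issues = []
    for name in os.listdir(ISSUE_DIR):
        with open(os.path.join(ISSUE_DIR, name)) as src:
            issues.append(json.load(src))
    return issues

def concatenate():
    # Write all issue files into a single cache file, one object per line.
    # Valid JSON never contains literal newlines inside strings, so stripping
    # them keeps each object parseable on its own line.
    with open(CACHE_FILE, "w") as out:
        for name in os.listdir(ISSUE_DIR):
            with open(os.path.join(ISSUE_DIR, name)) as src:
                out.write(src.read().replace("\n", "") + "\n")

def read_file():
    # Load every JSON object back out of the single cache file (one seek total).
    with open(CACHE_FILE) as cache:
        return [json.loads(line) for line in cache]

timed("read_files", read_files)
timed("concatenate", concatenate)
timed("read_file", read_file)
```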
```
read_files took 12.892000
concatenate took 4.496000
read_file took 0.939000
```
Eliminating the disk seeks improves our load time nearly fourteen-fold (12.9s down to 0.94s), and even the concatenation pass, which reads and rewrites every file, is significantly faster than loading the individual files. So now that we've established that caching our JSON files together is hugely beneficial, there are two more questions to answer: 1) what's the best (fastest, most stable) way to create a cache file, and 2) how can we efficiently detect when the cache has been invalidated?
Fast File Concatenation
To identify a "best" concatenation method, I tested three different ways of reading a set of files in Python,File.readlines()
, File.read()
, and
ShUtil.copyfileobj()
. I also wanted to compare this against letting the OS do the heavy lifting, with cat
or another program, so I tested Type
and Copy
on Windows, and Cat
and Xargs Cat
on Linux. Additionally, on Windows I tested both a native Python install via cmd.exe
and a Cygwin instance, and on Linux I timed the same behavior with the file cache cleared (look for /proc/sys/vm/drop_caches
).
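For reference, the three Python approaches boil down to something like the following sketch (the function names and the `paths` / `out_path` parameters are illustrative, not the benchmark's actual code):

```python
import shutil

def concat_readlines(paths, out_path):
    # file.readlines(): pull each source in as a list of lines, then write them out
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                out.writelines(src.readlines())

def concat_read(paths, out_path):
    # file.read(): slurp each source into a single string and write it
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                out.write(src.read())

def concat_copyfileobj(paths, out_path):
    # shutil.copyfileobj(): stream each source into the output in fixed-size chunks
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                shutil.copyfileobj(src, out)
```

The shell rows do the same job by redirecting `copy`/`type` (Windows) or `cat`/`xargs cat` (Linux) into the cache file.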
Both machines are an i7 3.4GHz with 16GB of RAM, running Windows 7 and Ubuntu 11.10 respectively; all times are in seconds.

| | Win7 HD (cmd) | Win7 HD (Cygwin) | Win7 SSD (cmd) | Win7 SSD (Cygwin) | Ubuntu HD | Ubuntu SSD | Ubuntu HD (cold cache) | Ubuntu SSD (cold cache) |
|---|---|---|---|---|---|---|---|---|
| Load All Files | 10.851 | 14.876851 | 16.696 | 17.167982 | 1.851240 | 1.602277 | 339.545958 | 15.793162 |
| Concatenate (Readlines) | 4.980 | 8.296474 | 4.866 | 7.749443 | 0.618610 | 0.586936 | 338.830646 | 14.245564 |
| Load Concat'ed File | 0.538 | 3.946225 | 0.985 | 7.134408 | 0.829064 | 0.832572 | 0.854499 | 0.848297 |
| Python: ShUtil | 4.795 | 8.552490 | 4.304 | 7.918453 | 0.598918 | 0.608400 | 338.379066 | 14.215956 |
| Python: Readlines | 5.011 | 8.381479 | 4.900 | 7.959456 | 0.597912 | 0.605008 | 339.995610 | 14.182779 |
| Python: Read | 8.927 | 10.070576 | 15.003 | 9.380537 | 0.802220 | 0.761661 | 339.974461 | 14.460762 |
| Shell: Copy / Cat | 5.679 | 6.775388 | 7.010 | 5.899337 | FAILED | FAILED | FAILED | FAILED |
| Shell: Type / Xargs | 3.448 | 8.527488 | 7.411 | 6.279359 | 0.343088 | 0.401653 | 332.400320 | 12.669442 |
We can see immediately that `cat` failed completely on Linux, something I didn't expect. The exact error message is `/bin/sh: /bin/cat: Argument list too long`, indicating that there's a limit to the number of files `cat` can be handed in one go. Using `xargs` gets around this issue, at the cost of some speed.
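As an illustration of the workaround (not necessarily how the benchmark invoked it; the `issues` directory and `issues.cache` name are placeholders):

```python
import subprocess

# "cat issues/*" puts every filename onto a single command line; with tens of
# thousands of files that exceeds the kernel's argument-length limit and fails
# with "Argument list too long". Piping the names through xargs instead splits
# them across as many cat invocations as needed.
with open("issues.cache", "w") as cache:
    subprocess.check_call("ls | xargs cat", shell=True, stdout=cache, cwd="issues")
```

(This simple form assumes filenames without spaces or other characters that xargs would mangle.)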