Saturday, June 23, 2012

Revisiting Fast Prefix Lookup

Abundant relies heavily on letting the user specify arguments via unique prefixes, rather than needing to type out the entire string (issue IDs in particular are 40 random characters, which would be a hassle to type in full). As such, I looked into efficient prefix lookup data structures previously, and implemented what I thought would be the best choice.

But when I started testing Abundant at scale (60,000 issues or so) I discovered some noticeable slowdowns in execution time - ab add would take upwards of 4 seconds to run, and the bottleneck was almost entirely in prefix lookup!

Tries are very efficient at prefix lookup, taking O(m) time, where m is the number of characters in the requested prefix. Not only is that fast, but it is independent of the number of items in the trie - perfect for scaling. What I didn't take into account while researching, however, was the construction cost of a trie, which as it turns out is substantial: building one touches essentially every character of every key, allocating nodes as it goes. Since Abundant is a command line utility, it has no persistent memory between executions, meaning we have to re-construct the trie every time ab is run.
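For a sense of where that cost comes from, here is a minimal trie sketch (purely illustrative - not Abundant's actual implementation): every insert walks the key one character at a time, and each unshared character allocates a new node.

class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character to the next TrieNode
        self.terminal = False  # True if a complete key ends at this node

def trie_insert(root, key):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def trie_prefix_exists(root, prefix):
    # O(m) in the prefix length, independent of how many keys are stored
    node = root
    for ch in prefix:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True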

Upon realizing the initialization cost was my bottleneck, I first looked into ways of persisting the structure between executions, such as pickling the object to disk or hooking into some persistent memory space behind the scenes. None of these options were ultimately viable, however, as they introduced more problems than they solved (race conditions and stale data in particular) and dramatically increased the complexity of Abundant.

So I backtracked, and started thinking about prefix lookup implementations that might not be as fast to query, but would be faster to construct, and I realized a simple binary search over a sorted list could be fairly painlessly converted into a prefix lookup algorithm. Standard binary search routines return an index: either the (presumed) location of the object (Python's bisect_left) or the insertion point just past any equal items (bisect_right). These two functions essentially enable prefix lookup - bisecting on the prefix identifies the first potential match, bisecting just past the prefix's range identifies the index after the last match, and comparing the two indices tells us whether we found no match, a unique match, or an ambiguous one.
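As a rough sketch of the technique (not Abundant's exact code - in particular, the '\xff' sentinel assumes the keys are lowercase hex strings like SHA1 issue IDs):

import bisect

def prefix_range(keys, prefix):
    # keys must be sorted; '\xff' sorts after any hex digit, so it
    # bounds the block of keys that start with prefix
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + '\xff')
    return lo, hi

def unique_match(keys, prefix):
    lo, hi = prefix_range(keys, prefix)
    if hi - lo == 1:
        return keys[lo]  # exactly one key starts with this prefix
    if hi == lo:
        raise KeyError('no key starts with %r' % prefix)
    raise KeyError('%r is ambiguous (%d matches)' % (prefix, hi - lo))

With keys = sorted(ids), unique_match(keys, 'a3f') either returns the single ID starting with 'a3f' or reports that the prefix is missing or ambiguous.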

With this model, lookup slows down to O(log n), but that remains a perfectly viable complexity well past ~60,000 records. Furthermore, construction now simply requires sorting, which is O(n log n) asymptotically but still quite fast for Python to handle. Even better, if we can ensure the list arrives pre-sorted, construction is O(1)!

I re-implemented my prefix data structure as a binary search, and ran some benchmarks to compare their behavior. Before getting into the code, here are the results:
TRIE STRUCTURE
build_from_list took 3.334000
build_from_adds took 4.164000
add_to took 0.103000
get took 0.594000
prefix took 0.515000
iter took 0.664000

BINARY SEARCH STRUCTURE
build_from_list took 0.790000
build_from_adds took 20.589000
add_to took 1.631000
get took 1.302000
prefix took 1.128000
iter took 0.038000
The build_from_list method takes an unsorted list of 250,000 SHA1 hashes and constructs a prefix structure. Even with a quarter of a million records, the binary search structure took less than a second to construct, whereas my trie took four times as long. build_from_adds constructs an empty object, then adds a similar list in a loop, one at a time. The O(n) insertion behavior of the sorted list is apparent here, but I don't anticipate needing to add many items one at a time very often, and if I do, it will likely be simpler to just reconstruct the whole data structure.

add_to adds 10,000 records onto our existing list, and again the trie wins here, but again, 10,000 items is much, much more than I anticipate ever needing to add on the fly to my prefix table.

get looks up 10 arbitrary prefixes 10,000 times, and prefix computes the unique prefix of 10 items 10,000 times. The trie is approximately twice as fast as binary search on a quarter of a million records, but similarly this difference isn't going to be noticeable in production usage.

Finally, attempting to iterate over our structure with iter demonstrates that our trie is dramatically slower to iterate, since we have to access each node one at a time, rather than just stepping through a list.

Given that construction time is most critical for Abundant, and the other operations will only be done a few times per execution, a trie is quite obviously not the right way to do prefix lookup here.
You can see my current prefix implementation in Abundant's source code, and below is the code I used to create this test. Of course feedback is welcome.
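As a rough illustration of the kind of harness those numbers came from (hypothetical helper names standing in for the real structures - this is not the original benchmark script):

import bisect
import hashlib
import os
import time

def random_ids(n):
    # 40-character SHA1 hex digests, like Abundant issue IDs
    return [hashlib.sha1(os.urandom(20)).hexdigest() for _ in range(n)]

def timed(label, func, *args):
    start = time.time()
    result = func(*args)
    print('%s took %f' % (label, time.time() - start))
    return result

ids = random_ids(250000)
# build_from_list for the sorted-list structure is just a sort
keys = timed('build_from_list', sorted, ids)

def get_many(keys, prefixes, reps):
    # look up each prefix `reps` times via binary search
    for _ in range(reps):
        for p in prefixes:
            bisect.bisect_left(keys, p)

timed('get', get_many, keys, [k[:8] for k in keys[:10]], 10000)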

What's Next

My thesis included the "release" of Abundant 0.3, which was somewhat minimal, but usable. In it, I identified several specific missing or incomplete features that would be necessary for a 1.0 release. I'll enumerate them here, and try to write a blog post (or several) as I resolve them.

Each of these goals correlates to an increment of the minor version number, and accomplishing all of these goals would effectively indicate Abundant is ready for a wide scale 1.0 release. You can follow my progress via the 1.0 Track tag on this blog.

Read-only web interface
We should replicate Mercurial's functionality of serving up information about the Abundant database in a browser, either from a central server or from the local computer. This should be painless to do, and should allow the user to quickly browse the current database state.

Interactive web interface
Abundant is crippled until users can update issues without needing to first pull the source code down and then push it back up. As such, we need to expand the web interface to allow users to submit new issues.

Speed improvements
There is a notable startup delay at present, which is tolerable for now, but absolutely cannot continue. Standard commands, like init, new, and details must execute nearly instantly. More complex commands like list/tasks can be slower, but still cannot be as susceptible to slowdown as they are now.

Advanced workflows
Currently Abundant is fairly free-form in its workflow: issues can be created, changed, and closed by anyone. This is desirable for small projects, and remains a good default behavior; however, larger projects need greater control. For instance, it might be necessary for the person filing an issue to be the only one able to mark it resolved (user A files and assigns to user B, user B implements a fix and marks it as fixed, it gets reassigned to user A, user A checks that it is resolved and closes it). These more powerful workflows need to be as simple to set up as possible, and these design decisions have not been made yet.

Data Caching
As it is, Abundant does not cache any information, and has to poll each individual file in order to get necessary information, even something as simple as an issue’s title. We need to implement caching which is transparent and failsafe (the user doesn’t need to know it exists, and when it’s outdated or broken the system still runs, albeit more slowly) so that searches and lookups can run faster. Examples include getting the titles of related issues without having to load the whole issues, and constructing sorted data structures to filter the list command faster.
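As a hypothetical sketch of that transparent-fallback idea (the file layout and names here are illustrative, not Abundant's actual API):

import json
import os

def issue_title(db_dir, issue_id, load_full_issue):
    # Try a cached summary first; fall back to parsing the full issue file
    # whenever the cache is missing, stale, or unreadable.
    cache_path = os.path.join(db_dir, '.cache', 'titles.json')
    issue_path = os.path.join(db_dir, 'issues', issue_id)
    try:
        if os.path.getmtime(cache_path) >= os.path.getmtime(issue_path):
            with open(cache_path) as f:
                titles = json.load(f)
            if issue_id in titles:
                return titles[issue_id]
    except (OSError, ValueError):
        pass  # cache unavailable or corrupt: quietly take the slow path
    return load_full_issue(issue_id).title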

Unit and Regression Testing
Some rudimentary testing may have been done before graduation; however, if this project is to grow, it will be important to develop a standardized and stable testing suite to ensure the codebase does what is expected, and that future changes do not cause regressions that break previous functionality.

Improved User Interface
Notably when viewing the details of an issue, the current output format is difficult to read, and the fields appear in an order that isn’t necessarily desirable. We want to ensure the look and feel of the command line is professional.

Installation Scripts
It should be trivial to install Abundant, especially on Windows and Mac. Currently it isn’t terribly painful – pull the code, add the command to your path, run – but it isn’t the sort of installation flow users of these operating systems expect. It is OK for now to ask users to go through those steps, but for a large-scale release to be possible, we must have installation binaries, preferably built by an automated build script.

Friday, June 22, 2012

Abundant Thesis

I've been on about a year-long hiatus after graduating and moving across the country to Boston, but Abundant has still been on my mind, and in the past week or so I've really gotten back into it! If anyone's still reading this blog, hello! I realize I never got around to posting my final thesis, so before I go further, here it is. Check back soon for a couple more updates!

Abundant Thesis Report
Abundant Final Presentation

Unfortunately, Google apparently doesn't support uploading anything other than images and videos to Blogger, so the links above point to Scribd and could potentially break in the future. Hopefully not.