But when I started testing Abundant at scale (60,000 issues or so) I discovered some noticeable slowdowns in execution time -
ab add
would take upwards of 4 seconds to run, and the bottleneck was almost entirely in prefix lookup!
Trie's are very efficient at prefix lookup, taking
O(m)
time, where m
is the number of characters in the requested prefix. Not only is it fast, but its speed is independent of the number of items in the trie as well - perfect for scaling. But what I didn't take into account while researching was the construction cost of a trie, which as it turns out is substantial. Since Abundant is a command line utility, it doesn't have any persistent memory between executions, meaning we have to re-construct the Trie every time ab
is run.
Upon realizing the initialization cost was my bottleneck, I first looked into ways of persisting the structure between executions, such as storing the object in a pickle between executions or hooking into some persistent memory space behind the scene. None of these options were ultimately viable, however, as they introduced more problems than they solved (race conditions and stale data, in particular) and dramatically increased the complexity of Abundant.
So I backtracked, and started thinking about prefix lookup implementations that might not be as fast to query, but would be faster to construct, and I realized a simple binary search over a sorted list could be fairly painlessly converted into a prefix lookup algorithm. Standard binary search algorithms return an index, either of the (presumed) location of the object (Python's
bisect_left
) or the insertion point (bisect_right
) - one index later than the most comparable item. These two functions essentially enable prefix lookup, bisect_left
identifies the first match, bisect_right
identifies the index after the last match, and from there we can test whether we've found a result or not by comparing the two indices.
With this model, lookup slows down to
O(log n)
but that remains a perfectly viable complexity well past ~60,000. Furthermore, construction now simply requires sorting, which is O(n log n)
asymptotically, but is still quite fast for Python to handle. Even better, if we can ever ensure the list is pre-sorted, construction is O(1)
!
I re-implemented my prefix data structure as a binary search, and ran some benchmarks to compare their behavior. Before getting into the code, here are the results:
TRIE STRUCTURE build_from_list took 3.334000 build_from_adds took 4.164000 add_to took 0.103000 get took 0.594000 prefix took 0.515000 iter took 0.664000 BINARY SEARCH STRUCTURE build_from_list took 0.790000 build_from_adds took 20.589000 add_to took 1.631000 get took 1.302000 prefix took 1.128000 iter took 0.038000The
build_from_list
method takes an unsorted list of 250,000 SHA1 hashes and constructs a prefix structure. Even with a quarter of a million records, the binary search took less than a second to construct, whereas my trie takes for times as long. build_from_adds
constructs an empty object, then adds a similar list in a loop, one at a time. The O(n)
insertion behavior of the sorted list is apparent here, but I don't anticipate needing to add many items in a loop too often, and if I do, it will likely be simpler to just reconstruct the whole data structure.
add_to
adds 10,000 records onto our existing list, and again the trie wins here, but again, 10,000 items is much, much more than I anticipate ever needing to add on the fly to my prefix table.
get
looks up 10 arbitrary prefixes 10,000 times, and prefix
computes the unique prefix of a 10 items 10,000 times. The trie is approximately twice as fast as binary search on a quarter of a million records, but similarly this difference isn't going to be noticeable in production usage.
Finally, attempting to iterate over our structure with
iter
demonstrates that our trie is dramatically slower to iterate, since we have to access each node one at a time, rather than just stepping through a list.
Given that construction time is most critical for Abundant, and the other operations will only be done a few times per execution, a trie is quite obviously not the right way to do prefix lookup here.
You can see my current prefix implementation in Abundant's source code, and below is the code I used to create this test. Of course feedback is welcome.