Tuesday, February 22, 2011

Prefix Lookup

Be sure to read my new post where I explore prefix lookup algorithm efficiency in more detail.

A central principle of Abundant (stemming from b, which took the idea and code from t) is the ability to look up data by searching for a unique prefix of that data. For instance, if an issue is given the unique but unpleasant to use ID of '2fd4e1c67a2d28fced849ee1bb76e7391b93eb12', a user can refer to it simply as '2fd' and Abundant will be able to identify that this is the issue they were referring to (assuming no other issue starts with the same prefix of course).

It is not immediately clear how this lookup should work, however. t uses some custom and fairly inefficient (though good enough for it's use case) code to generate the prefix lookup. You can browse t's source code here, __getitem__ on line 141 is how items are looked up by prefix, and _prefixes on line 77 generates a table mapping items to their shortest unique prefixes. Both functions are O(n) on the number of items in the list, or more precisely, O(n*m) where m is the length of the item (or prefix).

This isn't terrible, and scales perfectly well for t and b; if I recall correctly from testing run last summer, these functions handle searching several hundred thousand records with ease. But when it came to scaling to greater demands of Abundant, adding additional features started becoming expensive. Running an O(n) lookup once isn't bad, for instance, but when you need to do it several times, and restructure the database after each step, it quickly becomes apparent that there ought to be a more robust solution.

So I ran a search on Stack Overflow for 'prefix data structure' and found a question 'most efficient data structure for a read-only list of strings (about 100,000) with fast prefix search' tagged 'Python' which in turn pointed me to a data structure called a trie that was exactly what I was looking for.


A trie (pronounced try or tree) is a tree (you guessed that I hope :P ) structure which is organized by similar prefixes of entries.  From the root, children all share a comon prefix.  So the children of the root are grouped by their first character, the grandchildren by their second character, and so on.  There are implementation details that differ from implementation to implementation, but that is the conceptual premise of all tries.

Tries operate in O(m) time, where m is the length of the prefix being looked up. This is very fast, and often faster than the "O(1)" hash table (it's actually also O(m) due to the hash function, and collisions make it more expensive) or the O(log n) binary search tree. This makes them a drastically more desirable choice than the linear search methodology of b and t.

After exploring several Python implementations described or referenced in the SO question I decided to build my own, since each of the existing implementations were not quite what I was looking for.  Most notably, the "standard" trie stores the entire key, one character at a time, down the tree.  For sparse trees (that is to say, long keys, but short unique prefixes) this seemed fairly space/time inneficient, and so my trie only goes as far as it needs to, and restructures as necessary.  In order to facilitate ease of use, my trie converts all keys and searches to lowercase, which is potentially dangerous in some use cases, but desirable for Abundant so users can look up data without regard to case sensitivity.  Imagine if, due to some platform specific issue, an ID hash was stored with uppercase hex instead of lower case.  Users would never be able to find the issue, even though as far as hex is concerned they are legitimately entering the correct number.

Furthermore, my trie implements a notion of an alias, which is stored in the trie like any other value, but when retrieved does a dictionary lookup of the alias value and returns the result.  For example, users can specify both a username and an email address in angle brackets. It would reasonable for users to be able to reference each other by either a prefix of their name or their email address. As such, Abundant adds the user string (username <email@address>) to the user trie, and then makes an alias from the email address to the user string, enabling prefix lookup on either value, transparently to the user.

It's not quite a general case trie, but I've tried to keep these specific features like lowercase and alias abstracted from the underlying data structure, so that a more traditional trie data structure could be constructed fairly trivially. All in all, I'm very pleased with this new feature.

No comments:

Post a Comment