Abundant: implementation

Showing posts with label implementation. Show all posts

Tuesday, April 19, 2011

Categorizing Metadata

One of Abundant's major benefits is the freedom it allows developers, not locking them down into any specific workflow or requirements, nor requiring them to configure anything beforehand.

However often it is very beneficial to configure an issue tracker to define certain metadata fields as only allowing certain options. For example, perhaps we want to track the severity of issues more formally, so we can know what needs to be dealt with first. If there is no structure to the severity field, this is impossible, as anything one could want to type would be allowed, and the lack of structure would be a hindrance.

And so the most recent feature improvement to Abundant allows you to specify in a config file allowable and default values for much of an Abundant issue's metadata, in order to control this. Future features include being able to sort and order based on these values, but this is lower priority than simply having the functionality there to utilize. Now larger projects can organize issues in a structured way - again without any requirement to do so being placed on individuals or groups interested in using the tool more loosely.

Tuesday, February 22, 2011

Prefix Lookup

Be sure to read my new post where I explore prefix lookup algorithm efficiency in more detail.

A central principle of Abundant (stemming from b, which took the idea and code from t) is the ability to look up data by searching for a unique prefix of that data. For instance, if an issue is given the unique but unpleasant to use ID of '2fd4e1c67a2d28fced849ee1bb76e7391b93eb12', a user can refer to it simply as '2fd' and Abundant will be able to identify that this is the issue they were referring to (assuming no other issue starts with the same prefix of course).

It is not immediately clear how this lookup should work, however. t uses some custom and fairly inefficient (though good enough for it's use case) code to generate the prefix lookup. You can browse t's source code here, __getitem__ on line 141 is how items are looked up by prefix, and _prefixes on line 77 generates a table mapping items to their shortest unique prefixes. Both functions are O(n) on the number of items in the list, or more precisely, O(n*m) where m is the length of the item (or prefix).

This isn't terrible, and scales perfectly well for t and b; if I recall correctly from testing run last summer, these functions handle searching several hundred thousand records with ease. But when it came to scaling to greater demands of Abundant, adding additional features started becoming expensive. Running an O(n) lookup once isn't bad, for instance, but when you need to do it several times, and restructure the database after each step, it quickly becomes apparent that there ought to be a more robust solution.

So I ran a search on Stack Overflow for 'prefix data structure' and found a question 'most efficient data structure for a read-only list of strings (about 100,000) with fast prefix search' tagged 'Python' which in turn pointed me to a data structure called a trie that was exactly what I was looking for.

A trie (pronounced try or tree) is a tree (you guessed that I hope :P ) structure which is organized by similar prefixes of entries. From the root, children all share a comon prefix. So the children of the root are grouped by their first character, the grandchildren by their second character, and so on. There are implementation details that differ from implementation to implementation, but that is the conceptual premise of all tries.

Tries operate in O(m) time, where m is the length of the prefix being looked up. This is very fast, and often faster than the "O(1)" hash table (it's actually also O(m) due to the hash function, and collisions make it more expensive) or the O(log n) binary search tree. This makes them a drastically more desirable choice than the linear search methodology of b and t.

After exploring several Python implementations described or referenced in the SO question I decided to build my own, since each of the existing implementations were not quite what I was looking for. Most notably, the "standard" trie stores the entire key, one character at a time, down the tree. For sparse trees (that is to say, long keys, but short unique prefixes) this seemed fairly space/time inneficient, and so my trie only goes as far as it needs to, and restructures as necessary. In order to facilitate ease of use, my trie converts all keys and searches to lowercase, which is potentially dangerous in some use cases, but desirable for Abundant so users can look up data without regard to case sensitivity. Imagine if, due to some platform specific issue, an ID hash was stored with uppercase hex instead of lower case. Users would never be able to find the issue, even though as far as hex is concerned they are legitimately entering the correct number.

Furthermore, my trie implements a notion of an alias, which is stored in the trie like any other value, but when retrieved does a dictionary lookup of the alias value and returns the result. For example, users can specify both a username and an email address in angle brackets. It would reasonable for users to be able to reference each other by either a prefix of their name or their email address. As such, Abundant adds the user string (username <email@address>) to the user trie, and then makes an alias from the email address to the user string, enabling prefix lookup on either value, transparently to the user.

It's not quite a general case trie, but I've tried to keep these specific features like lowercase and alias abstracted from the underlying data structure, so that a more traditional trie data structure could be constructed fairly trivially. All in all, I'm very pleased with this new feature.

Thursday, February 17, 2011

Mutable Types as Default Parameters in Python

I just came across a little feature/bug in Python I thought worth sharing.

In Python, you can set default values for parameters, like so:

def func(param="hello"):

The result of which is a function you can call either with or without a parameter. This seems simple enough, and works very nicely. However what if you want to use something more complex as the default parameter? Say, a list?

def func(param=[]):

This works just fine, exactly like the previous example. But one thing that may or may not immediately be obvious is every call to this function without an explicit parameter will share the same list. Python is smart enough not to re-parse the default parameters ever time the function is called, meaning that only one empty list is ever created in defining or using func(). Usually this has no effect on the programmer - if your default values are primitive or immutable types, which the vast majority of default values likely are, there is no loss to the function sharing the value between different calls - it won't be changed. And even when mutable types are used, it's rare to gain any benefit from changing a mutable data structure the rest of the program cannot access, and as such most cases of mutable types as default parameters are used simply for accessing, not mutating, the data.

But one example where that is not the case is in a constructor. Consider:

class BlogPost:

    def __init__(comments=[]):

When constructing a BlogPost object, I don't want to take the time to explicitly tell it there are no comments, so I would usually not even bother populating that parameter. On its own, this isn't enough to get me into trouble. Neither is creating multiple BlogPosts, nor is accessing, changing, or otherwise working with a BlogPost dangerous. And so you could go quite some time developing before you find yourself in a situation where you have constructed multiple BlogPost objects and where you update the comments array of one of them. As soon as you do that, however, all other existing BlogPost objects will also have the same comment in their comments list, because it's actually the same list!

Once you think about it this behavior is not terribly shocking - you might even go so far as to say it's intuitive. However if you aren't thinking about it, it's a very bizarre problem to debug when it happens to you.

Tuesday, February 15, 2011

Command Line Processing

When running programs from the command line, they are able to take a list of string arguments (the String[] args in Java's main method, for example) to control how they work. The only way to tell these arguments apart, however, is to look at their position in the list. This works for some programs, but breaks down quickly for complex tasks. Easy examples can be found in any standard command line programs. How would you use a command like grep to search through files and then display some context around the matches. Putting "context" as an argument wouldn't work, how would it know if "context" is a special parameter or a filename?

So the notion of options were invented. The idea being, a special flag (traditionally - on Unix, / on Windows) is placed at the begining of an argument, and the remainder of the argument has special meaning. -C in grep is used specially to indicate context. Importantly, whatever argument follows -C is also not considered an argument, but the parameter to the context option (in this case, a number of lines to display is expected).

This fairly simple syntax is all that the vast majority of command line programs need to function. It's all Mercurial needs, and it's presumably all Abundant will need also. But there is neither any one standard way to actually structure this interface (some allow + or other symbols as flags, some allow options that optionally take arguments, some use -FLAG=ARG rather than -FLAG ARG, etc.) nor any straight forward way to implement any of these designs. Python alone has three different libraries that all do the same thing, and they don't even provide the full range of features, so some projects still wouldn't be able to use them.

And so far we haven't even addressed how these arguments and options actually hook into executing code. The available libraries usually work fairly well for "light" programs which do exactly one task (admittedly, this is the model Unix is built on, so this isn't completely unreasonable) such as cp, grep, or ping, however they break down very quickly with larger programs like Mercurial or Abundant, where depending on which operation you wish to do different options should or should not be allowed. This seemingly reasonable functionality is all but completely missing from most argument processing libraries, and as such, needs to be built manually.

Mercurial built it's processor on top of optget, I am using the more powerful optparse for Abundant. Besides the underlying library, the code design is fairly similar. A function is defined for each atomic operation that can be run in Abundant. Then a table (dictionary, hash map, whatever you want to call it) is constructed consisting of the name of the task, like init, new or edit, which maps to a data structure containing the correlating function, a list of options the command will take, and a string describing the expected structure of the command. The program parses the input arguments, looks up the first parameter against the table, and if it finds a match, passes the remaining arguments to a parser library given the option structure described in the table, the result of parsing the input into options and arguments is then finally passed to the function to do its work.

It's a tedious and error prone process. Error catching occurs at three levels, before the parser, in the parser, and in the function. Many of these error checks need to be redundant between levels, and the end result is a very unpleasant piece of spaghetti code. Shockingly, however, it's better than most of the alternatives out there that I have seen. It scales easily (adding new commands and editing old ones has little to no effect on other commands), allows for unique options per task, and the end result will be quite nice, as Mercurial's UI demonstrates. Getting there is a fairly painful process sadly.

Pages