Saturday, April 30, 2016

Fast File Concatenation in Python

Ed. 2016: I never published this post; it's probably incomplete but seeing as I have no recollection of what I'd intended to add, I'm just publishing what's here now.

In order to hook into the underlying version control for history tracking and distribution, Abundant stores each issue in a separate file. This is generally advantageous, but it runs headfirst into trouble when attempting operations on the database as a whole, like listing or querying. With each issue in a separate file, Abundant has to iterate over every file in turn, necessitating a disk seek between each one. There are many possible steps we could take to minimize query time, such as loading our issues into a database like SQLite, but if something simpler is good enough, we can avoid that sort of underlying complexity (ensuring the db is up to date with the actual issue files, and efficiently updating it when it isn't, is far from trivial).

O(n) query time isn't too terrible, but avoiding disk seeks could potentially net dramatic improvements. So what if we concatenated all the files into one big document we could read in one seek? Sounds like a potentially huge improvement!

First I threw together a quick test just to verify my initial assumption. This Python file loads 50,000 JSON files into dicts one by one, then concatenates all the files into one, and finally loads the JSON objects back into dicts reading from the single file instead of the directory. The results are what you might expect, but still impressive.

read_files took 12.892000
concatenate took 4.496000
read_file took 0.939000
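The test was roughly of this shape (a reconstruction for illustration; the directory layout, file count, and function names here are assumptions, and the real test used 50,000 files):

```python
import json
import os
import tempfile
import time

# Build a small sample database (the real test used 50,000 files).
data_dir = tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(data_dir, "%05d.json" % i), "w") as f:
        json.dump({"id": i, "title": "issue %d" % i}, f)

cache_file = os.path.join(data_dir, "all.cache")

def read_files():
    """Load every JSON file individually - one open and seek per issue."""
    issues = []
    for name in sorted(os.listdir(data_dir)):
        if name.endswith(".json"):
            with open(os.path.join(data_dir, name)) as f:
                issues.append(json.load(f))
    return issues

def concatenate():
    """Write every issue into a single cache file, one JSON object per line."""
    with open(cache_file, "w") as out:
        for name in sorted(os.listdir(data_dir)):
            if name.endswith(".json"):
                with open(os.path.join(data_dir, name)) as f:
                    out.write(f.read().replace("\n", "") + "\n")

def read_file():
    """Load all issues back with a single sequential read."""
    with open(cache_file) as f:
        return [json.loads(line) for line in f]

for func in (read_files, concatenate, read_file):
    start = time.time()
    func()
    print("%s took %f" % (func.__name__, time.time() - start))
```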

Eliminating the disk seeks improves our load time more than twelve-fold, and even the full concatenate-then-read pass is significantly faster than reading the files one by one. So now that we've established that caching our JSON files together is hugely beneficial, there are two more questions to answer: 1) what's the best (fastest, most stable) way to create a cache file, and 2) how can we efficiently detect when the cache has been invalidated?

Fast File Concatenation

To identify a "best" concatenation method, I tested three different ways of reading a set of files in Python: file.read(), file.readlines(), and shutil.copyfileobj(). I also wanted to compare this against letting the OS do the heavy lifting with cat or another program, so I tested type and copy on Windows, and cat and xargs cat on Linux. Additionally, on Windows I tested both a native Python install via cmd.exe and a Cygwin instance, and on Linux I timed the same behavior with the file cache cleared (look for /proc/sys/vm/drop_caches).
The first four columns are Windows 7 (i7 3.4GHz, 16GB); the last four are Ubuntu 11.10 (same hardware). All times are in seconds.

Method                  | HD - cmd | HD - Cygwin | SSD - cmd | SSD - Cygwin | HD       | SSD      | HD - cold cache | SSD - cold cache
------------------------|----------|-------------|-----------|--------------|----------|----------|-----------------|-----------------
Load All Files          | 10.851   | 14.876851   | 16.696    | 17.167982    | 1.851240 | 1.602277 | 339.545958      | 15.793162
Concatenate (Readlines) |  4.980   |  8.296474   |  4.866    |  7.749443    | 0.618610 | 0.586936 | 338.830646      | 14.245564
Load Concat'ed File     |  0.538   |  3.946225   |  0.985    |  7.134408    | 0.829064 | 0.832572 |   0.854499      |  0.848297
Python: ShUtil          |  4.795   |  8.552490   |  4.304    |  7.918453    | 0.598918 | 0.608400 | 338.379066      | 14.215956
Python: Readlines       |  5.011   |  8.381479   |  4.900    |  7.959456    | 0.597912 | 0.605008 | 339.995610      | 14.182779
Python: Read            |  8.927   | 10.070576   | 15.003    |  9.380537    | 0.802220 | 0.761661 | 339.974461      | 14.460762
Shell: Copy / Cat       |  5.679   |  6.775388   |  7.010    |  5.899337    | FAILED   | FAILED   | FAILED          | FAILED
Shell: Type / Xargs     |  3.448   |  8.527488   |  7.411    |  6.279359    | 0.343088 | 0.401653 | 332.400320      | 12.669442

Timing methods of concatenating 50,000 JSON files

We can see immediately that cat failed completely on Linux, something I didn't expect. The exact error message is /bin/sh: /bin/cat: Argument list too long indicating that there's a limit to the number of files cat can handle in one go. Using Xargs gets around this issue, at the cost of some speed.
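The pure-Python methods differ only in how each source file is copied into the destination. The shutil variant, for instance, looks roughly like this (a sketch using a few tiny sample files, not the exact benchmark code):

```python
import os
import shutil
import tempfile

# A few tiny sample files standing in for the 50,000 real ones.
tmp = tempfile.mkdtemp()
srcs = []
for i in range(3):
    path = os.path.join(tmp, "%d.json" % i)
    with open(path, "w") as f:
        f.write('{"id": %d}\n' % i)
    srcs.append(path)

dest_path = os.path.join(tmp, "combined.json")
with open(dest_path, "wb") as dest:
    for src in srcs:
        with open(src, "rb") as f:
            shutil.copyfileobj(f, dest)  # buffered binary copy, no per-line work
```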

Saturday, June 23, 2012

Revisiting Fast Prefix Lookup

Abundant relies heavily on letting the user specify arguments via unique prefixes, rather than needing to type out an entire string (issue IDs in particular are 40 random characters - that would be a hassle). As such, I previously looked into efficient prefix lookup data structures, and implemented what I thought would be the best choice.

But when I started testing Abundant at scale (60,000 issues or so) I discovered some noticeable slowdowns in execution time - ab add would take upwards of 4 seconds to run, and the bottleneck was almost entirely in prefix lookup!

Tries are very efficient at prefix lookup, taking O(m) time, where m is the number of characters in the requested prefix. Not only is that fast, but it's independent of the number of items in the trie - perfect for scaling. What I didn't take into account while researching, however, was the construction cost of a trie, which as it turns out is substantial. Since Abundant is a command line utility, it doesn't have any persistent memory between executions, meaning we have to re-construct the trie every time ab is run.

Upon realizing the initialization cost was my bottleneck, I first looked into ways of persisting the structure between executions, such as storing the object in a pickle or hooking into some persistent memory space behind the scenes. None of these options was ultimately viable, however, as they introduced more problems than they solved (race conditions and stale data, in particular) and dramatically increased the complexity of Abundant.

So I backtracked and started thinking about prefix lookup implementations that might not be as fast to query, but would be faster to construct, and I realized a simple binary search over a sorted list could fairly painlessly be converted into a prefix lookup algorithm. Standard binary search algorithms return an index: either the (presumed) location of the object (Python's bisect_left) or the insertion point one index past the last comparable item (bisect_right). Together these two functions essentially enable prefix lookup: bisect_left identifies the first possible match, bisect_right identifies the index after the last possible match, and from there we can test whether we've found a unique result by comparing the two indices.

With this model, lookup slows down to O(log n) but that remains a perfectly viable complexity well past ~60,000. Furthermore, construction now simply requires sorting, which is O(n log n) asymptotically, but is still quite fast for Python to handle. Even better, if we can ever ensure the list is pre-sorted, construction is O(1)!
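A minimal sketch of the idea (illustrative only - my actual implementation lives in Abundant's source code, and this version bounds the matching range by bumping the last character of the prefix rather than calling bisect_right directly):

```python
import bisect

def prefix_lookup(sorted_items, prefix):
    """Return the unique item starting with prefix, or raise KeyError."""
    lo = bisect.bisect_left(sorted_items, prefix)
    # Incrementing the last character of the prefix gives an exclusive
    # upper bound for everything that starts with the prefix.
    hi = bisect.bisect_left(sorted_items, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    if lo == hi:
        raise KeyError("no item matches %r" % prefix)
    if hi - lo > 1:
        raise KeyError("prefix %r is ambiguous" % prefix)
    return sorted_items[lo]

ids = sorted(["2fd4e1c6", "2fe71b93", "da39a3ee"])
print(prefix_lookup(ids, "2fd"))  # -> 2fd4e1c6
```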

I re-implemented my prefix data structure as a binary search, and ran some benchmarks to compare their behavior. Before getting into the code, here are the results (the trie first, then the binary search):

Trie:
build_from_list took 3.334000
build_from_adds took 4.164000
add_to took 0.103000
get took 0.594000
prefix took 0.515000
iter took 0.664000

Binary search:
build_from_list took 0.790000
build_from_adds took 20.589000
add_to took 1.631000
get took 1.302000
prefix took 1.128000
iter took 0.038000
The build_from_list method takes an unsorted list of 250,000 SHA1 hashes and constructs a prefix structure. Even with a quarter of a million records, the binary search took less than a second to construct, whereas my trie takes four times as long. build_from_adds constructs an empty object, then adds a similar list in a loop, one at a time. The O(n) insertion behavior of the sorted list is apparent here, but I don't anticipate needing to add many items in a loop very often, and if I do, it will likely be simpler to just reconstruct the whole data structure.

add_to adds 10,000 records onto our existing list, and again the trie wins here, but again, 10,000 items is much, much more than I anticipate ever needing to add on the fly to my prefix table.

get looks up 10 arbitrary prefixes 10,000 times, and prefix computes the unique prefix of 10 items 10,000 times. The trie is approximately twice as fast as binary search on a quarter of a million records, but similarly this difference isn't going to be noticeable in production usage.

Finally, attempting to iterate over our structure with iter demonstrates that our trie is dramatically slower to iterate, since we have to access each node one at a time, rather than just stepping through a list.

Given that construction time is most critical for Abundant, and the other operations will only be done a few times per execution, a trie is quite obviously not the right way to do prefix lookup here.
You can see my current prefix implementation in Abundant's source code, and below is the code I used to create this test. Of course feedback is welcome.

What's Next

My Thesis included the "release" of Abundant 0.3, which was somewhat minimal, but usable. In my thesis, I identified several specific missing or incomplete features that would be necessary for a 1.0 release. I'll enumerate them here, and try to create a blog post (or several) as I resolve them.

Each of these goals correlates to an increment of the minor version number, and accomplishing all of these goals would effectively indicate Abundant is ready for a wide scale 1.0 release. You can follow my progress via the 1.0 Track tag on this blog.

Read-only web interface
We should replicate the functionality of Mercurial to serve up information on the Abundant database in a browser, either from a central server or from the local computer. This should be painless to do, and allow the user to quickly browse the current database state.

Interactive web interface
Abundant is crippled until users can update issues without needing to first pull the source code down and then push it back up. As such, we need to expand the web interface to allow users to submit new issues.

Speed improvements
There is a notable startup delay at present, which is tolerable for now but absolutely cannot continue. Standard commands like init, new, and details must execute nearly instantly. More complex commands like list/tasks can be slower, but still cannot be as susceptible to slowdown as they are now.

Advanced workflows
Currently Abundant is fairly free-form in its workflow: issues can be created, changed, and closed by anyone. This is desirable for small projects, and remains a good default behavior; however, larger projects need greater control. For instance, it might be necessary for the person filing an issue to be the only one able to mark it resolved (user A files and assigns to user B, user B implements a fix and marks it as fixed, it gets reassigned to user A, user A checks that it is resolved and closes it). These more powerful workflows need to be as simple to set up as possible, and these design decisions have not been made yet.

Data Caching
As it is, Abundant does not cache any information, and has to poll each individual file in order to get necessary information, including something as simple as the issue's title. We need to implement caching which is transparent and failsafe (the user doesn't need to know it exists, and when it's outdated or broken, the system still runs, albeit slower) so that searches and lookups can be run faster. Examples include getting titles of related issues without having to load whole issues, and constructing sorted data structures to filter the list command faster.

Unit and Regression Testing
There may be some rudimentary testing done before graduation; however, if this project is to grow, it will be important that a standardized and stable testing suite is developed to ensure the codebase does what is expected, and that future changes do not cause regressions that break previous functionality.

Improved User Interface
Notably when viewing the details of an issue, the output is currently in a format that is difficult to read, and in an order that isn't necessarily desirable. We want to ensure the look and feel of the command line is professional.

Installation Scripts
It should be trivial, especially on Windows / Mac, to install Abundant. Currently it isn't terribly painful – pull code, add command to path, run – but it isn't the sort of installation flow users of these operating systems expect. It is ok for now to ask users to go through those steps, but for a large scale release to be possible, we must have installation binaries, preferably built by an automatic build script.

Friday, June 22, 2012

Abundant Thesis

I've been on about a year's hiatus after graduating and moving across the country to Boston, but Abundant has still been on my mind, and in the past week or so I've really gotten back into it! If anyone's still reading this blog, hello! I realize I never got around to posting my final thesis, so before I go further, here they are: my Abundant Thesis Report and my Abundant Final Presentation. Unfortunately, Google doesn't support uploading anything other than images and videos to Blogger, so the links above point to Scribd and could potentially break in the future. Hopefully not.

Wednesday, April 20, 2011

Assign Me A Bug

At 2:15 the morning of a major presentation, I realize a very small feature is currently missing from Abundant.

It is currently not possible to assign an issue to a user.

Let's see how quickly I can fix this.

Update: Ten minutes, including testing.

Tuesday, April 19, 2011

Categorizing Metadata

One of Abundant's major benefits is the freedom it allows developers, not locking them down into any specific workflow or requirements, nor requiring them to configure anything beforehand.

However often it is very beneficial to configure an issue tracker to define certain metadata fields as only allowing certain options.  For example, perhaps we want to track the severity of issues more formally, so we can know what needs to be dealt with first.  If there is no structure to the severity field, this is impossible, as anything one could want to type would be allowed, and the lack of structure would be a hindrance.

And so the most recent feature improvement to Abundant allows you to specify, in a config file, allowable and default values for much of an Abundant issue's metadata. Future features include being able to sort and order based on these values, but this is lower priority than simply having the functionality there to utilize. Now larger projects can organize issues in a structured way - again without any requirement being placed on individuals or groups interested in using the tool more loosely.

Tuesday, March 29, 2011

Version 0.1

Today I marked changeset b82b778f2bad as version 0.1.  At that point, Abundant was largely, but not fully, functional.  Notably the ability to mark issues resolved was missing at the time, however that was implemented surprisingly easily in a matter of just a few minutes.

My next step is to begin parsing and using configuration files, which I hope to have completed by Friday so that Abundant can be distributed to people willing to test it out and provide feedback.

Wednesday, February 23, 2011

Atlassian, one of the biggest developers of proprietary tools for software developers, on DVCS, Mercurial, and version control.

DVCS systems in the Enterprise

Tuesday, February 22, 2011

Prefix Lookup

Be sure to read my new post where I explore prefix lookup algorithm efficiency in more detail.

A central principle of Abundant (stemming from b, which took the idea and code from t) is the ability to look up data by searching for a unique prefix of that data. For instance, if an issue is given the unique but unpleasant to use ID of '2fd4e1c67a2d28fced849ee1bb76e7391b93eb12', a user can refer to it simply as '2fd' and Abundant will be able to identify that this is the issue they were referring to (assuming no other issue starts with the same prefix of course).

It is not immediately clear how this lookup should work, however. t uses some custom and fairly inefficient (though good enough for its use case) code to generate the prefix lookup. You can browse t's source code here; __getitem__ on line 141 is how items are looked up by prefix, and _prefixes on line 77 generates a table mapping items to their shortest unique prefixes. Both functions are O(n) on the number of items in the list, or more precisely O(n*m), where m is the length of the item (or prefix).
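In essence, the linear version is just a scan over every item (a paraphrase of the approach, not t's actual code):

```python
def getitem(items, prefix):
    """O(n*m) prefix lookup: compare every item against the prefix."""
    matches = [item for item in items if item.startswith(prefix)]
    if len(matches) != 1:
        raise KeyError("prefix %r is not unique" % prefix)
    return matches[0]

print(getitem(["2fd4e1c6", "2fe71b93"], "2fd"))  # -> 2fd4e1c6
```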

This isn't terrible, and scales perfectly well for t and b; if I recall correctly from testing run last summer, these functions handle searching several hundred thousand records with ease. But when it came to scaling to greater demands of Abundant, adding additional features started becoming expensive. Running an O(n) lookup once isn't bad, for instance, but when you need to do it several times, and restructure the database after each step, it quickly becomes apparent that there ought to be a more robust solution.

So I ran a search on Stack Overflow for 'prefix data structure' and found a question 'most efficient data structure for a read-only list of strings (about 100,000) with fast prefix search' tagged 'Python' which in turn pointed me to a data structure called a trie that was exactly what I was looking for.

A trie (pronounced "try" or "tree") is a tree (you guessed that I hope :P ) structure which is organized by shared prefixes of entries.  From the root, children all share a common prefix: the children of the root are grouped by their first character, the grandchildren by their second character, and so on.  Implementation details differ from implementation to implementation, but that is the conceptual premise of all tries.

Tries operate in O(m) time, where m is the length of the prefix being looked up. This is very fast, and often faster than the "O(1)" hash table (it's actually also O(m) due to the hash function, and collisions make it more expensive) or the O(log n) binary search tree. This makes them a drastically more desirable choice than the linear search methodology of b and t.
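As a sketch of the concept (a bare-bones trie, not Abundant's actual implementation), insertion walks the key character by character, and lookup walks the prefix the same way:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.value = None    # set when a full key terminates here

def insert(root, key):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.value = key

def lookup(root, prefix):
    node = root
    for ch in prefix:              # the walk itself is O(m)
        if ch not in node.children:
            raise KeyError("no key matches %r" % prefix)
        node = node.children[ch]
    # Confirming uniqueness visits the subtree below the prefix;
    # a production trie would cache counts to avoid this.
    matches, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.value is not None:
            matches.append(n.value)
        stack.extend(n.children.values())
    if len(matches) != 1:
        raise KeyError("prefix %r is ambiguous" % prefix)
    return matches[0]

root = TrieNode()
for key in ("2fd4e1c6", "2fe71b93", "da39a3ee"):
    insert(root, key)
print(lookup(root, "2fd"))  # -> 2fd4e1c6
```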

After exploring several Python implementations described or referenced in the SO question, I decided to build my own, since none of the existing implementations was quite what I was looking for.  Most notably, the "standard" trie stores the entire key, one character at a time, down the tree.  For sparse trees (that is to say, long keys but short unique prefixes) this seemed fairly space/time inefficient, and so my trie only goes as far as it needs to, and restructures as necessary.  In order to facilitate ease of use, my trie converts all keys and searches to lowercase, which is potentially dangerous in some use cases, but desirable for Abundant so users can look up data without regard to case sensitivity.  Imagine if, due to some platform specific issue, an ID hash was stored with uppercase hex instead of lowercase.  Users would never be able to find the issue, even though as far as hex is concerned they were legitimately entering the correct number.

Furthermore, my trie implements a notion of an alias, which is stored in the trie like any other value, but when retrieved does a dictionary lookup of the alias value and returns the result.  For example, users can specify both a username and an email address in angle brackets.  It would be reasonable for users to be able to reference each other by either a prefix of their name or their email address.  As such, Abundant adds the user string (username <email@address>) to the user trie, and then makes an alias from the email address to the user string, enabling prefix lookup on either value, transparently to the user.

It's not quite a general case trie, but I've tried to keep these specific features like lowercase and alias abstracted from the underlying data structure, so that a more traditional trie data structure could be constructed fairly trivially. All in all, I'm very pleased with this new feature.

Thursday, February 17, 2011

Mutable Types as Default Parameters in Python

I just came across a little feature/bug in Python I thought worth sharing.

In Python, you can set default values for parameters, like so:

def func(param="hello"):

The result of which is a function you can call either with or without a parameter.  This seems simple enough, and works very nicely.  However what if you want to use something more complex as the default parameter?  Say, a list?

def func(param=[]):

This works just fine, exactly like the previous example.  But one thing that may not immediately be obvious is that every call to this function without an explicit parameter will share the same list.  Python does not re-evaluate the default parameters every time the function is called, meaning that only one empty list is ever created in defining or using func().  Usually this has no effect on the programmer - if your default values are primitive or immutable types, which the vast majority of default values likely are, there is no loss in the function sharing the value between different calls - it won't be changed.  And even when mutable types are used, it's rare to gain any benefit from changing a mutable data structure the rest of the program cannot access, and as such most cases of mutable types as default parameters are used simply for accessing, not mutating, the data.

But one example where that is not the case is in a constructor.  Consider:

class BlogPost:
    def __init__(self, comments=[]):

When constructing a BlogPost object, I don't want to take the time to explicitly tell it there are no comments, so I would usually not even bother populating that parameter. On its own, this isn't enough to get me into trouble. Neither is creating multiple BlogPosts, nor is accessing, changing, or otherwise working with a BlogPost dangerous. And so you could go quite some time developing before you find yourself in a situation where you have constructed multiple BlogPost objects and where you update the comments array of one of them. As soon as you do that, however, all other existing BlogPost objects will also have the same comment in their comments list, because it's actually the same list!

Once you think about it this behavior is not terribly shocking - you might even go so far as to say it's intuitive. However if you aren't thinking about it, it's a very bizarre problem to debug when it happens to you.
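The usual fix is to default to None and create a fresh list inside the constructor:

```python
class BlogPost:
    def __init__(self, comments=None):
        # A new list is created per instance, so no two posts share one.
        self.comments = comments if comments is not None else []

a, b = BlogPost(), BlogPost()
a.comments.append("first!")
print(b.comments)  # -> []
```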

Tuesday, February 15, 2011

Command Line Processing

When running programs from the command line, they can take a list of string arguments (the String[] args in Java's main method, for example) to control how they work.  The only way to tell these arguments apart, however, is to look at their position in the list.  This works for some programs, but breaks down quickly for complex tasks.  Easy examples can be found in any standard command line program.  How would you use a command like grep to search through files and then display some context around the matches?  Putting "context" as an argument wouldn't work - how would grep know whether "context" is a special parameter or a filename?

So the notion of options was invented.  The idea is that a special flag (traditionally - on Unix, / on Windows) is placed at the beginning of an argument, and the remainder of the argument has special meaning.  -C in grep is used to indicate context.  Importantly, whatever argument follows -C is also not considered a positional argument, but the parameter to the context option (in this case, a number of lines to display is expected).

This fairly simple syntax is all that the vast majority of command line programs need to function.  It's all Mercurial needs, and it's presumably all Abundant will need also.  But there is neither any one standard way to structure this interface (some allow + or other symbols as flags, some allow options that optionally take arguments, some use -FLAG=ARG rather than -FLAG ARG, etc.) nor any straightforward way to implement any of these designs.  Python alone has three different libraries that all do the same thing, and they don't even provide the full range of features, so some projects still wouldn't be able to use them.

And so far we haven't even addressed how these arguments and options actually hook into executing code.  The available libraries usually work fairly well for "light" programs which do exactly one task (admittedly, this is the model Unix is built on, so this isn't completely unreasonable), such as cp, grep, or ping; however, they break down very quickly with larger programs like Mercurial or Abundant, where depending on which operation you wish to do, different options should or should not be allowed.  This seemingly reasonable functionality is all but completely missing from most argument processing libraries, and as such needs to be built manually.

Mercurial built its processor on top of getopt; I am using the more powerful optparse for Abundant.  Besides the underlying library, the code design is fairly similar.  A function is defined for each atomic operation that can be run in Abundant.  Then a table (dictionary, hash map, whatever you want to call it) is constructed mapping the name of each task, like init, new, or edit, to a data structure containing the correlating function, a list of options the command will take, and a string describing the expected structure of the command.  The program parses the input arguments, looks up the first parameter in the table, and if it finds a match, passes the remaining arguments to a parser given the option structure described in the table; the result of parsing the input into options and arguments is then finally passed to the function to do its work.
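A sketch of that table-driven design (the command names, option flags, and return strings here are illustrative, not Abundant's actual table):

```python
import optparse

def cmd_init(opts, args):
    return "initialized%s" % (" (verbose)" if opts.verbose else "")

def cmd_new(opts, args):
    return "created issue: %s" % " ".join(args)

# command name -> (function, list of options, usage string)
table = {
    "init": (cmd_init,
             [optparse.make_option("-v", "--verbose", action="store_true")],
             "init [-v]"),
    "new":  (cmd_new, [], "new TITLE"),
}

def dispatch(argv):
    name, rest = argv[0], argv[1:]
    if name not in table:
        raise SystemExit("unknown command: %s" % name)
    func, options, usage = table[name]
    # Each command gets its own parser, built from its own option list.
    parser = optparse.OptionParser(usage=usage, option_list=options)
    opts, args = parser.parse_args(rest)
    return func(opts, args)

print(dispatch(["new", "fix", "the", "crash"]))  # -> created issue: fix the crash
```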

It's a tedious and error prone process.  Error catching occurs at three levels: before the parser, in the parser, and in the function.  Many of these error checks need to be redundant between levels, and the end result is a very unpleasant piece of spaghetti code.  Shockingly, however, it's better than most of the alternatives out there that I have seen.  It scales easily (adding new commands and editing old ones has little to no effect on other commands), allows for unique options per task, and the end result will be quite nice, as Mercurial's UI demonstrates.  Getting there is a fairly painful process, sadly.

Friday, February 4, 2011

Python Documentation

I had an interesting time understanding how to use Python's json library, which is very sparsely documented and completely lacks examples.  I found a page on using json in Python which was very helpful, but I did not read the code examples quite deeply enough, and the annotations were limited.

I wanted to be able to take a "bug" object and turn it into JSON data, and then take that JSON data and reconstruct the bug.  The example is for the generic case of serializing a class, and therefore stores metadata about the class, like its name, module, and package, along with the other data.  I wanted to avoid that if possible, and so stripped out the dynamic class construction and the like, and effectively wrote a pair of functions that took a dictionary and turned it into an object, and took an object and turned it into a dictionary.

However, I tried making one of the bug's instance variables a dictionary as well, and the program crashed when converting from JSON to Python object, seeming to have dropped the surrounding dictionary, and only considering the internal dictionary.  This seemed to me like completely bizarre behavior, and took a fair bit of poking around, including multiple stack overflow questions and spending time on #python (a truly painful process) in order to understand completely.

The specific problem was the object_hook parameter of json's load function.  The description of this parameter in the documentation is:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders (e.g. JSON-RPC class hinting).
If you read this closely, you may see what I was missing.  Remember that I was coming from the perspective of needing to serialize exactly one object, whose contents would all be data structures (lists or dicts) or primitives.  It hadn't really crossed my mind that an object could contain other objects that would need to be serialized.  Silly in retrospect, of course, but not immediately intuitive if you aren't thinking about it.

So what was happening when I was seeing only the internal dict was exactly what was supposed to happen - json was applying the object_hook function to each nested dict in turn.  This means you cannot (nor would you want to) use object_hook to handle only one type of object; it needs to at least potentially handle any possible object that could be thrown at it.  I ended up implementing a much cleaner way of (un)serializing bugs once I understood exactly what object_hook is doing - which I'll claim isn't completely clear from the language used in the Python docs.
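A small demonstration of that nesting behavior (the Issue class and field names here are hypothetical, not Abundant's actual bug class):

```python
import json

class Issue:
    def __init__(self, data):
        self.data = data

def hook(d):
    # json calls this for EVERY decoded object, innermost first,
    # so it must cope with nested dicts that aren't issues at all.
    if "title" in d:
        return Issue(d)
    return d

text = '{"title": "crash on save", "creator": {"name": "me"}}'
issue = json.loads(text, object_hook=hook)
print(type(issue).__name__)           # -> Issue
print(issue.data["creator"]["name"])  # -> me
```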

Wednesday, February 2, 2011

Development Documentation

In addition to version control (which we've got) and bug tracking, which of course we're making, the third most common tool used when developing large projects is some sort of easy to maintain documentation system - usually a wiki.  Therefore I have decided to set up a wiki to maintain a high level overview of the development of Abundant.  I think this would be a very good idea for all of our projects, as maintaining documentation as we develop our code will help improve the quality of our work.

I have decided to use TiddlyWiki for Abundant.  There are several reasons:

  • Easy to set up
    The entire wiki consists of one HTML file and one JAR; there are no installation or system requirements.  Furthermore, setup and configuration is a breeze - with less than half an hour's work I had customized my TiddlyWiki to a desirable structure for software development.
  • Compatible with version control
    The wiki is tracked by Mercurial and located right next to the source code.  This means this documentation always goes with the code, and tracking and merging changes works just like any other file.
  • Lightweight bug tracker
    Although Abundant has far loftier goals than TiddlyWiki can ultimately fulfill, it will make a perfectly suitable issue tracker until Abundant is ready for dogfooding.  Similarly, this is one of the biggest reasons I think others should look into using TiddlyWiki or some other documentation tool, as it will provide for some very nice issue tracking as they develop their project.
Additionally, since TiddlyWiki is just an HTML file, you can view and browse it straight from the source code.  If you are interested in playing with TiddlyWiki as a development documenter, you may like to download the first version of the wiki which is structured with development in mind and an issue tracker page created.

Note: Presently, the links to the Abundant repository are password protected.  Anyone reading this blog is welcome to come talk to me and I'll tell you the password, but otherwise it will be publicly accessible soon enough.

Wednesday, January 19, 2011

Data Storage

One of the fundamental challenges of Abundant is storing bug data.  There are several conflicting challenges that must all be met in order to store data in ways that address all needs.

The data format must:

Be Human Readable
In order to best integrate with version control, we want the data format to be text based (as opposed to in some sort of database) so that users can see the bugs and track changes directly in the version control system.  Furthermore, version control works best with text content, as it is easier to compare changes automatically.

Be Able to Store Metadata
A lesson learned from developing b is that metadata (assigned-to, bug status, etc.) needs to stick with the actual data like issue description and comments.  In b metadata was stored separately in order to ease caching, but I believe it makes more sense to keep cached data for speed separate from all the actual data.  The data structure therefore needs to store structured data that is machine readable, at the same time as remaining human readable.

Be Fast to Parse
We want this to be scalable and that means fast access to any bug.  This can largely be done with untracked cache files, but assuming each bug is stored in its own file, each file should be fast to display without caching.  Listing, browsing, and filtering can be assisted with some sort of caching or indexing mechanism that will remain invisible to the user.

These factors affect what formats we can use, some options include:

Plain Text / Custom Format
A plain text file is the most human readable, but the least computer parseable.  This is what was used for b: text in the file was split into sections denoted by titles in square brackets, any section that wasn't empty was displayed, and metadata was stored in a separate file.
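To make the tradeoff concrete, here is a minimal sketch of parsing such a section-based file.  The section titles and issue text are illustrative, not b's actual schema; the point is that even this simple format needs a hand-written parser.

```python
import re

def parse_sections(text):
    """Split a b-style issue file into {section_title: body} pairs.

    A section starts with a title in square brackets on its own line,
    e.g. "[comments]"; everything until the next title is its body.
    """
    sections = {}
    current = None
    body = []
    for line in text.splitlines():
        match = re.fullmatch(r"\[(.+)\]", line.strip())
        if match:
            if current is not None:
                sections[current] = "\n".join(body).strip()
            current = match.group(1)
            body = []
        elif current is not None:
            body.append(line)
    if current is not None:
        sections[current] = "\n".join(body).strip()
    return sections

issue = parse_sections(
    "[description]\nCrash on startup\n\n[comments]\nSeen on Windows only\n"
)
# Only non-empty sections would then be displayed to the user.
```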

XML / JSON / Other Standard Data Format
A structured text file which will contain both user content and metadata in one file per bug.  The primary advantage is this will make working with the data very easy, as there exist powerful parsers for standard data formats in both Python and most other languages.  This seems like a better option than a custom format.
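By contrast, a standard format gets parsing for free from the standard library.  The sketch below uses JSON with purely hypothetical field names (this is not Abundant's actual schema) to show that a single file can hold both metadata and user content while staying readable and diff-friendly.

```python
import json

# A hypothetical issue record -- the field names are illustrative only.
issue = {
    "id": "a1b2c3",
    "title": "Crash on startup",
    "status": "open",
    "assigned_to": "michael",
    "comments": [
        {"author": "michael", "text": "Seen on Windows only"},
    ],
}

# One file per bug: serializing with indentation keeps the text human
# readable and lets version control produce clean line-based diffs.
text = json.dumps(issue, indent=2, sort_keys=True)

# Reading it back is a single call to the standard-library parser.
assert json.loads(text) == issue
```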

A potential limitation of this method (and to some extent the one above) is in how Mercurial stores changes.  More details can be found in this email thread.

A File-based Database
Another alternative is a database system designed to work well with version control.  This might be ideal, as it would (presumably) work flawlessly with Mercurial's change tracking while remaining efficient.  The problem is that I don't know of such a tool; it seems fairly likely that something like this exists, but I haven't found it.

These are all the challenges and options that come to mind, what are your thoughts?

Thursday, December 16, 2010


Well, I have done a truly terrible job of keeping this blog up to date over the last semester, but I've done a lot of work and finalized what I will be working on.  So I give you my proposal for Abundant, a distributed issue tracker!

The Proposal:

The Presentation - Scribd seems to dislike the font I used for some reason; click the 'Download' link below if it looks bad:

Unfortunately, Google doesn't appear to support uploading anything other than images and videos to Blogger, so the links above point to Scribd and could potentially break in the future.  Hopefully not.

Thursday, September 2, 2010

System.out.println("Hello world");

Hello all!

This blog is intended to be a place for me to chronicle the work I will be doing over the next school year developing my senior project in the Computer Science department at Willamette University. With apologies to non-technical readers, this blog is likely to be somewhat dense and is not intended for a general public readership.

At this point, I am exploring different potential ideas for what I could be working on. I have two ideas so far, neither of which is fully formed, but both would be enjoyable, interesting, and challenging.

Developing an Operating System and Multi-Threaded Virtual Machine
My first idea, which I have had since taking CS353 sophomore year (where I developed the PC231, a simple virtual machine for writing assembly code), is to expand upon the PC231 in the following ways:
  • Improve the architecture (larger word size, two's complement, more hardware commands, etc.) - potentially/probably trying to replicate a more common architecture
  • Develop a limited kernel to handle memory management and execution control
  • Develop a rudimentary operating system to allow the user to execute arbitrary commands
  • Develop a virtual file system to integrate with the virtual machine
This would be a fairly new field for me - outside of CS353 I have never done any kind of assembly or low-level programming. Exploring this area would be interesting; however, my limited experience and knowledge make it difficult to predict whether this is a reasonable goal and whether the amount of work is feasible.

Develop A Lightweight Issue Tracker for Use With Distributed Revision Control Systems
Over the summer, I developed an extension for Mercurial, a distributed revision control system. This extension is a lightweight, command-line bug tracker called b. I would like to develop this into a free standing tool which can be used anywhere, with few system requirements, no setup or configuration necessary, and a minimal footprint. Specifically, I intend to do the following:
  • Remove Mercurial dependency and turn b into a standalone program
  • Enable it to integrate cleanly with the major distributed version control tools (Mercurial, Git, Bazaar)
  • Develop an internal web server to provide a web interface to view and manage issues
  • Develop plugins for major IDEs (Eclipse, others if time permits)
  • Develop a graphical front-end for non-command line use
The different facets of this project would allow me to explore many fields at once, ranging from Python programming to web development to software management concepts. This would be an interesting and valuable project, both personally - because I would love to use such a tool - and for others who might find it useful as well.

And these are my two ideas at this point. Comments, questions, and alternative ideas are welcome!