Wednesday, January 19, 2011

Data Storage

One of the fundamental challenges of Abundant is storing bug data.  Several conflicting requirements must all be met for the storage format to address every need.

The data format must:

Be Human Readable
To integrate well with version control, we want the data format to be text-based (as opposed to living in some sort of database) so that users can see the bugs and track changes directly in the version control system.  Furthermore, version control works best with text content, as it is easier to compare changes automatically.

Be Able to Store Metadata
A lesson learned from developing b is that metadata (assigned-to, bug status, etc.) needs to stick with the actual data, such as the issue description and comments.  In b, metadata was stored separately in order to ease caching, but I believe it makes more sense to keep data cached for speed separate from all the actual data.  The data structure therefore needs to store structured, machine-readable data while remaining human readable.

Be Fast to Parse
We want this to be scalable, and that means fast access to any bug.  This can largely be done with untracked cache files, but assuming each bug is stored in its own file, each file should be fast to display without caching.  Listing, browsing, and filtering can be assisted by some sort of caching or indexing mechanism that remains invisible to the user.
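
As a rough sketch of the caching idea, here is one way an untracked index could work.  It assumes one JSON file per bug, and the field names (status, assigned_to), the file layout, and the functions build_index and open_bugs are all hypothetical, chosen only for illustration.  The index re-reads a bug file only when its modification time has changed, so listing and filtering do not need to parse every bug.

```python
import json
import os


def build_index(bug_dir, cache_path):
    """Rebuild a metadata index, re-reading only bug files that changed.

    Assumes one JSON file per bug in bug_dir (a hypothetical layout).
    cache_path should point at an untracked file outside bug_dir.
    """
    try:
        with open(cache_path) as fh:
            cache = json.load(fh)
    except (OSError, ValueError):
        cache = {}  # missing or corrupt cache: start from scratch

    index = {}
    for name in sorted(os.listdir(bug_dir)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(bug_dir, name)
        mtime = os.path.getmtime(path)
        entry = cache.get(name)
        if entry is None or entry["mtime"] != mtime:
            # The bug file is new or modified, so parse it again.
            with open(path) as fh:
                bug = json.load(fh)
            entry = {
                "mtime": mtime,
                "status": bug.get("status"),
                "assigned_to": bug.get("assigned_to"),
            }
        index[name] = entry

    with open(cache_path, "w") as fh:
        json.dump(index, fh)
    return index


def open_bugs(index):
    """Filter using only the index, without opening any bug file."""
    return [name for name, entry in index.items() if entry["status"] == "open"]
```

Because the cache can always be rebuilt from the bug files, it can stay out of version control entirely.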


These factors affect which formats we can use.  Some options include:

Plain Text / Custom Format
A plain text file is the most human readable, but the least computer parseable.  This is what was used for b: text in the file was split into sections denoted by titles in square brackets, any section that wasn't empty was displayed, and metadata was stored in a separate file.
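
To make the bracketed-section idea concrete, here is a minimal sketch of a parser for that kind of file.  It is not b's actual code, and the exact syntax it accepts (a line consisting only of a title in square brackets) and the section names in the usage example are assumptions.

```python
import re

# A section starts at a line that is just a title in square brackets.
SECTION_RE = re.compile(r"^\[(?P<title>[^\]]+)\]\s*$")


def parse_sections(text):
    """Split text into {title: body}, keeping sections in file order."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            current = match.group("title")
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Lines before the first section title are ignored.
    return {title: "\n".join(lines).strip() for title, lines in sections.items()}


def non_empty_sections(text):
    """Return only the sections that have content, i.e. the ones to display."""
    return {title: body for title, body in parse_sections(text).items() if body}
```

For example, calling non_empty_sections on a file with filled-in [Details] and [Comments] sections and an empty [Steps] section returns only the first two.  Note that nothing here is structured beyond the section titles, which is why metadata had to live elsewhere.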

XML / JSON / Other Standard Data Format
A structured text file that contains both user content and metadata, one file per bug.  The primary advantage is that working with the data becomes very easy, as powerful parsers for standard data formats exist in Python and most other languages.  This seems like a better option than a custom format.

A potential limitation of this method (and to some extent the one above) is in how Mercurial stores changes.  More details can be found in this email thread.
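
As an illustration of the standard-format option, here is a small sketch using JSON and Python's built-in json module.  The field names are hypothetical, and JSON is only one candidate.  Writing with a fixed indent and sorted keys keeps one field per line in a stable order, which should help keep version control diffs small and readable.

```python
import json


def dump_bug(bug):
    """Serialize a bug dict to diff-friendly text: one field per line, stable key order."""
    return json.dumps(bug, indent=2, sort_keys=True) + "\n"


def load_bug(text):
    """Parse the text of a bug file back into a dict."""
    return json.loads(text)


# Hypothetical bug record: metadata and user content live together.
example_bug = {
    "title": "Crash on start",
    "status": "open",
    "assigned_to": "alice",
    "comments": ["Seen twice on a clean install."],
}
```

Here load_bug(dump_bug(example_bug)) gives back the original dict, so metadata and description travel together in a single file.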

A File-based Database
Another alternative is using a database system that works well with version control.  This might be close to ideal, as it would (presumably) work flawlessly with Mercurial's change tracking and also be efficient.  The problem with this method is that I don't know of such a tool, though it seems fairly likely to me that one exists.


These are all the challenges and options that come to mind.  What are your thoughts?