the next steps for uriplay's content store
Over my first two weeks at MetaBroadcast, we’ve been working on the next version of URIplay. Version 2.0 builds on the very flexible API developed to support Test Tube Telly and a number of other projects. The focus has been on the performance and scalability of the platform, to push the volume of meta-data from the hundreds of thousands into the millions.
URIplay has been working quite happily on MySQL for some time now, all wrapped up in the popular ORM framework, Hibernate (for better and for worse). Clearly it would be possible to continue with the current setup but this would require developing application logic that shards broadcast information across many MySQL instances, along with de-normalising the schema to remove joins.
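To give a feel for the kind of application logic that sharding would mean, here is a minimal sketch of hash-based routing in Python. It is not URIplay code: the shard URLs, the key format and the `shard_for` function are all made up for illustration, and a real setup would also need connection handling, re-balancing and cross-shard queries.

```python
import hashlib

# Hypothetical MySQL instances; these addresses are placeholders, not real hosts.
SHARDS = [
    "mysql://db0.example/uriplay",
    "mysql://db1.example/uriplay",
    "mysql://db2.example/uriplay",
]


def shard_for(key, shards=SHARDS):
    """Pick a shard for a key using a stable hash of the key.

    hashlib is used instead of the built-in hash() so that the same key
    maps to the same shard across processes and restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Every read and write for a given broadcast goes to the same instance.
target = shard_for("broadcast:example-123")
```

The catch with this simple modulo scheme is that changing the number of shards remaps most keys, which is part of why doing it by hand gets expensive.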
So, with that in mind, we’ve been looking at what other data-stores are available that would give us the performance and scalability we need and fit our data model and usage patterns. URIplay provides a very rich set of potential queries on a highly structured data model. We’re effectively saying that "we want it to be thunderingly quick, we want it to scale out to an enormous size and we want to be able to query it in pretty much any way we can think of". Great.
A number of people have described the criteria by which they judged the suitability of available data-stores. RJ's Anti-RDBMS post was written a while ago now, but still provides a good matrix of the merits of alternatives to MySQL (it could do with a bit of an update). Similarly Ryan King focusses on the operational issues in An Interview with Ryan King.
To see how we’d model the current schema in stores that aren’t relational, we chose to implement URIplay backed by both Cassandra and MongoDB. This gave us, at the risk of being flamed, the best-of-breed in both key/value(ish) and document-based storage, so we could compare the two storage types and evaluate two very popular implementations.
Cassandra is a mixture of column-based and row-based DBMS (so much so that neither Wikipedia page references it). I’ll leave the exact definition to smarter people than I but, crudely, it’s like using a hash that’s either 4 or 5 levels deep, no more, no less. This makes it pretty close to a strict key-value store, but with a little more wiggle room.
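As a rough illustration of the "hash that’s 4 or 5 levels deep" idea, here is the shape modelled with plain Python dictionaries. This is only a sketch of the structure, not Cassandra’s API: the level names follow my reading of the articles mentioned below (keyspace, column family, row key, optional super column, column), and every name and value in it is invented.

```python
# Four levels: keyspace -> column family -> row key -> column name -> value.
# Five levels: the same, with a super column between row key and column name.
store = {
    "URIplay": {                          # keyspace (hypothetical name)
        "Items": {                        # standard column family
            "item:123": {                 # row key
                "title": "Example Episode",
                "publisher": "example.org",
            },
        },
        "Broadcasts": {                   # super column family
            "item:123": {                 # row key
                "broadcast:1": {          # super column
                    "channel": "one",
                    "start": "2010-03-01T20:00:00Z",
                },
            },
        },
    },
}


def get(data, *path):
    """Walk the nested hash one level per path element."""
    node = data
    for key in path:
        node = node[key]
    return node


title = get(store, "URIplay", "Items", "item:123", "title")                      # 4 levels
channel = get(store, "URIplay", "Broadcasts", "item:123", "broadcast:1", "channel")  # 5 levels
```

Anything you want to look up has to be reachable by walking keys like this, which is why every extra query pattern tends to mean building another index by hand.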
The most useful articles we found were Arin Sarkissian's WTF is a SuperColumn and Up and running with Cassandra by Evan Weaver. There’s not a huge amount of helpful information out there and both the programmatic and command-line interfaces to running instances are pretty opaque. It’s quite clear that Cassandra is a safe bet. It’s bloody quick and it scales out well for some really enormous sites. However, this also means that you’re taking on a lot of operational development, of tools, client-libraries and interfaces to give you the kind of access and speed of development that you need.
MongoDB is a document-oriented store (and is referenced in the Wikipedia entry) which means you store whole structured documents (JSONish ones in this case). MongoDB then lets you query and retrieve those documents based on the data within those documents, no matter how deeply into the document structure your query goes.
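To make "no matter how deeply into the document structure your query goes" concrete, here is a small pure-Python sketch that mimics matching on a dotted path into nested documents. It is not MongoDB’s implementation and does not use a MongoDB driver; the documents, field names and the `resolve`, `matches` and `find` helpers are invented for illustration. In MongoDB itself the equivalent would be a dot-notation query, something along the lines of `{"versions.broadcasts.channel": "one"}`.

```python
def resolve(doc, dotted_path):
    """Return every value reachable at dotted_path, descending into lists."""
    nodes = [doc]
    for part in dotted_path.split("."):
        next_nodes = []
        for node in nodes:
            # A list of sub-documents is searched element by element.
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and part in candidate:
                    next_nodes.append(candidate[part])
        nodes = next_nodes
    values = []
    for node in nodes:
        if isinstance(node, list):
            values.extend(node)
        else:
            values.append(node)
    return values


def matches(doc, query):
    """True if every dotted path in the query resolves to the expected value."""
    return all(expected in resolve(doc, path) for path, expected in query.items())


def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical broadcast meta-data, loosely shaped like a nested item document.
items = [
    {"title": "Episode A", "versions": [{"broadcasts": [{"channel": "one"}, {"channel": "two"}]}]},
    {"title": "Episode B", "versions": [{"broadcasts": [{"channel": "three"}]}]},
]

on_channel_one = find(items, {"versions.broadcasts.channel": "one"})
```

The point is that the query addresses data three levels down without the document having to be flattened or joined first.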
Basically, MongoDB gives you a lot of power. It’s very enjoyable to use and there are all kinds of features that you don’t even get in MySQL (regular expressions, map/reduce and full-text search to name a few). In a lot of ways it behaves like an RDBMS, even in how you set up and use indexes.
In fact, I can well imagine that we’ll end up using a combination of any number of these technologies for various parts of the overall URIplay platform. For the main repository of broadcast meta-data we certainly feel that the document-oriented approach best fits our model and MongoDB gives us the flexibility that such an open API requires.
This approach is not without its risks. MongoDB really is very quick and the power it gives you makes you wonder where the catch is. While Cassandra would require considerable work to develop all the indexes and logic to query them, you would always be able to bank on its scalability. You cannot make the same assertions about the alpha release of MongoDB’s sharding as it has not gone through the same rigour. However, considering the levels that URIplay needs to scale to and MongoDB’s existing performance, quality of community and direction, we’re confident that it’s the right choice…
…if we’re wrong, we’ll change it.