In my last post, I talked about the next version of Atlas and some of the changes we’ll be making. Today I’m going to go into more detail on what we’re going to be doing with queues.
Atlas has content from many different sources, and we’re adding more every week. A number of tasks need to run across large subsets of these sources to derive further information from them and, in common with much of the industry, these often take the form of batch jobs.
This works perfectly well but we’d like to improve on it for a couple of reasons. Firstly, as we bring in more and more data these jobs will start taking longer to run. Secondly, running batch jobs frequently is never going to be quite as timely as doing things as changes come in.
form an orderly queue, please
Lots of these tasks can be broken down into smaller tasks around individual pieces of content, and only need to be recomputed when the content itself changes: a great example of where a queue could be used! We’ll be adding an event bus into Atlas so that we’re able to act on updates as they come in.
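To make the contrast with batch jobs concrete, here is a minimal sketch in Python of recomputing derived data per change event. It isn’t Atlas code: `derive`, `DerivedIndex` and the shape of the events are all made up for illustration, and the "derived information" is just a word count.

```python
def derive(content):
    # Hypothetical derived value: the number of words in the title.
    return len(content["title"].split())


class DerivedIndex:
    """Keeps one derived value per content item, updated only on change events."""

    def __init__(self):
        self.values = {}
        self.recomputed = 0  # counts how much work we actually did

    def on_change(self, content):
        # Called once per change event; untouched content is never revisited.
        self.values[content["id"]] = derive(content)
        self.recomputed += 1


index = DerivedIndex()
events = [
    {"id": "c1", "title": "Form an orderly queue"},
    {"id": "c2", "title": "Atlas"},
    {"id": "c1", "title": "Form an orderly queue, please"},
]
for event in events:
    index.on_change(event)

print(index.values, index.recomputed)
```

The amount of work tracks the number of changes rather than the size of the whole dataset, which is the property we’re after.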
- Combining topics and queues to provide per-client topic queues: this lets us cleanly separate producers from consumers using broadcast topics for changes, yet also have a queue per logical destination so that we can scale up with multiple worker threads without message duplication.
- Administration interfaces: it’s critically important to see what’s going on in a queuing infrastructure. Both ActiveMQ and RabbitMQ offer decent consoles for this.
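
As a rough illustration of the per-client topic queue idea above, here is a minimal in-memory sketch in Python. It is not ActiveMQ or RabbitMQ code: the class `FanOutTopic` and the consumer names are invented for this example, and a real broker adds persistence, acknowledgements and network transport on top.

```python
from collections import deque


class FanOutTopic:
    """Toy stand-in for a broadcast topic that feeds one queue per logical consumer."""

    def __init__(self):
        self._queues = {}  # consumer name -> deque of pending messages

    def subscribe(self, consumer):
        # Each logical destination gets its own queue, created on first subscribe.
        self._queues.setdefault(consumer, deque())

    def publish(self, message):
        # The producer knows nothing about consumers: every queue gets a copy.
        for queue in self._queues.values():
            queue.append(message)

    def take(self, consumer):
        # Workers for the same consumer compete for messages on one queue,
        # so each message is handed to only one of them.
        queue = self._queues[consumer]
        return queue.popleft() if queue else None


topic = FanOutTopic()
topic.subscribe("search-indexer")  # hypothetical consumer names
topic.subscribe("equivalence")

for update in ["content-1 changed", "content-2 changed"]:
    topic.publish(update)

# Two workers sharing the "search-indexer" queue split the work between them.
worker_a = topic.take("search-indexer")
worker_b = topic.take("search-indexer")
print(worker_a, "|", worker_b)
```

Both consumers see every update, but within one consumer each update is handled once, however many workers are pulling from its queue.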
On the non-functional side, too, we were happy with the performance each of them offered. We plan to deploy the queuing infrastructure to AWS, which limits the options for high availability (we can’t have a shared disk across a highly available cluster, for example) and presents other challenges such as being able to deal with availability zone outages.
There was little to choose between the different options, but ultimately we chose ActiveMQ, as it had a slight edge on the HA configuration in AWS.
We don’t like getting woken in the middle of the night with an alert from PagerDuty if we can help it. It was therefore important from the get-go to have a resilient setup for such a key piece of infrastructure.
For that, we’ve deployed a 2-node infrastructure using medium AWS instances. These are in the same region and, as anyone who’s used AWS will have guessed by now, span two availability zones.
The nodes are run in a master/slave configuration with shared storage, so the slave is able to take over without intervention in case the master goes down. You obviously can’t use a SAN in the cloud for shared storage, so instead we’re using GlusterFS. Again, this is a 2-node configuration, spanning availability zones.
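On the client side, a master/slave pair like this is typically paired with ActiveMQ’s failover transport, where the Java client is given a list of brokers and reconnects to whichever one is currently active. The post doesn’t say how Atlas clients are configured, so treat the following as an assumption: a small Python helper (`failover_uri` is a made-up name, and the hostnames are placeholders) that builds such a connection URI using ActiveMQ’s default OpenWire port.

```python
def failover_uri(hosts, port=61616, randomize=False):
    """Build an ActiveMQ failover transport URI for the given broker hosts."""
    brokers = ",".join(f"tcp://{host}:{port}" for host in hosts)
    # randomize=false asks the client to try brokers in the listed order.
    return f"failover:({brokers})?randomize={str(randomize).lower()}"


# Hypothetical hostnames, one broker per availability zone.
uri = failover_uri(["mq-a.example.com", "mq-b.example.com"])
print(uri)
```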
Next up, we’ll be using it to keep our new search index up-to-date, and soon after that we’ll be looking at modifying our equivalence engine to perform real-time calculations.
With all of these changes in Atlas deer, we’re working hard to minimise the impact on existing functionality. We’re prototyping technology that’s new to Atlas on new pieces of work and then, once we’re happy with it, rolling it out more widely; this is no different. Starting small-scale with copying content into Cassandra has let us test and iron out the new infrastructure before moving on to using it for existing, core functionality.