Max Tomassi

protecting our canary from the dangerous world

We already introduced our Twitter ingest system, Canary, in the past, talking about some of its functionality and how it feeds other systems with the data it gathers from Twitter. Today I’ll talk about how important its reliability is for our architecture and what we’re doing in order to maximize it. Being a realtime ingest system connected to the Twitter streaming API, it’s very delicate and needs to be highly available: if for any reason it stops working we may lose data, and not losing data (tweets in this case) is exactly what we want to achieve.

the problem we’re tackling

As I said, if Canary for any reason dies (oh poor thing), we won’t be able to process the current tweets until it has been resurrected. But how could Canary die? It may run out of memory, for example. Let me explain: in our architecture Canary is heavily dependent upon an external messaging system (ActiveMQ at the moment). All the tweets Canary receives from Twitter are forwarded to a JMS topic so that our other systems (our counting systems, for example) can simply listen to it and receive them too. Things can always go wrong with external dependencies, and if they do we can’t stop people tweeting, so we have to have a backup plan. If we don’t, Canary starts queuing messages internally, and it might end up running out of memory if the external disruption is long-lived.

We want to protect our dear Canary from this kind of danger, so we planned to tackle this problem in two different ways:

  • Making the tweets Canary receives persistent until they are actually published externally
  • Making Canary highly available, with an idle clone instance that is ready to replace the primary one

making tweets persistent until they’re processed

This avoids Canary running out of memory if it’s not able to publish messages externally. What we chose at this stage is to use an internal persistent queue, so that even if Canary dies for some reason it can still drain the queue and process the tweets as soon as it comes back to life. We decided to use an embedded ActiveMQ broker for this use case: basically a JMS broker created within the same JVM as Canary. Both producers and listeners can then use the VM protocol in order to connect to it – not requiring remote connections makes this mechanism more reliable and efficient.
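
As a rough sketch of what this looks like in code (the broker name, data directory and ActiveMQ 5.x API usage here are illustrative assumptions, not our exact configuration), the embedded broker is started inside the JVM and everything connects to it over the vm:// transport:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;

    import org.apache.activemq.ActiveMQConnectionFactory;
    import org.apache.activemq.broker.BrokerService;

    public class EmbeddedBroker {

        public static ConnectionFactory startEmbeddedBroker() throws Exception {
            // Persistent broker living inside the Canary JVM: queued messages survive a restart
            BrokerService broker = new BrokerService();
            broker.setBrokerName("canary");
            broker.setPersistent(true);
            broker.setDataDirectory("data/embedded-broker");
            broker.start();

            // Producers and listeners connect over the in-JVM "vm" transport,
            // so no remote connection is involved
            return new ActiveMQConnectionFactory("vm://canary?create=false");
        }

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = startEmbeddedBroker();
            Connection connection = factory.createConnection();
            connection.start();
            // ... create sessions, producers and listeners against the embedded queue
        }
    }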

To summarise, the complete flow for processing a tweet is now:

  1. A tweet is received from Twitter through the Streaming API

  2. A JMS producer publishes the tweet into an embedded queue

  3. A JMS listener takes the tweet from the embedded queue

  4. The tweet is then forwarded to the external JMS topic

Only when the last action is successfully executed is the tweet removed from the embedded queue. This is achieved by making the listeners in step 3 use transacted sessions, as sketched below.
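
Here is a simplified sketch of such a transacted listener (the queue and topic names are illustrative assumptions): the tweet is consumed from the embedded queue and only committed – and therefore removed – once the publish to the external topic has succeeded.

    import javax.jms.Connection;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Session;

    public class TweetForwarder {

        private final Session embeddedSession;  // transacted session on the embedded broker
        private final MessageConsumer consumer;  // reads from the embedded queue
        private final MessageProducer producer;  // publishes to the external topic

        public TweetForwarder(Connection embedded, Connection external) throws Exception {
            embeddedSession = embedded.createSession(true, Session.SESSION_TRANSACTED);
            Session externalSession = external.createSession(false, Session.AUTO_ACKNOWLEDGE);
            consumer = embeddedSession.createConsumer(embeddedSession.createQueue("tweets.embedded"));
            producer = externalSession.createProducer(externalSession.createTopic("tweets"));
        }

        public void forwardOneTweet() throws Exception {
            Message tweet = consumer.receive();
            try {
                producer.send(tweet);
                // Only now is the tweet removed from the embedded queue
                embeddedSession.commit();
            } catch (Exception e) {
                // The tweet stays in the persistent embedded queue and will be redelivered
                embeddedSession.rollback();
                throw e;
            }
        }
    }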

Using this technique, even if the producers to the external topic are all blocked, tweets can still be published to the embedded persistent queue (rather than piling up in memory as they used to) and be drained as soon as the problem with the external JMS provider has been solved.

making canary highly available: master/slave architecture

One problem is tackled, but we still need to make Canary highly available. If the instance dies for any reason, we need to recover from the failure as soon as possible. The solution we adopted is to have a clone instance ready to replace the unhealthy one. James already wrote about how we handle the master/slave election through etcd, which also allows the slave instance to take control when the master dies.
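
For the curious, a minimal sketch of what such an election can look like against the etcd v2 HTTP API (the key path, port and TTL are assumptions for illustration, not our actual setup): each instance tries to atomically create a key with a TTL, and only the one that succeeds acts as master while it keeps refreshing the key.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class MasterElection {

        private static final String ELECTION_KEY = "http://127.0.0.1:4001/v2/keys/canary/master";

        private final HttpClient client = HttpClient.newHttpClient();

        // prevExist=false makes the PUT succeed only if no master is currently registered;
        // the TTL makes the key expire if the master stops refreshing it (i.e. it died)
        public boolean tryToBecomeMaster(String instanceId) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(ELECTION_KEY + "?prevExist=false"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .PUT(HttpRequest.BodyPublishers.ofString("value=" + instanceId + "&ttl=10"))
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // 201 Created: we are the master; 412 Precondition Failed: someone else already is
            return response.statusCode() == 201;
        }
    }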

Another problem we want to handle is the loss of tweets in the interval between the master instance dying and the slave instance taking control. For this reason the slave instance is also connected to the Twitter Streaming API and publishes the tweets it receives into an embedded queue that is used as a buffer. This buffer queue only needs to contain recent tweets: enough to cover the time needed by the slave to become the new master. As soon as the slave takes control it drains the tweets from the buffer queue and then starts working as described in the previous section.
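
One simple way to keep that buffer bounded to recent tweets – sketched below under the assumption that JMS message expiration is used for this, with an illustrative queue name and a one-minute window standing in for the real failover time – is to publish into the buffer queue with a time-to-live that covers the expected failover window:

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    public class SlaveBuffer {

        private final Session session;
        private final MessageProducer bufferProducer;

        public SlaveBuffer(Connection embeddedConnection) throws Exception {
            session = embeddedConnection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            bufferProducer = session.createProducer(session.createQueue("tweets.buffer"));
            // Older messages expire, so the buffer never grows far beyond
            // the window needed for the slave to take over
            bufferProducer.setTimeToLive(60_000);
        }

        // Called for every tweet the slave receives from the Streaming API
        public void buffer(String rawTweet) throws Exception {
            TextMessage message = session.createTextMessage(rawTweet);
            bufferProducer.send(message);
        }
    }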

looking forward to hearing your opinions

I’ve described the measures we’re taking in order to make Canary more reliable and highly available. We’re always looking for great ideas to improve our architecture, so feel free to send us your suggestions by tweeting at us or leaving a comment below.
