I’ve been working at MetaBroadcast for a little over 10 months now, and I just happened to notice a nice, healthy statistic. Aside from a few machines, which I will get my claws into very soon, the oldest running instance in our cloud infrastructure is just 9 months old. Before I started, we had instances kicking about for over 18 months, with over a year’s uptime. In a cloud infrastructure, what we should care about is service uptime, not server uptime.
the devil is in the detail
We’re not talking about reboots either. As much as there’s an instinctive habit to keep instances running unless there’s a terrifying kernel vulnerability, I’ve been regularly replacing instances as part of our fire-drill processes and as upgrade replacements for legacy Ubuntu releases.
All this means we routinely test our deployment processes, keep our tools updated for the latest Ubuntu releases, and ensure we’re designing our system architectures to tolerate instance failure.
designing for the worst case scenario
Being in the cloud (in our case, Amazon’s) means that one should never grow too attached to one’s instances. Designing your services and systems with single points of failure in a cloud environment is a recipe for pain and misery. By embracing this philosophy, we’ve ensured that our most critical services are appropriately distributed over multiple ‘availability zones’, with a healthy level of redundancy and fault tolerance in our instances. As a result, we can do things like replace our entire infrastructure over the space of a year.
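To make the placement idea concrete, here’s a minimal Python sketch of spreading a service’s instances round-robin across availability zones, so losing any one zone never takes out more than its share of the fleet. The zone names and helper function are illustrative, not our actual tooling:

```python
import itertools

def spread_across_zones(instance_count, zones):
    """Assign each instance a zone, round-robin, so no single
    availability zone holds more than its fair share of the service."""
    cycle = itertools.cycle(zones)
    return [next(cycle) for _ in range(instance_count)]

# Five instances over three zones: no zone ends up with more than two.
placement = spread_across_zones(5, ["eu-west-1a", "eu-west-1b", "eu-west-1c"])
print(placement)
# ['eu-west-1a', 'eu-west-1b', 'eu-west-1c', 'eu-west-1a', 'eu-west-1b']
```

In practice an AWS auto-scaling group does this distribution for you when you give it multiple zones; the point is simply that no zone should be a single point of failure.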
all instances are equal, but some instances are more equal than others
The philosophy is great, but expensive, so it’s important to pick and choose where you dedicate extra resources to fault tolerance. Everything else should be well documented and, ideally, managed by configuration management tools to allow rapid redeployment. This is why we try to ensure that the first step in adopting a new technology is getting it into Puppet.
what is downtime?
Well, it’s inevitable, but how much downtime you suffer is something you can control. I will talk at length in a future blog post about our monitoring and alerting services as we move away from Nagios to a Sensu and Graphite setup, but needless to say, having the proper tools, alerting rules and escalation policies is important here. Having tools that allow one-click deployment of replacement instances is also really useful when it’s 3am and you’re being woken up by a robotic voice.
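The essential property of such a replacement tool is ordering: bring the new instance up before tearing the old one down, so capacity never dips mid-swap. A toy Python sketch of that ordering, where `launch` and `terminate` are stand-ins for whatever cloud API calls the real tool wraps:

```python
def replace_instance(old_id, launch, terminate):
    """Launch the replacement first; only terminate the old instance
    once the new one exists, so capacity never drops mid-swap."""
    new_id = launch()
    terminate(old_id)
    return new_id

# Stubbed cloud calls, recording the order of operations.
log = []

def launch():
    log.append("launch i-new")
    return "i-new"

def terminate(instance_id):
    log.append(f"terminate {instance_id}")

print(replace_instance("i-old", launch, terminate))  # i-new
print(log)  # ['launch i-new', 'terminate i-old']
```

Everything else (AMI choice, Puppet runs, load-balancer registration) hangs off those two stubs; the one-click part is just scripting them so a half-asleep engineer can’t get the order wrong.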
bracing for impact
So, being prepared is important, and ensuring your engineers are ready to tackle outages even more so. This is why we conduct weekly fire drills: simulation exercises that paint scenarios for the designated on-call engineer to respond to and resolve. We typically make the engineer re-deploy or rebuild a service, or recover data from backups and deploy it alongside existing services. Sometimes, if we have our happy fault-tolerant ecosystem, we’ll simply terminate an instance and let the engineer work on fixing it. This is a manual “chaos monkey” approach, inspired by Netflix. As we and our engineering team grow more accustomed to the infrastructure, we intend to roll out an actual chaos monkey service to automate this process.
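The automated version boils down to very little logic: pick a random instance from a fault-tolerant group and terminate it, while refusing to touch anything that can’t safely die. A hedged sketch in Python, with made-up instance IDs and tags, and the actual termination left out:

```python
import random

def pick_victim(instances, protected=frozenset({"critical"})):
    """Pick one instance at random to kill in a drill,
    skipping anything carrying a protected tag."""
    candidates = [i for i in instances if not (set(i["tags"]) & protected)]
    if not candidates:
        raise RuntimeError("no safe instance to terminate")
    return random.choice(candidates)

fleet = [
    {"id": "i-0a1", "tags": ["atlas", "worker"]},
    {"id": "i-0b2", "tags": ["atlas", "worker"]},
    {"id": "i-0c3", "tags": ["atlas", "critical"]},  # always spared
]
print(pick_victim(fleet)["id"])  # i-0a1 or i-0b2, never i-0c3
```

Netflix’s real Chaos Monkey adds scheduling, opt-in groups and audit trails around the same core idea: the kill must be survivable by design.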
how I learned to stop worrying and love the cloud
I came from a 24/7 live video-streaming service company, and the civil service before it, so these were physical servers where downtime carried serious consequences. Working at MetaBroadcast is my first foray into a fully virtualised cloud infrastructure, and I love it. It’s very liberating in that it enforces best practices. So when we eventually move into a more physical environment (inevitable as companies grow, well, unless you’re Netflix), we’ll already have the tools and ideals in place to make sure instance deployment doesn’t fall back into bad habits.
auto-scaling food for thought
Once you start exploring the full extent of a dynamic, fault-tolerant environment, you reach some interesting issues with monitoring services. We currently use AWS auto-scaling for our Atlas product, which automatically changes the number of instances serving queries depending on the load the service is under. This means our instances are highly expendable, and as such our monitoring service needs to be more dynamic and cloud-aware. Unfortunately, Nagios is not a good tool for this, which is one of the main reasons we intend to move to tools like Sensu. As expendable as the instances are, we still care about service metrics and instance health, and currently that’s very difficult to achieve with Nagios. Embracing a low-shelf-life infrastructure leads you to embrace more dynamic monitoring tools.
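The core of “cloud-aware” monitoring is a reconciliation step that Nagios’s static host lists make painful: diff what the monitoring system believes exists against what the auto-scaling group is actually running, then register and deregister checks accordingly. Sensu gets this largely for free because clients self-register; the helper below is just an illustrative Python sketch of the diff itself:

```python
def reconcile_checks(monitored, running):
    """Diff the hosts the monitoring system knows about against the
    instances actually running, so checks follow the fleet as
    auto-scaling adds and removes instances."""
    monitored, running = set(monitored), set(running)
    return {
        "register": sorted(running - monitored),
        "deregister": sorted(monitored - running),
    }

# One instance was scaled in (i-old) and another scaled out (i-new).
print(reconcile_checks({"i-aaa", "i-old"}, {"i-aaa", "i-new"}))
# {'register': ['i-new'], 'deregister': ['i-old']}
```

Run on every scaling event (or on a short timer) and you stop being paged for instances that were deliberately killed, while never losing sight of the ones that matter.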