Tom McAdam

6 aws tips from our time in the trenches

We were early adopters of Amazon Web Services here at MetaBroadcast and have been using it for all our hosting needs since the start of 2009. We'd like to share a few of the things we've learnt over our time with it.

1. things fail, deal with it

The first tip sounds obvious but is so important that it's worth stating here. Whilst every good engineer knows that failures can happen and should be planned for, it's often overlooked until the worst does indeed happen. In the world of physical infrastructure, if a host dies you can call your ISP or get yourself to the data centre to replace that faulty PSU. With AWS, there's much less of a guarantee that you'll be able to get an instance back.

The SLA doesn't make any promises about individual instances. Hell, it doesn't even make promises about availability zones (which are, effectively, isolated data centres). The guarantees are all about an entire region, a cluster of data centres in, say, Ireland (eu-west-1 in AWS speak), being available 99.95% of the time. You should plan for an entire availability zone going away in an incident; when that happens, the outage is generally more severe than your average data centre outage and takes much longer to return to normal service. We've not seen it in Europe, but there have been multi-availability-zone outages, too.

Support can sometimes help in getting a single instance back, but generally we haven’t found them to be timely enough to be of use. Much quicker to start up a replacement instance. We’ve only ever had the basic support plan, so your mileage may vary if you’re willing to pay more.

Wherever possible, use auto scaling groups. AWS will then scale your application according to the criteria you specify, such as CPU usage. Health checks will also make sure that if an instance dies, another is automatically spun up to replace it. This isn't going to work in all cases, such as applications that aren't designed to cluster (our build server, for example).
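If you're scripting that setup, a minimal sketch of such a group, using Python and the boto3 SDK, might look like the following. The group name, AMI ID, zones and thresholds are all placeholders rather than our real configuration.

```python
# Sketch: an auto scaling group that replaces dead instances and scales on CPU.
# All names, AMI IDs and thresholds are illustrative placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# A launch configuration describes what each replacement instance looks like.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-lc",
    ImageId="ami-00000000",          # placeholder base AMI
    InstanceType="m1.small",
)

# Keep at least 2 instances running across two availability zones; instances
# that fail the EC2 health check are terminated and replaced automatically.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=2,
    MaxSize=6,
    AvailabilityZones=["eu-west-1a", "eu-west-1b"],
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)

# Scale on average CPU, as one example of "criteria you specify".
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,
    },
)
```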

2. avoid ebs-backed instances where possible

The root partition on an AWS instance can be either on an EBS volume or on the local (or ephemeral) disk of the physical host on which the virtual host is running. EBS volumes are akin to a remote NAS volume, and therefore have the advantage that the data on them is long-lived, unlike ephemeral disks, where data is lost when an instance is terminated. EBS-backed instances can therefore be shut down and restarted at will, unlike those using the local disk, which, when terminated, are lost forever.
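If you're not sure which kind you're running, the root device type is reported by the API. A minimal sketch with the boto3 SDK, assuming default credentials and an illustrative region:

```python
# Sketch: check whether instances are EBS-backed or instance-store-backed.
# Assumes the boto3 SDK and default credentials; the region is illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        # RootDeviceType is "ebs" for EBS-backed instances and
        # "instance-store" for those using the local ephemeral disk.
        print(instance["InstanceId"], instance["RootDeviceType"])
```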

While EBS-backed instances sound great on paper, we've found the reality to be very different. In the aftermath of the 2011 eu-west-1 outage (the one initially blamed on a lightning strike) we found our EBS-backed instances were worst affected and took longest to recover. Given point 1, and having to plan for failure anyway, we prefer to treat all instances as throwaway and not even attempt to recover an instance when it fails. In that case there's no advantage to EBS-backed instances anyway, so we use instances with local instance storage in most situations. If you really do need EBS-backed instances, be prepared to ditch them when they fail.

Quite aside from the reliability of EBS-backed instances, it's also worth pointing out that EBS volumes in general are much slower than local storage. There are optimisations to be had, such as RAIDing EBS volumes together and the recently announced higher-performance (Provisioned IOPS) EBS volumes.
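If you do go down the higher-performance route, creating such a volume is a single API call. A hedged sketch with the boto3 SDK; the size, IOPS figure and zone are arbitrary illustrations, not a recommendation:

```python
# Sketch: create a Provisioned IOPS ("io1") EBS volume with boto3.
# Size, IOPS and availability zone are illustrative values only.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

volume = ec2.create_volume(
    AvailabilityZone="eu-west-1a",
    Size=100,            # GiB
    VolumeType="io1",    # Provisioned IOPS volume type
    Iops=1000,           # IOPS to provision for this volume
)
print(volume["VolumeId"])
```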

3. automate, automate, automate

So, we need to deal with failures, and we don't think EBS-backed instances are worth their failure modes. That means we need good tools in place to get instances up and running. Step up your configuration management tool of choice; ours is puppet, along with a script to start up and bootstrap an instance to a given configuration. Our one-off instances that aren't auto scaled can be started and configured from a base Ubuntu image with a single command line, and are ready to use within a couple of minutes. We could mint our own AMIs, one for each of our instance types, but we much prefer starting instances from base AMIs of stock installs. It means that all the configuration is scripted, so we can upgrade the operating system easily enough, just by swapping the base image.
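Our real bootstrap lives in puppet plus a wrapper script, but the shape of that single command is roughly the sketch below, using the boto3 SDK. The AMI ID, key pair, instance type and puppet master hostname are all placeholders.

```python
# Sketch: start a stock Ubuntu instance and hand it a bootstrap script via
# user-data, which then pulls its configuration from puppet. All identifiers
# are placeholders; a real bootstrap would be more involved.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

bootstrap = """#!/bin/bash
# install the puppet agent and fetch the configuration for this instance
apt-get update && apt-get install -y puppet
puppet agent --server puppet.example.internal --waitforcert 60
"""

ec2.run_instances(
    ImageId="ami-00000000",     # placeholder: a stock Ubuntu base AMI
    InstanceType="m1.small",
    MinCount=1,
    MaxCount=1,
    KeyName="ops-key",          # placeholder key pair
    UserData=bootstrap,         # executed on first boot via cloud-init
)
```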

4. naming names

AWS instances have two IP addresses: an internal one for communication within AWS, in the private 10.x.x.x range, and an external one for traffic outside of AWS. Hostnames are great, so we use Route 53 to name our AWS instances. We want to use the 10.x.x.x interface when communicating between AWS instances, because it's free within the same availability zone. We also want to use the public IP address to reach the instance from outside of AWS. A nifty trick in the way DNS resolution works inside AWS makes both possible.

Each instance has a public DNS name you can get from the AWS console. For example, ec2-79-123-32-173.eu-west-1.compute.amazonaws.com. It turns out that AWS’ internal DNS resolvers, which are configured by default on instances, will resolve that hostname to its internal IP address. AWS’ external DNS resolvers will resolve it to its external IP address. By creating a CNAME record of your own which points to the AWS public DNS name, you can use this throughout your infrastructure and know that the right interface will be used all the time.
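In Route 53 terms that's just a CNAME from a name you control to the instance's public DNS name. A minimal sketch with the boto3 SDK; the hosted zone ID and hostname are placeholders:

```python
# Sketch: CNAME our own hostname to the instance's AWS public DNS name, so
# lookups from inside AWS get the internal IP and lookups from outside get
# the external one. The hosted zone ID and hostname are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000",   # placeholder hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "db1.example.com.",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [
                    {"Value": "ec2-79-123-32-173.eu-west-1.compute.amazonaws.com"},
                ],
            },
        }],
    },
)
```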

5. availability zone names differ across accounts

Within a region the availability zones are named a, b, c and so on. So in Ireland there's eu-west-1a, eu-west-1b and eu-west-1c. Of course, everyone wants to put their instances in the first zone! To stop that creating an imbalance, the physical location Amazon calls eu-west-1a in my account may well differ from the one they call eu-west-1a in yours.

This is important for a few reasons. Firstly, resilience: you need to make your infrastructure resilient to an availability zone disappearing, so if you happen to have dependencies across accounts, you need to know which instances are actually in the same data centre. Secondly, cost: data transfer between instances' internal network interfaces is free within an availability zone, but data transfer across availability zones is not. Thirdly, incidents: you may see people on Twitter talking about an incident in a particular availability zone. When they say they're having problems in eu-west-1a, that may be your eu-west-1b!

The zone mappings are not something AWS disclose in their interface, but our AWS sales rep told us how our availability zones mapped. There appear to be ways to glean it from the output of the command-line tools, too.
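If you want to poke at it yourself, comparing what each account sees is a start. A minimal sketch with the boto3 SDK, using two named credential profiles as placeholders; more recent API responses also carry a zone ID alongside the name, which identifies the underlying zone consistently across accounts.

```python
# Sketch: list the availability zones as seen from two different accounts.
# The profile names are placeholders for credentials belonging to each account.
import boto3

for profile in ("account-a", "account-b"):
    session = boto3.Session(profile_name=profile, region_name="eu-west-1")
    ec2 = session.client("ec2")
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        # The ZoneName (eu-west-1a etc.) is account-specific; newer API
        # responses also include a ZoneId that is stable across accounts.
        print(profile, zone["ZoneName"], zone.get("ZoneId", "n/a"))
```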

6. stick to one account

As you can guess from point 5, we've worked with multiple AWS accounts, for a couple of reasons: to separate backups from their source, to avoid the costly mistake of deleting both, and to be able to see costs across different pieces of infrastructure. However, we no longer need to, thanks to the recent addition of cost allocation, which lets us tag instances and group costs by tag. We'd already solved the backup problem using IAM, with different users and policies set up so that only designated backup users are able to delete backups.
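Tagging itself is one API call per resource; once cost allocation is switched on in the billing settings, the tags become dimensions to group costs by. A minimal sketch with the boto3 SDK; the instance ID and tag values are placeholders:

```python
# Sketch: tag an instance so that, with cost allocation enabled in billing,
# its costs can be grouped by project. The instance ID and tag values are
# placeholders, not our real naming scheme.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[
        {"Key": "project", "Value": "example-project"},
        {"Key": "environment", "Value": "production"},
    ],
)
```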

We’d love to hear your AWS tips! Please do comment below if you’ve got any to share.
