Adam Horwich

spot the difference

sempre vive

In my last article (Hadoop in the Cloud) I talked about ways to optimise your own Hadoop cluster in AWS. Part of this was a dabble into the lucrative and fascinating Spot Instance pricing market. Today I’m going to be delving further into this, and reflecting on its practicalities.

[Chart: Spot Instance pricing over the last week]

now a warning?

They all die, eventually. I should start by saying: know your use case! Spot Instances are not for everyone. They're certainly not for databases, or for services with stateful sessions. These instances will be terminated without mercy if your bid price is exceeded, and the market has a tendency to be rather volatile. They're good for services which are already designed with fault tolerance in mind, and great alongside existing, robust deployments to provide additional capacity when load increases.

We saw Spot Instances as a great fit for Hadoop: a low-cost way to bolster resources when jobs take a long time to complete. The type of job is very important though! If you're not expecting your Spot Instances to be long lived (because your bidding strategy is to get them at their lowest price), then running reduce tasks on those TaskTrackers isn't the best idea. What we've also found is that when you run jobs which require a full sort over a lot of data, you're going to have long-lived reducers that need all of the map output while they run. That's not such a good thing when your reducer takes several hours to complete; there's a high likelihood that the map output you've generated on a Spot Instance will go away before then.

Sadly there's no concept of shared storage for intermediate map output: it's all stored locally rather than on HDFS, and even if every node could reach it through a shared filesystem, the ownership model means the JobTracker simply re-issues the lost map tasks and discards the data. The only way for a job's map output to be written to HDFS is to set its number of reducers to 0. Here's a nice overview of the Map-Reduce flow.
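As an aside, if you do ever want map output written straight to HDFS, a map-only job is the way. Here's a minimal sketch using the old mapred API; the identity mapper and command-line paths are just placeholders, not anything from our cluster:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapOnlyJob.class);
            conf.setJobName("map-only-example");
            conf.setMapperClass(IdentityMapper.class);  // stand-in mapper; yours would do real work
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(0);                  // 0 reducers: map output lands on HDFS as the job output
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }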

i’m going to get a second opinion

That’s not to say they’re inappropriate for our use case. During stable times, they can happily contribute to the completion of map tasks, which in turn helps the reducers finish sooner. As long as the reduce task has completed, it’s not the end of the world if a Spot Instance TaskTracker is terminated: the missing map tasks will be reallocated, and the reduce output is still safe. So how do we help this happen? Well, it turns out we can tune when reduce tasks commence, and how many run at once.

you can’t raise an eyebrow without major surgery!

Oh gosh, yeah, so some of the Hadoop MapReduce defaults are pretty ill-suited to most cases (there’s a sketch of setting these per job after the list):

  • mapred.reduce.tasks.speculative.execution – Yeah, this defaults to true, which means multiple reduce slots can be occupied by copies of the same task and whichever completes first is crowned the winner; the loser is discarded. Great if you have an empty cluster doing nothing, or heterogeneous nodes, but most people don’t! Map speculative execution is more helpful though, and I’d recommend leaving it enabled.
  • mapred.reduce.slowstart.completed.maps – This wonderful little flag determines the fraction of map tasks that must complete before reducers are allowed to start. The default is 0.05 (5%). That’s incredibly low for most use cases and can leave starved reducers waiting hours for the remaining maps to complete. In our cluster, which uses pre-emption, it also means they’re often killed. A more appropriate default would be 0.5, and you may wish to tune individual jobs for better results.
  • mapred.submit.replication – The default is 10… which is really annoying when your cluster’s replication factor is nowhere near that and you’re left wondering where all these under-replicated warnings are coming from, or when you want to decommission a node!
  • mapred.running.reduce.limit – By default a job has no limit on the number of reduce slots it can occupy at once (the maximum being the number of reducers assigned to the job). If you don’t want a job hogging all your resources, especially when it’s going to sit waiting for map tasks to complete before achieving much, set this to something more appropriate.
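For reference, here’s a rough sketch of what setting these per job might look like in driver code. The values are just the ones discussed above plus sensible-sounding guesses, so treat them as illustrative rather than prescriptive:

    import org.apache.hadoop.mapred.JobConf;

    public class SpotFriendlyDefaults {
        // Hypothetical per-job tuning sketch; values are illustrative, not prescriptive.
        public static JobConf tune(JobConf conf) {
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // don't double-run reduce tasks
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // map speculation is worth keeping
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);       // wait for 50% of maps before reducers start
            conf.setInt("mapred.submit.replication", 3);                         // match your actual replication factor
            conf.setInt("mapred.running.reduce.limit", 10);                      // cap concurrent reduce slots for this job
            return conf;
        }
    }

The same properties can equally live in mapred-site.xml if you’d rather make them cluster-wide defaults.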

then what?

OK, so where do these tweaks get us? Well, they make Spot Instances much more attractive; we’re less worried now about the lifetime of a TaskTracker. So let’s now look at the other side of the equation: working with Spot Instances.

At the top of my post, I included a summary of the spot market pricing for the last week. You can see that the first half was highly volatile, while the second half is practically sedate. That’s the spot market for you! It’s a good thing to have seen, though, as now we know how bad it can get. Just typically, though, hours before everything stabilised we had moved our Spot Instance Auto-Scaling group to use the only stable zone we had at the time. Ideally, you’ll want to be launching TaskTrackers in all the zones you have data in, to minimise data transfer.
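If you’re scripting this rather than clicking around the console, a spot-bidding launch configuration plus a multi-zone Auto-Scaling group looks roughly like the sketch below with the AWS SDK for Java. The names, AMI, bid price and zones are made up for illustration:

    import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
    import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
    import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

    public class SpotTaskTrackerGroup {
        public static void main(String[] args) {
            AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient(); // default credentials chain

            // Launch configuration with a spot bid (AMI, names and bid price are hypothetical).
            autoScaling.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
                    .withLaunchConfigurationName("tasktracker-spot-lc")
                    .withImageId("ami-12345678")     // your TaskTracker AMI
                    .withInstanceType("m1.large")
                    .withSpotPrice("0.10"));         // bid in USD per hour

            // Group spanning every zone the cluster holds data in, to minimise cross-zone transfer.
            autoScaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                    .withAutoScalingGroupName("tasktracker-spot-asg")
                    .withLaunchConfigurationName("tasktracker-spot-lc")
                    .withAvailabilityZones("eu-west-1a", "eu-west-1b", "eu-west-1c")
                    .withMinSize(0)
                    .withMaxSize(4));
        }
    }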

check ok?

So in the first 10 days of running Spot Instances we found that our bidding strategy of ‘just above the lowest price’ (effectively gaining 4 m1.large instances for the price of 1) had the following qualities:

  • 120 launched Spot Instances in 10 days
  • Over 60% lasted less than 1 hour
  • Average lifespan: 2 hours, 13 minutes
  • Median lifespan: 24 minutes
  • 3-5am GMT was a good window to launch instances in
  • Instances were often priced out of the market between 9 and 10am GMT

Since the market has stabilised, we’ve seen instances launched on request, and have had no premature terminations.

and there’s something really wrong with your neck too

It’s been a fascinating dabble in the Spot market, but it’s not without its drawbacks. Like most of AWS’s off-piste tools, there’s little good documentation or integration with the EC2 Console. Integrating Spot Instances into Auto-Scaling groups is simple enough, but the launch mechanics are not the best. In my last post I mentioned that the request mechanism is a bit flawed, in that it sticks to the first randomly assigned zone it submits to if you have multiple zones in your configuration. We’re also finding that we want to build tools to manage the Auto-Scaling group configuration, so we can analyse the spot price history and make more informed decisions about where to launch new TaskTrackers, working around that ‘sticky zone’ issue.
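As a flavour of what that tooling might involve, here’s a hypothetical sketch (again using the AWS SDK for Java, which is an assumption rather than what we actually run) that pulls the last day of m1.large spot prices per zone; a real tool would aggregate these and pick the calmest, cheapest zone:

    import java.util.Date;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.DescribeSpotPriceHistoryRequest;
    import com.amazonaws.services.ec2.model.SpotPrice;

    public class SpotPriceSurvey {
        public static void main(String[] args) {
            AmazonEC2Client ec2 = new AmazonEC2Client(); // default credentials chain

            // Last 24 hours of m1.large Linux spot prices, across all zones in the region.
            Date now = new Date();
            Date dayAgo = new Date(now.getTime() - 24L * 60 * 60 * 1000);
            DescribeSpotPriceHistoryRequest request = new DescribeSpotPriceHistoryRequest()
                    .withInstanceTypes("m1.large")
                    .withProductDescriptions("Linux/UNIX")
                    .withStartTime(dayAgo)
                    .withEndTime(now);

            // Print each price point; aggregation and zone selection are left to the real tool.
            for (SpotPrice price : ec2.describeSpotPriceHistory(request).getSpotPriceHistory()) {
                System.out.println(price.getAvailabilityZone() + " "
                        + price.getTimestamp() + " $" + price.getSpotPrice());
            }
        }
    }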

What we’ll be looking at next, in our pursuit of a cost-efficient cluster, is job optimisation in a Spot Instance Hadoop cluster: identifying the monolithic reduce jobs and finding better ways of breaking them down.
