I’ve written recently about the work we’ve put in to make our infrastructure self-healing, self-scaling and generally automated in ways that simplify things for the team. In most cases, an unhealthy instance can have itself replaced with a working successor in about 5 to 10 minutes.
One area where that dream wasn’t fully realised was Mongo, because of complications around replicaset management. A replicaset is Mongo’s definition of its cluster: which node is the primary, which are secondaries, and which can be elected primary should the existing one have a problem.
Why is it tricky?
There are a few reasons why automating a Mongo recovery is different. Firstly, commands to modify the replicaset can only be performed on the active primary, so there’s nothing we can run on a newly created instance to add itself to the set.
Therefore we’ll need something that runs regularly on the primary, looks for new instances not yet in the replicaset, and adds them. We’ll want to make sure that what we’re adding is really a Mongo node (rather than just another instance), and also work out whether it should be hidden or not.
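To make that concrete, here is a minimal sketch of the "find what's missing and describe it" step, assuming the job already has the current replicaset config members and a list of candidate instances to hand. The helper names, the candidate hash shape and the port constant are hypothetical rather than taken from our actual code, and the step that applies the new config on the primary is deliberately left out.

```ruby
# Sketch only: helper names, candidate shape and port are assumptions.
MONGO_PORT = 27017

# Return the candidates whose host:port is not yet in the replicaset config.
def missing_candidates(config_members, candidates)
  known = config_members.map { |m| m['host'] }
  candidates.reject { |c| known.include?("#{c[:host]}:#{MONGO_PORT}") }
end

# Build member documents for the new nodes, continuing from the highest _id in use.
def new_member_docs(config_members, candidates)
  next_id = (config_members.map { |m| m['_id'] }.max || -1) + 1
  missing_candidates(config_members, candidates).each_with_index.map do |c, i|
    doc = { '_id' => next_id + i, 'host' => "#{c[:host]}:#{MONGO_PORT}" }
    # MongoDB requires hidden members to have priority 0.
    doc.merge!('hidden' => true, 'priority' => 0) if c[:hidden]
    doc
  end
end
```

Each returned hash has the shape of an entry in the `members` array of a replicaset config, so the job could append them to the existing config and submit a reconfigure, but only when it is running on the current primary.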
We use a single EC2 tag on instances that will form the replicaset. The tag is ‘Replicaset’ and takes a string name for the cluster, such as ‘atlas-mongo’. If the instance is to be hidden, we can set the tag value to ‘atlas-mongo,hidden’, or replace hidden with master where needed. We set $replicaSet in the Ruby code to match the string name for the cluster as set in the EC2 tags.
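As an illustration of how a tag value like that could be interpreted, here is a small sketch. Only `$replicaSet` and the tag format come from the description above; `parse_replicaset_tag` and the hash it returns are hypothetical, and fetching the tag from EC2 is not shown.

```ruby
# Matches the cluster name used in the EC2 tags (value from the example above).
$replicaSet = 'atlas-mongo'

# Hypothetical helper: split a 'Replicaset' tag value into cluster name and
# optional role. Returns nil when the tag is missing, empty, or names a
# different cluster, so non-Mongo instances are skipped.
def parse_replicaset_tag(value, cluster = $replicaSet)
  name, role = value.to_s.split(',', 2).map(&:strip)
  return nil unless name == cluster
  { name: name, role: (role.nil? || role.empty?) ? nil : role.downcase }
end
```

With this, `parse_replicaset_tag('atlas-mongo,hidden')` gives a role of `'hidden'`, a plain `'atlas-mongo'` gives a role of `nil`, and an instance tagged for another cluster (or not tagged at all) gives `nil` and can be ignored.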