Garry Wilson

getting etcd2 to play nicely

I wrote recently about making Kubernetes reliable, and all of the spinning parts needed to make the master instances do what’s required of them. This post is about the spinniest of those parts: etcd2.

I also wrote recently about how goalposts often move on big projects, without mentioning the irony that we’ve since abandoned our earlier plan to use Ansible and CentOS to manage our Kubernetes infrastructure.

keeping it simpler

Rather than using Ansible, we’re going to make things even simpler and use CoreOS as our Linux of choice. What sets it apart from CentOS and most others is that it offers the bare minimum required of an OS; it encourages you to run everything else as either Docker or Rocket (rkt) containers, the latter being the CoreOS team’s own container implementation.

The strict simplicity of the OS means it takes a bit more planning to get things going. The tricky thing about etcd is that it’s a clustered service that relies on knowing the IP addresses of the other master instances, yet we’re running in AWS with dynamic instances that autoscale.

I tried a number of ways to solve this problem: etcd’s discovery service (which didn’t handle instances coming and going) and an AWS CLI container to attach a fixed IP (handling networking in CoreOS via cloud-config is not fun), but ultimately found a great solution from the nice people at Monsanto. Yes, that Monsanto.

maintaining the masters

The container they’ve made publicly available, etcd-aws-cluster, checks the autoscaling group to find what other etcd masters exist. If there are none, it starts itself as the first seed; if there are others, it joins the existing cluster.

What makes it really nice is that it will also remove any nodes listed in etcd that are no longer in the ASG. It writes a single file, /etc/sysconfig/etcd-peers, with the configuration etcd2 needs.
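For a feel of what that looks like, the file is a handful of environment variables along these lines (the instance IDs and IPs below are made up):

    ETCD_NAME=i-0123456789abcdef0
    ETCD_INITIAL_CLUSTER_STATE=existing
    ETCD_INITIAL_CLUSTER=i-0123456789abcdef0=http://10.0.1.10:2380,i-0fedcba9876543210=http://10.0.2.11:2380

etcd2 then picks those up as its bootstrap configuration.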

So, all three masters are in a single autoscaling group (spread across 3 zones), and they contain the following user-data cloud-config to get etcd2 clustering properly:
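In outline, it looks something like this; the unit names and image tag below are illustrative rather than our exact config, and the instance profile needs to allow the autoscaling and EC2 describe calls the script makes:

    #cloud-config
    coreos:
      etcd2:
        advertise-client-urls: http://$private_ipv4:2379
        initial-advertise-peer-urls: http://$private_ipv4:2380
        listen-client-urls: http://0.0.0.0:2379
        listen-peer-urls: http://$private_ipv4:2380
      units:
        - name: etcd-peers.service
          command: start
          content: |
            [Unit]
            Description=Write /etc/sysconfig/etcd-peers from the autoscaling group
            [Service]
            Type=oneshot
            RemainAfterExit=true
            ExecStart=/usr/bin/rkt run --net=host --insecure-options=image \
              --volume=sysconfig,kind=host,source=/etc/sysconfig \
              --mount volume=sysconfig,target=/etc/sysconfig \
              docker://monsantoco/etcd-aws-cluster:latest
        - name: etcd2.service
          command: start
          drop-ins:
            - name: 30-etcd-peers.conf
              content: |
                [Unit]
                After=etcd-peers.service
                Requires=etcd-peers.service
                [Service]
                # pick up the peer list written by etcd-aws-cluster
                EnvironmentFile=/etc/sysconfig/etcd-peers

The etcd-peers unit runs once before etcd2 starts, drops the peer list into /etc/sysconfig/etcd-peers, and the drop-in feeds that file to etcd2 as environment variables.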

One thing to note here is that, although they’ve made it available as a Docker image, we’re running it via Rocket instead. The reason is that we’ll start the Kubernetes containers with Docker later on, but Docker relies on flannel (another CoreOS project), which in turn relies on etcd. There’s a lovely long chain of dependencies, which I’ll probably cover in a future post.

The Kubernetes workers, which proxy to the etcd2 masters, also use the etcd-aws-cluster container, but we pass in the environment variable PROXY_ASG=kubernetes-controllers (via rkt’s --set-env flag), so the script knows to look up the masters’ autoscaling group rather than that of the worker instances. We also drop the etcd2 block from the workers’ cloud-config.
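Concretely, the only change to the etcd-peers unit on the workers is the extra rkt flag; again a sketch rather than our verbatim config:

    ExecStart=/usr/bin/rkt run --net=host --insecure-options=image \
      --set-env=PROXY_ASG=kubernetes-controllers \
      --volume=sysconfig,kind=host,source=/etc/sysconfig \
      --mount volume=sysconfig,target=/etc/sysconfig \
      docker://monsantoco/etcd-aws-cluster:latest

With PROXY_ASG set, the peers file the script writes means etcd2 on the workers ends up proxying to the masters it finds in that group.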

Stay tuned for more fun and, at times, frustrating developments in getting Kubernetes ready for production traffic.

