Hunting for bugs is never fun, but hunting for network bugs is among the most frustrating, because it is so hard to get more information about the error. Recently I hit a bug that was not only a network problem but an intermittent one.
This particular bug was a situation where some pods in Kubernetes would start up with no networking. Everything else about the pod would be fine; only the network was missing. The pods would fail their healthcheck and restart, yet still come back without network. Which pods were affected was always random.
the hunt begins
My first thought was that this was happening on certain hosts, so I created a Kubernetes daemonset to run a container that checked its own networking on every host in our cluster. This revealed that the problem was not tied to specific hosts, but it did show an interesting trend: when I brought up the daemonset across our 12 nodes and 3 controllers, there would almost always be at least one broken pod. Restarting the flannel networking service on the node usually fixed the issue but didn't explain why it was happening.
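A minimal sketch of such a checker daemonset (the name, labels, image and probe command here are illustrative, not our exact manifest; the API version reflects Kubernetes of that era):

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: net-check
spec:
  template:
    metadata:
      labels:
        app: net-check
    spec:
      containers:
      - name: net-check
        image: busybox
        # Repeatedly probe an external address. A pod that came up without
        # networking fails immediately and keeps logging the failure, so
        # broken pods are easy to spot with `kubectl logs`.
        command:
        - /bin/sh
        - -c
        - |
          while true; do
            wget -q -O /dev/null http://example.com || echo "network broken"
            sleep 30
          done
```

Because a daemonset schedules one pod per node, any node with broken pod networking shows up as a pod stuck logging failures.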
hark companions of the hunt
Armed with this knowledge, I went off to the sig-network channel on the Kubernetes Slack. After some back and forth there, I was pointed to a GitHub bug, and lo and behold, down the page was this:
CoreOS Beta (1153.4.0). That was the exact version of CoreOS we were running. The bug reports showed that systemd-networkd was trying to manage all the networks on the host. This created a race condition: Docker would set up a bridge, and depending on whether Docker or systemd-networkd configured it last, the bridge would either function properly or have part of its configuration removed.
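One way to see the race on an affected node is to check which daemon currently owns the bridge (a diagnostic sketch to run on the host itself, not in a pod):

```shell
# If systemd-networkd lists docker0 as managed/configured rather than
# "unmanaged", it has taken ownership of an interface Docker expects
# to control.
networkctl status docker0

# Compare the bridge's actual address with the subnet Docker thinks it
# configured; a missing address here while Docker still reports a subnet
# is the symptom of losing the race.
ip addr show docker0
docker network inspect bridge --format '{{(index .IPAM.Config 0).Subnet}}'
```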
our prey in sight
With the cause identified, and after searching a bit more to see what others had done, I added the following to our userdata:
- name: 50-docker.network
  mask: true
- name: 50-docker-veth.network
  mask: true
- name: zz-default.network
  runtime: false
  content: |
    # default should not match virtual Docker/weave bridge/veth network interfaces
    [Match]
    Name=eth*

    [Network]
    DHCP=yes

    [DHCP]
    UseMTU=true
    UseDomains=true
This snippet configures systemd-networkd to leave the docker and veth networks alone, allowing Docker to configure them properly; the zz-default unit restricts networkd's DHCP management to the physical eth* interfaces. After a rolling restart of all the servers in our cluster, the problem was gone.
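To confirm the fix on each node after the restart, you can check that the Docker interfaces are no longer under networkd's control (a verification sketch; "net-check" is a hypothetical name for the checker daemonset described earlier):

```shell
# The docker bridge and veth interfaces should now show as "unmanaged":
networkctl list | grep -E 'docker0|veth'

# And every pod in the checker daemonset should be running cleanly:
kubectl get pods -l app=net-check -o wide
```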