Luke Hopkins

hunting for bugs

Hunting for bugs is never a fun affair, but hunting for network bugs is among the most frustrating, because it is so hard to get any more information about the error than the failure itself. Recently, I had a bug that was not only a network one but an intermittent one as well.

the prey

This particular bug saw some pods in our Kubernetes cluster start up with no networking. Everything else about the pod would be fine; only the network was missing. The pods would fail their healthcheck and restart, yet still come back without a network, and which pods were affected appeared to be completely random.

the hunt begins

My first thought was that this was happening on certain hosts, so I created a Kubernetes DaemonSet to run a container that checked its own networking on every host in our cluster. This revealed that the problem was not tied to specific hosts, but it did show an interesting trend: whenever I brought the DaemonSet up across our 12 nodes and 3 controllers, there would almost always be at least one broken pod. Restarting the flannel networking service on the node usually fixed the issue, but it did not explain why it was happening.
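The check itself was nothing clever: a pod on every node that keeps trying to reach the outside world and crashes as soon as it can't, so broken nodes show up as restarting pods. A minimal sketch of that kind of DaemonSet is below; the image, the target address and the exact check are illustrative rather than exactly what we ran.

    apiVersion: extensions/v1beta1   # the DaemonSet API group of that era; apps/v1 on current clusters
    kind: DaemonSet
    metadata:
      name: net-check
    spec:
      template:
        metadata:
          labels:
            app: net-check
        spec:
          containers:
            - name: net-check
              image: busybox   # anything with ping in it will do
              # keep pinging a known-good address and exit non-zero the moment it fails,
              # so pods on nodes with broken networking end up in CrashLoopBackOff
              command:
                - sh
                - "-c"
                - "while ping -c 1 -W 2 8.8.8.8; do sleep 10; done; exit 1"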

hark companions of the hunt

Armed with this knowledge I went off to the sig-network channel on the Kubernetes Slack. After some back and forth there, I was pointed to a GitHub bug and, lo and behold, down the page was a mention of CoreOS Beta (1153.4.0), the exact version of CoreOS we were running. Reading through the bug reports, it turned out that systemd-networkd was trying to manage all of the networks on the host. This created a race condition: Docker would set up a bridge, and depending on whether Docker or systemd-networkd configured it last, the bridge would either function properly or have its configuration removed.
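One extra detail makes the naming in the fix below clearer: the OS image ships a catch-all default networkd config, the zz-default.network that gets overridden further down, which matches every interface and tries to DHCP it. From memory it looks roughly like the following, so treat this as an illustration rather than a verbatim copy of the file.

    [Match]
    Name=*

    [Network]
    DHCP=yes

Because docker0 and the container veth pairs match that wildcard, systemd-networkd would reconfigure them after Docker had already set them up, and whichever of the two ran last won.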

our prey in sight

With the culprit identified, and after searching a bit more to see how others had dealt with it, I added the following to our userdata.

    # mask out any networkd configuration for the Docker bridge and veth interfaces
    - name: 50-docker.network
      mask: true
    - name: 50-docker-veth.network
      mask: true
    # replace the default config so networkd only matches the physical eth* interfaces
    - name: zz-default.network
      runtime: false
      content: |
        # default should not match virtual Docker/weave bridge/veth network interfaces
        [Match]
        Name=eth*
        [Network]
        DHCP=yes
        [DHCP]
        UseMTU=true
        UseDomains=true

This snippet stops systemd-networkd from touching the Docker bridge and veth interfaces, leaving them to be configured by Docker, while still running DHCP on the physical eth* interfaces. After a rolling restart of all the servers in our cluster, the problem was eliminated.
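For completeness, that fragment does not stand on its own: in our case it slots into the coreos.units list of the cloud-config userdata alongside the other unit definitions, roughly as below (trimmed, and assuming cloud-config style userdata rather than Ignition).

    #cloud-config
    coreos:
      units:
        # ... etcd, flanneld, kubelet and the other service units elided ...
        - name: 50-docker.network
          mask: true
        # ... the remaining entries from the snippet above ...

The zz-default.network name is also deliberate: a networkd config in /etc/systemd/network with the same name as one shipped by the OS takes precedence over it, so this swaps out the stock catch-all default outright rather than adding yet another config alongside it.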
