Tim Spurling

failure is the only option

Another human once tried to help me by tidying up my mugs. However, they put them somewhere I wasn’t expecting, and my next cup of tea was therefore less satisfying as I was unable to find anything that would hold a whole pint.

What is my point? Attempts to be helpful, however well-intentioned, can easily be worse than doing nothing.

Nice software developers, like many other nice people, are always trying to help—and sadly in many cases the result is surprise and confusion.

To be honest, I usually find errors to be the most helpful possible behaviour. Errors are very nice. Errors are very clear. Nice, clear, immediate and unambiguous.

The sad fact is that you can be more certain with an error than with a working application. At least with an error, the situation can’t suddenly and unexpectedly get worse!

first (worst) example

Hiera is a tool for Puppet which allows variables to be defined based on a hierarchy of config files. Leaving out whether I think this is even a good idea to start with, it does generally do this quite well; but there was an issue with the old version we were accidentally using.

It has one feature called “deep merging” which allows a hash to be built by combining/overwriting keys from multiple files. This is great! We can define a server-options map for a service, and override specific subkeys for different environments (e.g. staging). However… to quote the documentation:

You must install the deep_merge Ruby gem for deep merges to work. If it isn’t available, Hiera will fall back to the default native merge behavior.

Isn’t that wonderful?! If this external installation of a gem somehow went wrong on a Puppet master, it would (with only a server-side warning) produce inconsistent and incorrect configuration. Suddenly, the application receiving the config would behave strangely. It might connect to the wrong database. It might do its threading differently. It might start doing one tiny catastrophic thing, several days later. It would depend on which piece of config exactly got mangled.

Here, an error is an obviously better result. A nice straightforward “you’ve asked for deep merging but I can’t do it”. Catalog builds fail, problem gets fixed immediately; very little fuss. Luckily this is exactly what newer versions of Hiera do.

What made the situation particularly bizarre, though, is that this was not just a coincidental default—someone deliberately and specifically added some code to rescue the LoadError from the missing gem and thus mask the problem.
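To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not Hiera’s actual Ruby code: the function names, the `deep_merge_available` flag (standing in for “is the gem installed?”) and the example config keys are all hypothetical, and the shallow merge is only an approximation of the native behaviour.

```python
def native_merge(base, override):
    """Shallow merge: a top-level key in override replaces the whole value."""
    merged = dict(base)
    merged.update(override)
    return merged


def deep_merge(base, override):
    """Recursive merge: nested dicts are combined key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


class DeepMergeUnavailable(RuntimeError):
    """Raised when deep merging was requested but cannot be performed."""


def merge_with_fallback(base, override, deep_merge_available):
    # The "helpful" behaviour: quietly do something different instead.
    if deep_merge_available:
        return deep_merge(base, override)
    return native_merge(base, override)


def merge_or_fail(base, override, deep_merge_available):
    # The fail-fast behaviour: refuse to produce a config we cannot vouch for.
    if not deep_merge_available:
        raise DeepMergeUnavailable(
            "deep merging was requested but is not available; "
            "install it or choose a different merge behaviour"
        )
    return deep_merge(base, override)


# Hypothetical hierarchy: common settings plus a staging override.
common = {"server_options": {"db_host": "db.internal", "threads": 8}}
staging = {"server_options": {"threads": 2}}

# With the fallback, db_host silently disappears from the result.
print(merge_with_fallback(common, staging, deep_merge_available=False))

try:
    merge_or_fail(common, staging, deep_merge_available=False)
except DeepMergeUnavailable as error:
    print("catalog build failed:", error)
```

The fallback version returns something that looks like a perfectly valid config, which is exactly why nobody notices until the application misbehaves.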

another example

Recently I’ve been replacing our API server deployments with a new approach based on Docker images, transforming dependency provision and service configuration from problems configuring the hosts to more-easily-controlled problems building the images.

In general this seems to work well, but there are still cases where the state of the host unpredictably affects the image’s behaviour—in this specific case, another manifestation of the same problem with an error being hidden.

This one was a bit strange. I started up the API container as normal, but it was unable to connect to the Mongo database it needed, waiting forever and timing out. Obviously a security group problem. Except that everything was configured correctly. The logs did not say so explicitly, but it looked exactly like problems we’d had before, where DNS incorrectly resolved to external IPs and services were accessed over the wrong interface.

Testing this inside the container was fun.

$ sudo docker run --rm -it --entrypoint="/bin/bash" docker-repo.mbst.tv/jetty-atlas:latest -c "apt-get update && apt-get install dnsutils && dig atlas-mongo-whichever.mbst.tv"

[…package installation crap…]

; <<>> DiG 9.9.5-9-Debian <<>> atlas-mongo-whichever.mbst.tv
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32685
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
; atlas-mongo-whichever.mbst.tv.		IN	A

;; ANSWER SECTION:
atlas-mongo-whichever.mbst.tv.	300	IN	CNAME	ec2-54-154-190-190.eu-west-1.compute.amazonaws.com.
ec2-54-154-190-190.eu-west-1.compute.amazonaws.com. 60 IN A 54.154.190.190

;; Query time: 33 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Jun 23 10:12:38 UTC 2015
;; MSG SIZE  rcvd: 159

SERVER: 8.8.8.8. What.

It turns out the particular machine hosting this container happened to be configured with dnsmasq for an unrelated reason, and therefore had localhost as its default DNS server. Docker quite reasonably was unable to pass this config on to the container, as the loopback interface isn’t even accessible from within it—but rather than simply stating this, it automatically defaulted to use Google’s DNS server. I mean, there are probably cases where this is useful for someone doing something trivial not involving a private network, but I doubt it’s ever as useful as knowing about the broken configuration would be. And again, this was a choice that someone specifically made!

The fix was to add --dns="172.0.0.2" (Amazon’s EC2 DNS server) to the docker run command—a trivial fix, but only after half an hour of guesswork.
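For comparison, here is a rough Python sketch of what I would have preferred to happen: look at the host’s resolver config, and if nothing in it is usable from inside a container, say so instead of substituting a public resolver. This is not Docker’s code; `parse_nameservers`, `container_nameservers` and `UnusableResolverError` are hypothetical names, and the input is assumed to be ordinary resolv.conf-style text.

```python
import ipaddress


class UnusableResolverError(RuntimeError):
    """Raised when no host nameserver can be reached from a container."""


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def is_loopback(address):
    # Drop any IPv6 zone suffix (e.g. "%eth0") before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def container_nameservers(host_resolv_conf_text):
    """Pick the nameservers a container could use, or fail loudly."""
    servers = parse_nameservers(host_resolv_conf_text)
    usable = [server for server in servers if not is_loopback(server)]
    if not usable:
        raise UnusableResolverError(
            "host resolv.conf lists only loopback nameservers "
            f"({', '.join(servers) or 'none'}), which are unreachable from "
            "inside a container; pass a DNS server explicitly"
        )
    return usable


# A host running a local caching resolver such as dnsmasq.
host_config = "# managed by the host\nnameserver 127.0.0.1\n"

try:
    container_nameservers(host_config)
except UnusableResolverError as error:
    print("refusing to start:", error)
```

An error like this would have pointed straight at the `--dns` option, rather than leaving me to discover the substitution in the last lines of the dig output.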

summary

This whole problem seems to strongly resemble any other leaky abstraction. In each case, a tool’s developer has decided to abstract away the decision of how to handle an unexpected case, by providing a default, assuming that the desired behaviour is to do something, regardless of how correct that thing actually might be.

Defaults can at times be very helpful. For example, consider writing code that calls a function that makes an HTTP request. This function could default to being a GET request, to using no proxy, to being a blocking call (or not). In each case, the default saves the pain of manually repeating all these instructions-to-be-unsurprising—and crucially, in a sane API, the logic determining the default behaviour is contained completely within the function. The external system can’t influence its decision, and its behaviour will never suddenly change unless the calling code explicitly requests it (or a clear version change occurs).

In this case, though, the defaults are unexpected, and based on a hidden condition. The abstraction is not complete; it is not clean. It leaks. Suddenly the user of the functionality must understand not only the condition of failure, but the whole problem it was initially trying to hide—and must in fact find and understand the problem without help from the system that decided to hide it!
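The contrast between the two kinds of default can be sketched in a few lines of Python. Everything here is made up for illustration: the `Request` type, both functions, and the `EXAMPLE_PROXY` variable do not belong to any real HTTP library.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    url: str
    method: str
    proxy: Optional[str]
    blocking: bool


def build_request(url, method="GET", proxy=None, blocking=True):
    """Clean defaults: all visible in the signature, dependent on nothing else."""
    return Request(url, method, proxy, blocking)


def build_request_leaky(url, method="GET", blocking=True, environ=None):
    """Leaky default: the proxy is decided by state outside the function."""
    env = os.environ if environ is None else environ
    # Hypothetical variable: if the host happens to set it, behaviour changes
    # without the calling code changing at all.
    return Request(url, method, env.get("EXAMPLE_PROXY"), blocking)


# The same call, on two differently configured hosts.
print(build_request_leaky("http://example.invalid/", environ={}))
print(build_request_leaky(
    "http://example.invalid/",
    environ={"EXAMPLE_PROXY": "http://proxy.invalid:3128"},
))
```

The first function can only surprise you if you change the call or upgrade the library. The second can surprise you whenever someone touches the machine it runs on.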

“Failing fast” would be preferable, with a good error message to prompt the user’s own decision by pointing to clear documentation of the problem.
