Garry Wilson

a rule by any other name

One of the sticking points we occasionally have at MetaBroadcast is with support and, specifically, the alerts that whoever is on support should monitor.

As mentioned before, we use PagerDuty to notify us when we have an alert, and Sensu is the main place where we generate them.

keeping support as pleasant as possible

Things go wrong when we want to add lots of logic about who gets alerted, and when. Is this alert about something complex? Is it a first line issue? Should we notify them today, or is it a bank holiday? Are they likely to be asleep right now?

Much as we want everything to be running as well as possible at all times, being on support shouldn’t feel like a life sentence. Only the most critical alerts should be received outside of someone’s office hours.

being specific

The diagram above, referenced in the blog post earlier, is the way in which we determine the rules an alert should follow. The problem is, it’s not the easiest thing to remember, and isn’t as specific as it could be.

Is a distant doom important? More important than a minor doom? What are waking and core hours? The table comes from an engineer’s point of view, rather than that of someone on support. It’ll be easier to clarify how the alerts should work if we flip it to look at how we want to receive them.

stick to the basics

The table covers two basic points: whether an alert is first line or second line, and during which times it should be received. Introducing ‘support’, ‘devops’, ‘simple’, ‘complex’, ‘major’, ‘minor’, ‘impending’ and ‘distant’ adds terms we then have to remember separately.

Instead, let’s describe them in terms we actually use: first line, second line; core hours/waking hours/anytime. Everyone on support will be familiar with these, and they also relate to the SLAs our clients may be familiar with.
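To make the naming concrete, here is a rough sketch of how a rule named this way could be modelled. It is illustrative only, not our actual Sensu or PagerDuty configuration: the `Rule` class, the `should_notify` helper and, in particular, the hour boundaries in `WINDOWS` are all hypothetical, as is the assumption that core hours exclude weekends and bank holidays.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import Enum


class Line(Enum):
    FIRST = "first line"
    SECOND = "second line"


class Hours(Enum):
    CORE = "core hours"
    WAKING = "waking hours"
    ANYTIME = "anytime"


# Hypothetical (start, end) windows; the real boundaries are not stated here.
WINDOWS = {
    Hours.CORE: (time(10, 0), time(18, 0)),
    Hours.WAKING: (time(8, 0), time(22, 0)),
}


@dataclass(frozen=True)
class Rule:
    line: Line
    hours: Hours

    @property
    def name(self) -> str:
        # The rule's name is just the two familiar terms joined together.
        return f"{self.line.value}, {self.hours.value}"


def should_notify(rule: Rule, now: datetime, bank_holiday: bool = False) -> bool:
    """Return True if an alert with this rule should be sent at `now`."""
    if rule.hours is Hours.ANYTIME:
        return True
    start, end = WINDOWS[rule.hours]
    in_window = start <= now.time() < end
    if rule.hours is Hours.CORE:
        # Assumption: core hours only apply Monday to Friday, excluding bank holidays.
        return in_window and now.weekday() < 5 and not bank_holiday
    return in_window
```

With this shape, a rule reads the way someone on support would say it, for example `Rule(Line.FIRST, Hours.CORE).name` gives "first line, core hours", and nobody needs to remember what level of doom maps to which window.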

making it clearer

[Image: screenshot of the revised alert table]

The table above uses those familiar terms, and makes it clear where an alert will go, and when. It’s quicker to parse and doesn’t require an understanding of what level of doom was assigned to an alert by whoever created it originally. This new naming will be put in place everywhere soon (more on Alertmanager to follow), to make sure things stay consistent!
