Jamie Perkins

graph all the things!

We currently have Graphite as the centrepiece of our metrics architecture, with a few bits of supporting software like syslog-ng and KairosDB for aggregation and log parsing where needed.

Graphite as a data store isn’t terrible, but it has to be said that the web UI is horrific. We recently came across Grafana, a UI that uses Graphite’s API to retrieve metrics and does a far better job of displaying them.
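To give a feel for what "uses Graphite's API" means, here is a minimal sketch of building a request against Graphite's `/render` endpoint, which returns the data points for a target as JSON. The host name and the metric path are made up for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical Graphite host; replace with your own graphite-web address.
GRAPHITE_URL = "http://graphite.example.com"


def render_url(targets, time_from="-1h", until="now", fmt="json"):
    """Build a Graphite /render URL for one or more target expressions."""
    # Graphite accepts the "target" parameter multiple times, one per series.
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("until", until), ("format", fmt)]
    return f"{GRAPHITE_URL}/render?{urlencode(params)}"


# Hypothetical metric path, for illustration only.
url = render_url(["servers.web01.cpu.user"])
print(url)
```

A dashboard panel is, roughly speaking, a saved set of targets plus a time range, so each graph boils down to requests like this one.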

It’s based on Kibana (hence the name), the UI used in the ElasticSearch, Logstash and Kibana (ELK) stack, which we are also currently looking at deploying to replace large parts of our metrics architecture. Here’s a very early work in progress…

Grafana doesn’t actually do anything special: it does no aggregation and is entirely client-side, apart from storing dashboard configuration in ElasticSearch. All of the magic is still done by Graphite, which has a large number of functions and aggregations available. It wasn’t until I saw them in Grafana that I even knew Graphite could do all that.
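As a rough illustration of those functions, the sketch below composes a Graphite target expression from nested function calls. `sumSeries`, `movingAverage` and `alias` are real Graphite functions; the `fn` helper and the metric path are hypothetical and exist only for this example.

```python
def fn(name, *args):
    """Render a Graphite function call, e.g. sumSeries(a.b.*)."""
    return f"{name}({','.join(str(a) for a in args)})"


# Hypothetical metric path: one request counter per application.
requests = "apps.*.requests.count"

# Sum every matching series, smooth the result over 10 data points,
# then give the series a readable legend name.
target = fn(
    "alias",
    fn("movingAverage", fn("sumSeries", requests), 10),
    '"total requests"',
)
print(target)
```

The resulting string is what goes into the `target` parameter of a render request, so all of the aggregation happens on the Graphite side.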

Another cool little thing is Grafana’s annotations. It can query an ElasticSearch index for events occurring over the same time period as your metric data points and plot them on the graph. We have installed a plugin in our CI server of choice (Jenkins) that writes those events every time a build or deploy runs.

It’s rather useful as it gives us an idea of how new code and features are impacting performance.
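To make the annotation idea concrete, here is a small sketch of what a deploy event might look like and how events get matched to a graph's time range. The field names, job name and timestamps are all assumptions for illustration; this is not the document format the Jenkins plugin writes, and the filter simply mimics in plain Python the kind of time-range selection that the ElasticSearch query would do.

```python
import json
from datetime import datetime, timezone


def deploy_event(job, build_number, when):
    """Build a JSON-serialisable event document (field names are assumed)."""
    return {
        "@timestamp": when.isoformat(),
        "type": "deploy",
        "job": job,
        "build": build_number,
    }


def events_in_range(events, start, end):
    """Return the events whose timestamp falls inside [start, end]."""
    return [
        e for e in events
        if start <= datetime.fromisoformat(e["@timestamp"]) <= end
    ]


# Made-up events for a hypothetical "api-deploy" job.
events = [
    deploy_event("api-deploy", 41, datetime(2014, 5, 1, 9, 0, tzinfo=timezone.utc)),
    deploy_event("api-deploy", 42, datetime(2014, 5, 1, 12, 30, tzinfo=timezone.utc)),
]

# The time range currently shown on the graph.
window_start = datetime(2014, 5, 1, 12, 0, tzinfo=timezone.utc)
window_end = datetime(2014, 5, 1, 13, 0, tzinfo=timezone.utc)

visible = events_in_range(events, window_start, window_end)
print(json.dumps(visible, indent=2))
```

Only the second event falls inside the window, so it is the only one that would be drawn as a marker on that graph.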

The plan is to make a few of these dashboards and put them up on TVs around the office. Deciding what is important to show and how to display it is an interesting little project. We’d like two categories of dashboard. The first is for engineers to use when monitoring and debugging issues. For example, we already have one that displays the consumer offsets and topic size of Atlas Deer’s Kafka queues, which is rather useful when evaluating how well it’s processing messages from other systems.
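The number that dashboard really lets us eyeball is consumer lag: how far the consumer's offset trails the end of each partition. A minimal sketch, assuming "topic size" means the latest offset per partition and using made-up offsets:

```python
def consumer_lag(log_end_offsets, consumer_offsets):
    """Per-partition lag: messages written but not yet consumed."""
    return {
        # Assumption: a partition with no recorded consumer offset counts as
        # unread from offset 0. Lag is clamped so it never goes negative.
        partition: max(end - consumer_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Made-up offsets for a three-partition topic.
log_end = {0: 1500, 1: 1420, 2: 1610}
consumed = {0: 1500, 1: 1400, 2: 1300}

lag = consumer_lag(log_end, consumed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A lag that stays flat or shrinks means the consumer is keeping up; a lag that keeps growing means messages are arriving faster than they are being processed.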

The other flavour of dashboard will be “big picture” metrics such as response time means and 99th percentiles of our client-facing APIs, both of which are important numbers for SLAs.
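A quick sketch of why we want both numbers: the mean hides a slow tail that the 99th percentile exposes. The response times below are made up, and the nearest-rank method used here is just one common way of defining a percentile, not necessarily the one our metrics pipeline uses.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Made-up response times in milliseconds: 98 fast requests and 2 slow ones.
response_times_ms = [20] * 98 + [900, 1500]

mean_ms = mean(response_times_ms)
p99_ms = percentile(response_times_ms, 99)
print(mean_ms, p99_ms)
```

Here the mean looks perfectly healthy while the 99th percentile shows that some clients are waiting close to a second, which is exactly the kind of thing an SLA cares about.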

Ultimately the dream is to not have to look at Graphite’s face ever again and hopefully make our lives a little easier by catching issues sooner rather than later. Thanks for reading!
