Sergio Bossa

logfile lovin: marrying nimrod and nagios for software visibility

At MetaBroadcast, we’ve recently started making internal visibility of our software systems awesome by gathering extended real-time metrics from production systems and processing them with the aim of:

  • Tracking and guaranteeing SLA achievements
  • Monitoring systems to discover problems as early as possible
  • Providing developers with a way to easily and unobtrusively add monitoring hooks to their software

This implies tracking and processing both infrastructure metrics, that is, information coming from software systems’ lifecycle and operations (such as uptime and response time), and business metrics, that is, any kind of application-specific information (such as the number of active users or the number of updates). We wanted to do that with one firm principle in mind: metrics gathering and processing should never affect application correctness, robustness or performance.

So what is the most unobtrusive and reliable way to get metrics out of your application? The answer is simpler than you might think: log files. That’s why we chose to build a metrics processing pipeline based on log processing, and here’s an introduction to how we’re building it, showing the building blocks we used to improve visibility of our Voila service.

requirements

Our Voila infrastructure is made up of Jetty-powered application servers with Apache web servers in front of them. The requirement was to measure the response time of every single request, assign each request to a specific group, and then track those response times in order to compute the following statistical measures:

  • 5-minute median and 99th percentile
  • 1-day median and 99th percentile
  • 1-month median

We chose to take request/response times from Apache logs, rather than from inside our Java applications, in order to take into account garbage collection (and other) pauses happening upon request receipt: the fairly standard way of computing response times from inside a Java application, by taking the start time and subtracting it from the end time, wouldn’t account for any pause happening after the request is received but before the start time is measured.
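As an aside, Apache can log the time taken to serve each request via the %D directive of mod_log_config, which reports it in microseconds. A minimal sketch of such a log format (the format name is just illustrative):

```apache
# Combined log format plus the time taken to serve the request (%D, in microseconds)
LogFormat "%h %l %u %t \"%r\" %>s %b %D" combined_with_time
```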

We had to address the following problems:

  • How to consolidate and gather multiple Apache logs in order to process them as a whole
  • How to extract log data and store it in such a way that statistics can be calculated
  • How to graph those statistics

And here’s how we solved them:

[architecture diagram: metrics at metabroadcast visibility]

distributing logs with syslog-ng

Initially, we planned to gather Apache logs and distribute them using Syslog, a universal tool every operations engineer knows. Unfortunately, Syslog only works over UDP, which doesn’t play nicely on EC2, so we replaced it with Syslog-NG, which supports TCP communication. First, we set up the Apache server to pipe logs through the UNIX logger utility, which would in turn output them to a file socket endpoint; here’s a snippet showing such a configuration:

CustomLog "|/usr/bin/logger -t 'apache' -u /your.log.socket" your.log.format

Then, we set up Syslog-NG on each Voila host to read logs from the file socket and send them to a remote log processing server; here’s a configuration snippet showing how to do that:

source s_apache { 
  unix-stream("/var/log/apache2/syslog-access.log.socket"
    max-connections(512)
    keep-alive(yes));
};
destination d_syslog {
  tcp("your.host" 
    port(host.port));
};
log { source(s_apache); destination(d_syslog); };

So now we have logs consolidated on a single host, and we’re ready to process them.

pre-processing logs with unix-style pipes & filters

On the centralised log processing host, we first set up Syslog-NG to listen for remote log sources and output everything through a UNIX pipe; here’s a configuration snippet showing how to do that:

source s_remote { 
  tcp(ip("0.0.0.0") 
    port(1000)); 
};
destination d_pipe { 
  pipe("your_unix_pipe"); 
};
log { source(s_remote); destination(d_pipe); };

Then, we wrote a simple bash script based on Sed to pre-process logs coming from the mentioned UNIX pipe, with the aim of extracting and grouping the relevant metrics and preparing them to be archived and further processed with statistical computations. We could have tailored the Apache logs to directly output metrics in the form we wanted to process them, but that would have led to a much more complicated Apache configuration, and would have coupled it to our metrics infrastructure.
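For illustration, here’s a minimal sketch of such a pre-processing step. The log layout and the group name are assumptions, not our actual script: it expects the response time as the last field of each line (e.g., from a LogFormat ending in %D), so the Sed expression would need adapting to your own format.

```shell
#!/bin/bash
# Sketch of the pre-processing step; names and log layout are assumptions.
# Reads Apache access-log lines whose last field is the response time and
# rewrites each into a Nimrod-style gauge line.

to_nimrod() {
  # $1: metric group to assign matching requests to
  # stdin: raw log lines; stdout: Nimrod-formatted lines
  local group="$1"
  local now
  now=$(($(date +%s) * 1000))  # millisecond timestamp (taken once per invocation in this sketch)
  sed -E "s|^.* ([0-9]+)\$|[nimrod][${now}][gauge][${group}][\1]|"
}

# In production this would read from the syslog-ng pipe and append to the
# file that Nimrod listens to, e.g.:
# to_nimrod single_content_single_id < your_unix_pipe >> nimrod.log
```

A real version would also route each request to its group by matching the request path, but the shape of the transformation stays the same.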

The output of this pre-processing phase, directed to a common log file, is our metric of interest (i.e., the response time), grouped, formatted and ready to be processed by the next building block!

storing and calculating statistics with nimrod

Nimrod is a metrics server based on log processing: it listens to log files, ingests properly formatted log lines, stores the contained metrics and computes statistics, making them available through a JSON-over-HTTP interface.

It is obviously a perfect fit for our metrics infrastructure, as we were able to plug it into our metrics processing pipeline by simply having it listen to the output file of the previous phase, executing statistical computations without affecting other systems. Here’s what a logged metric looks like (as produced by the pre-processing phase):

[nimrod][1332937323985][gauge][single_content_single_id][3931]

This log line contains the following information:

  • The timestamp at which the metric was produced
  • The metric type: in this case a gauge, that is, a number representing a measurement at a point in time
  • The metric name, that is, the grouping we were talking about before
  • The actual metric value: in our case, the response time

Once ingested and processed by Nimrod, the metric is stored and several statistical computations are applied over current and past values, such as the average and the previously mentioned median and percentiles. Now that we have our metrics stored, processed and published by Nimrod, we’re ready for the final step.

graphing with nagios

Nagios is a well-known monitoring and alerting tool, and we configured it to pull data from Nimrod, parse its JSON output and render it into nice, human-readable graphs. Here’s a sample query to pull the median and 99th percentile over the latest 5 minutes of metrics:

http://nimrod.host/logs/1/gauges/single_content_single_id/history/aggregate?
   age=300000&percentiles=99

And the related response:

{
  "time": {
    "from": 1332944588704,
    "to": 9223372036854776000
  },
  "size": 2253,
  "median": 5262,
  "percentiles": {
    "99th": {
      "systemtime": "2012-03-28T14:27:38UTC",
      "id": "single_content_single_id",
      "timestamp": 1332944857964,
      "gauge": 543652,
      "samples": 17684514,
      "gauge-average": 166809.1564758781,
      "gauge-variance": 8030.220410235391,
      "tags": null
    }
  }
}
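To give a flavour of the Nagios side, here’s a hedged sketch of what a check plugin along these lines might look like. The thresholds and the crude Sed-based JSON extraction are all assumptions for illustration, not our actual integration; the parsing simply assumes the "median" field appears on its own line, as in the sample response above.

```shell
#!/bin/bash
# Sketch of a Nagios-style check against Nimrod's aggregate endpoint.
# Thresholds and parsing are illustrative assumptions.

extract_median() {
  # stdin: Nimrod JSON aggregate response; stdout: the median value
  sed -nE 's/.*"median": ([0-9]+).*/\1/p' | head -n 1
}

check_median() {
  # $1: median value (ms), $2: warning threshold, $3: critical threshold
  # Prints a Nagios status line and returns the matching exit code.
  local median="$1" warn="$2" crit="$3"
  if [ "$median" -ge "$crit" ]; then
    echo "CRITICAL - median response time ${median}ms"; return 2
  elif [ "$median" -ge "$warn" ]; then
    echo "WARNING - median response time ${median}ms"; return 1
  fi
  echo "OK - median response time ${median}ms"; return 0
}

# Typical wiring (hypothetical host and thresholds):
# median=$(curl -s "http://nimrod.host/logs/1/gauges/single_content_single_id/history/aggregate?age=300000&percentiles=99" | extract_median)
# check_median "$median" 5000 20000
```

Nagios interprets the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL) and the status line, which is what makes a plain shell script like this enough to plug a new data source into an existing alerting setup.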

Which gets turned by Nagios into the following graph:
[graph: metrics at metabroadcast visibility]

is this open source?

Our whole metrics infrastructure is built on top of open source tools such as Syslog-NG, Sed and Nagios. We are also active contributors to the Nimrod metrics server and related projects, and we will soon open source the Nimrod-Nagios integration: so stay tuned!

acknowledgements

All the metrics processing infrastructure has been developed by Sergio Bossa and Adam Horwich, with help and applause from the entire MetaBroadcast team.
