Adam Horwich

sensual monitoring

Over the past few months I’ve been flirting with Sensu, created by the guys at Sonian as a monitoring replacement for the beleaguered ‘Industry Standard’ Nagios. This has been a gradual and methodical introduction, taking care to understand the application and integrate it seamlessly, overcoming some of the design issues Nagios has in the cloud. This week I’m going to talk about how we use Sensu, what it does well for us, and how we feel it could be improved.

if it ain’t broke

It’s not so much that Nagios is broke; it’s just bloody useless at all the things it tries to achieve. As a scheduler, it scales poorly: introducing new checks can mean they won’t be executed until 20-30 minutes after you’ve reloaded your Nagios config. We’re only running 839 checks on our infrastructure, but even that seems to be quite an issue for Nagios. To address this, Nagios can operate in a master/slave hierarchy, but being built on technologies with little persistence and no cluster logic, that just defers the problem. NRPE (Nagios Remote Plugin Executor) is no great solution either: it is heavily restrictive, and requires the Nagios server to talk directly to the NRPE client on each instance. I’m not even going to expend any energy on the design issues with the interface, and I’ve already talked about some of the extreme solutions needed to get Nagios to play well with cloud infrastructures. None of this paints Nagios as an adequate, modern DevOps tool.

i didn’t know i was looking for love

A month ago I posted a blog on visualising metrics, and at the heart of that was Sensu. It ensured that metric data could be routed from source to multiple destinations (if desired), and at low cost compared to Nagios-based solutions. I held back at the time on exploring Sensu for you, as we were still in the process of evaluating it. One month later, I’m now ready to start talking about it.

paradigm schmaradigm

The fundamental difference between Sensu and Nagios is that Sensu won’t ever execute checks on clients, nor will it talk to clients directly. Instead it makes extensive use of RabbitMQ, pushing a JSON check request onto a queue for a client to read from. Immediately you can see that we’re building a distributed application: the queue can be made resilient and maintained outside of the Sensu server. But that’s not all. Instead of relying on log file data to achieve ‘persistence’, Sensu uses Redis, which is fast and resilient. Sensu is also written entirely in Ruby, rather than the mix of C, Perl, and PHP that makes up the Nagios ecosystem, which makes it much easier to maintain and read! Also, given it’s a heated issue right now, it’s worth pointing out that Sensu makes no use of Rails or YAML, and no Sensu component is exposed to the outside world without locked-down Apache restrictions in place.
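
To give a feel for how loosely coupled the pieces are, here’s a minimal sketch of the server-side /etc/sensu/config.json (the hostnames and credentials are made up); a client only needs the rabbitmq section plus its own client definition to join in:

{
  "rabbitmq": {
    "host": "rabbitmq.internal.example.com",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "definitely-not-our-real-password"
  },
  "redis": {
    "host": "redis.internal.example.com",
    "port": 6379
  },
  "api": {
    "host": "localhost",
    "port": 4567
  }
}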

your ideas are intriguing to me, and i wish to subscribe to your newsletter

Sensu works on ‘subscriptions.’ These subscriptions are analogous to servicegroups in Nagios (the concept of hostgroups doesn’t explicitly exist). Clients define what they want to subscribe to, and the server publishes collections of checks against those subscriptions. A client can have many subscriptions, and a client can also have standalone checks which the server doesn’t know about (this is a very nice feature!). This configuration allows us to define the checks we want to perform on a service separately from which servers run said service. That’s a nice deployment win, as we can do funky things in Puppet to define which services are included on an instance, and at that point set which subscriptions it should have. This sits nicely apart from the Sensu configuration files. Unlike the hack required for Nagios, monitoring is configured at the puppet-run stage and doesn’t need to interact with AWS or any other external configuration source. Oh, and there’s no need to re-invent the wheel either: any check you’ve written for or use in Nagios can be dropped into a Sensu check command and it’ll work out of the box.
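
Subscription checks live on the server and list their ‘subscribers’ (you’ll see one later in this post); a standalone check, by contrast, lives entirely on the client and is scheduled locally. As a rough sketch (the check name, plugin and thresholds here are hypothetical), a standalone check dropped into /etc/sensu/conf.d on the client looks like this:

{
  "checks": {
    "check_local_app_queue": {
      "command": "/etc/sensu/plugins/check-app-queue.rb -w 100 -c 500",
      "interval": 60,
      "standalone": true,
      "handlers": [ "mailer" ]
    }
  }
}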

mastering the masterless puppet

We don’t implement a ‘puppet master’, because we don’t like the canonical dependencies and points of failure it creates. Instead we opt for a decentralised, GitHub-managed, local ‘puppet-run’ methodology, where each server takes care of itself. This is generally perceived as the ‘DevOps’ way of doing things, and a few people have blogged about similar setups if you’re interested in learning more. It ultimately requires a little extra finesse when it comes to configuring servers that only know about themselves.

class sensu-client {
    # Defined type: declare one 'role' per service running on the instance.
    # Each role deploys a folder of check definitions and a client.json
    # fragment declaring the subscriptions for that service.
    define role ($host_name = $::fqdn, $ip_address = $::ipaddress_eth0, $service_folder = "base", $service_environment = "detect") {
        $instance_role = $name

        # Deploy the folder of check configuration for this service
        file { "/etc/sensu/conf.d/${service_folder}":
            source  => "puppet:///modules/sensu-client/services/${service_folder}",
            recurse => true,
            purge   => true,
            require => Package["sensu"],
            notify  => Service["sensu-client"],
            before  => File["/etc/sensu/config.json"],
        }

        # Work out which environment this instance belongs to, either from
        # the contents of /etc/env or from the parameter passed in
        if ($service_environment == "detect") {
            $env_file = generate("/usr/bin/find", "/etc", "-wholename", "/etc/env")
            if ($env_file) {
                $inst_env_raw = generate("/bin/cat", "/etc/env")
                $inst_env = $inst_env_raw ? {
                    "infra\n" => 'base',
                    "stage\n" => 'stage',
                    "prod\n"  => 'base',
                    default   => 'stage'
                }
            } else {
                $inst_env = "stage"
            }
        } else {
            $inst_env = $service_environment
        }

        # Generate a client.json fragment per role; Sensu deep-merges these on startup
        file { "/etc/sensu/conf.d/client-${name}.json":
            content => template("sensu-client/client.json"),
            require => Package["sensu"],
            notify  => Service["sensu-client"],
        }
    }
}
Above is our sensu-client ‘role’ definition. It ensures a folder of checks is deployed locally to the instance (not mandatory for Sensu, as clients are passed the JSON check definition via the queue from the server), and that a client configuration file is generated per service.
{
  "client": {
    "name": "<%= host_name %>",
    "address": "<%= ip_address %>",
    "subscriptions": [ <% instance_role.each do |ir| %> "<%= ir -%>", <% end %> "<%= inst_env -%>" ]
  }
}
Above is the client.json template file in Puppet. Below is how we include the definition in our server role manifests:
sensu-client::role { "mongo": service_folder => "mongo" }

i’ll deep merge you!

For quite a while we were concerned about how we’d get all these service definitions onto an instance that runs many different things. We suspected the config files would simply overwrite each other’s object definitions, and then there’s the fact that Puppet dislikes having a file or folder resource defined more than once. The latter we worked around by defining stage and prod subscription folders in Puppet and including them that way; the former was solved by a fortunate design choice from the guys at Sonian. When parsing its configuration files on startup, the Sensu client performs a ‘deep merge’, combining similar arrays and hashes. This means we can happily define multiple client.json configuration files, one per subscription, and they will all be merged together on startup.
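
As a rough illustration (the hostname, address and roles here are hypothetical), two fragments like these:

/etc/sensu/conf.d/client-mongo.json:
{
  "client": {
    "name": "app01.example.com",
    "address": "10.0.0.5",
    "subscriptions": [ "mongo", "base" ]
  }
}

/etc/sensu/conf.d/client-nginx.json:
{
  "client": {
    "name": "app01.example.com",
    "address": "10.0.0.5",
    "subscriptions": [ "nginx", "base" ]
  }
}

end up behaving as a single client definition whose subscriptions array contains "mongo", "nginx" and "base".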

who needs documentation when the checks document themselves

A great feature, and one we’ve enhanced in our integration of Sensu with PagerDuty, is the ability to include service documentation in alerts. Information helpful to engineers or support teams responding to an alert can be passed via the alert itself. This works because Sensu passes an event JSON object to the handler, containing the check configuration, its results, and basic historical information.

{
  "checks": {
    "check_disks": {
      "notification": "Disk space is low",
      "handlers": ["mailer", "pagerduty"],
      "command": "/opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-disk.rb -w 75 -c 85",
      "interval": 300,
      "occurrences": 4,
      "subscribers": [ "base" ],
      "low_flap_threshold": 5,
      "high_flap_threshold": 25,
      "event_description": "The partitions noted in this alert have exceeded 85% capacity. Please refer to http://[WIKI]/Dealing+with+Disk+Space+Alerts for further information"
    }
  }
}

All we’ve done here is add an event_description field to the check JSON object. This gets passed to the client to be executed, and then the server reads the result off the queue and decides whether to pass it to the handler. Our handler then looks for this event_description field and includes the information provided, or a placeholder if unavailable. Within the official PagerDuty handler it’s as simple as updating the details parameter:

when 'create'
  Redphone::Pagerduty.trigger_incident(
    :service_key => settings['pagerduty']['api_key'],
    :incident_key => incident_key,
    :description => description,
    :details => @event['check']['event_description'] || "PLACEHOLDER"
  )

skynet ain’t got nothing on us

The Sensu API is a nifty thing, and for our more dynamic instances, such as auto-scaling hosts in AWS, we have implemented a self-termination script. Instances remove themselves from Sensu as they shut down, eliminating the risk of false positive alerts. We do this in a very locked-down and controlled fashion, as we don’t want to over-expose the API. We’ve put an Apache proxy in front of the API which requires an authorised user and only permits the HTTP DELETE method, preventing client discovery requests. Essentially the only thing you can do, even if you gain access, is delete a single named instance from Sensu’s Redis DB. And even then, that’s not particularly significant: if the client is still running, its record will simply reappear the next time it submits check results over RabbitMQ.
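
As a minimal sketch of the idea (the proxy hostname and credentials below are made up, and the real script has rather more error handling), the shutdown hook boils down to a DELETE against the API’s /clients endpoint:

#!/opt/sensu/embedded/bin/ruby
# Deregister this instance from Sensu on shutdown (hypothetical endpoint and credentials)
require 'net/http'
require 'uri'

client_name = `hostname -f`.strip
uri = URI.parse("https://sensu-api.example.com/clients/#{client_name}")

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')

request = Net::HTTP::Delete.new(uri.request_uri)
request.basic_auth('sensu-deregister', 'not-the-real-password')

response = http.request(request)
puts "Deregistered #{client_name}: HTTP #{response.code}"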

it’s not all roses

Sensu is YOUNG. It’s great, but there’s a lot still to be done, and things change often. For example, the developers are currently looking to replace the sensu-dashboard with something better, which we’re totally happy with. So far we’ve had to implement some of our own changes, forking the existing code for a few things:

  • Filter stashed items: When you stash (acknowledge) an alert, it still remains on the ‘current events’ board. We implemented a checkbox to filter out stashed items so we only see actual current events, i.e. things we need to be aware of.
  • Suppression windows: Suppression currently happens at the check level rather than the handler level, which is less than ideal. We have use cases where we want alerts always sent to email, but only paged to PagerDuty during certain times of day. This is to be addressed in the next release (0.9.10); we’ve worked around it for now by including time-window checks in the PagerDuty handler we implement (a sketch of the idea follows this list).
  • No alert histories: We can’t see the history and frequency of previous alerts, which would be quite useful for trend analysis.
  • Metrics checks don’t work well with alert-based checks: Ideally we’d like to collect metrics and alert when the values received exceed a threshold. But there’s a compromise: either we don’t record metrics when the return code is 0, or we always pass metrics on to our alerting handlers and have to code around the issue there. Neither is ideal.
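
A minimal sketch of that time-window workaround, assuming a hypothetical page_window section in the handler’s pagerduty settings, looks something like this inside our forked handler:

# Only page PagerDuty during the configured window; outside it we bail out
# early and let the mailer handler carry the alert instead.
def within_paging_window?
  window = settings['pagerduty']['page_window'] || {}
  start_hour = window['start_hour'] || 8
  end_hour   = window['end_hour']   || 20
  hour = Time.now.hour
  hour >= start_hour && hour < end_hour
end

def handle
  unless within_paging_window?
    puts 'pagerduty -- outside paging window, not triggering an incident'
    return
  end
  # ... normal trigger/resolve logic from the official handler ...
end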

moving forwards

Thankfully, the team behind Sensu actively listen to their users and are readily available to discuss issues and work around problems. It may be a small community, with far fewer people working with it than Nagios, but we’re excited to be working with a technology that gets around many of the fundamental issues we’ve had with Nagios. With a few tweaks and customisations, we’re going to produce something that helps us understand our infrastructure and services better than ever before. Watch this space for further Sensu musings, tips, and tweaks.
