I’ve mentioned our slowly evolving Lambda-architecture project a number of times over the past six months, although mostly from a realtime-processing perspective. Today’s post is a bit different: I’m going to give more of an overview, and then dive into some of the ways in which the realtime code differs from the batch processing.
how’s it work then?
We ingest various queues of data via our Storm topology. This firstly labels the data with appropriate tags and annotations (e.g. which piece of Atlas content it relates to), then fires the data in two directions. The first stream carries on within Storm, being aggregated in some manner (more on that in a bit), and then stored in an in-memory store.
The second stream takes a slightly longer route. It goes via another queue into HBase, to be stored in a mighty table of raw data. From there, it’s picked up by Hadoop MapReduce jobs that aggregate the data, and store it back into HBase again.
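To make that fan-out concrete, here’s a minimal sketch of the idea in Python. The names and the annotation logic here are illustrative stand-ins, not our actual code: each event is tagged, then sent both down the in-memory realtime path and onto the raw-data queue bound for HBase.

```python
from collections import defaultdict

# Illustrative stand-ins for the two destinations: an in-memory
# realtime store of (content, second) -> count, and a queue feeding
# the raw HBase table for the batch side.
realtime_store = defaultdict(int)
raw_queue = []

def annotate(event):
    # Tag the event with the piece of content it relates to
    # (hypothetical lookup logic).
    event["content_id"] = event.get("uri", "unknown")
    return event

def ingest(event):
    event = annotate(event)
    # Path one: aggregate immediately, keyed by content and second.
    key = (event["content_id"], event["timestamp"])
    realtime_store[key] += 1
    # Path two: enqueue the raw event for batch processing later.
    raw_queue.append(event)

ingest({"uri": "item-1", "timestamp": 1000})
ingest({"uri": "item-1", "timestamp": 1000})
```

The point of the split is that the same event feeds both halves of the architecture: the realtime half counts it straight away, while the batch half keeps the raw record around to recount from scratch later.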
Finally, users can query the API, which will then query both Storm and Hadoop for results, combining them via some cunning wizardry (that we may cover another time) to produce lovely realtime stats that can go back as far as you want.
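The wizardry itself deserves its own post, but the general Lambda-style merge can be sketched: treat the batch results as authoritative up to the last completed batch run, and fill in the window after it from the realtime counts. This is a simplified illustration of the principle, not our actual merging logic.

```python
def combine(batch_counts, realtime_counts, last_batch_ts):
    # Batch results win up to (and including) the last completed batch
    # period; realtime covers everything after it.
    merged = {}
    for ts, count in batch_counts.items():
        if ts <= last_batch_ts:
            merged[ts] = count
    for ts, count in realtime_counts.items():
        if ts > last_batch_ts:
            merged[ts] = count
    return merged

batch = {1: 10, 2: 12}
realtime = {2: 11, 3: 4}  # realtime's view of period 2 is incomplete
print(combine(batch, realtime, last_batch_ts=2))  # {1: 10, 2: 12, 3: 4}
```

Note how realtime’s count for period 2 is discarded in favour of the batch value: the batch side has seen all the raw data for that period, so its answer is the trustworthy one.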
count all the things… in all the ways
We’ve tried our best to make this system as flexible as possible, with the intention of supporting all sorts of useful counting and aggregation across many different data types. At the moment we’re implementing the more basic aggregation types, and will work up to more complex methods as we flesh out the system. To that end, our main form of aggregation currently is to aggregate by time period: you can count how many events occurred in each second, over some time period. The whole system is designed to allow you to query any time granularity, from seconds to decades, and any time range. This allows you to make nice graphs of events per time period.
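As a rough illustration of that kind of period-based counting, here’s a sketch with made-up helper names, covering only a handful of the granularities the real system supports:

```python
from datetime import datetime, timezone

def bucket(ts, granularity):
    """Truncate an epoch-seconds timestamp to the start of its period.

    Only a few illustrative granularities are shown; the real system
    runs from seconds all the way up to decades.
    """
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    if granularity == "second":
        return dt.replace(microsecond=0)
    if granularity == "hour":
        return dt.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return dt.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "decade":
        return dt.replace(year=dt.year - dt.year % 10, month=1, day=1,
                          hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unknown granularity: %s" % granularity)

def count_per_period(timestamps, granularity):
    # Count events per period at the requested granularity.
    counts = {}
    for ts in timestamps:
        key = bucket(ts, granularity)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Querying the same event stream at "hour" rather than "second" simply coarsens the bucketing, which is what lets one store back graphs at any zoom level.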
The key to the Lambda Architecture is the combination of realtime and batch processing. The batch processing is designed to process large amounts of data (i.e. all data for all time), running at regular intervals. Realtime, however, is best suited to counting streams of data very fast, so can cover the window between the last batch run and the present moment. Thanks to the fundamental differences in the ways in which the two work, the counting and storage techniques used are very different.
Our realtime system of choice is, as mentioned earlier, Storm. We decided that as speed was of utmost importance, we would keep all the realtime data in memory, stored in an elaborate nested custom data structure, designed to make the required lookups extremely fast. We store data at the finest granularity, which, while somewhat expensive in terms of storage, is cheaper than storing all granularities (we’re only storing data for a relatively short period anyhow). Storm excels at aggregating streams of data, so to generate data for larger granularities, the aggregation is performed at query time.
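If only the finest-granularity (per-second) counts are held in memory, a coarser granularity can be derived at query time with simple bucketing. This is an illustrative sketch of the principle, not our actual nested data structure:

```python
def rollup(second_counts, period_seconds):
    """Aggregate per-second counts into coarser periods at query time.

    `second_counts` maps epoch second -> event count; the result maps
    the start of each coarser period -> total count for that period.
    """
    out = {}
    for ts, count in second_counts.items():
        period_start = ts - ts % period_seconds
        out[period_start] = out.get(period_start, 0) + count
    return out

per_second = {0: 1, 30: 2, 61: 5}
print(rollup(per_second, 60))  # {0: 3, 60: 5}
```

The trade-off is exactly as described above: the store stays small (one granularity), at the cost of a little arithmetic on each query, which is cheap when the data is already in memory.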
Meanwhile, the Hadoop batch processing is performed in an entirely different manner. Here, due to the sheer volume of data, we’re storing it on disc, in HBase. Regular MapReduce jobs trawl through that data, aggregating it and writing back out to a new results table. The difference from Storm is that we store all granularities, for two reasons. Firstly, disc storage is slower to access, so we don’t have the time to perform additional aggregations. Secondly, disc storage is cheaper than memory, so storing all granularities isn’t a massive hit (also, storing all time granularities—second through decade—is less than twice as expensive as storing just second-based counts, thanks to the reduction in the number of values stored at each level). The higher granularity counts are calculated from the lower granularity data, to avoid going back to the raw data each time. We also store the timestamp of the last complete period for each granularity, so that the jobs only have to process new data, rather than reprocessing all data upon every run.
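That level-by-level derivation might look something like the following sketch (illustrative names and plain dictionaries; the real jobs are MapReduce over HBase tables). Each coarser granularity is computed from the one below it, so the jobs never have to go back to the raw data.

```python
def next_granularity(counts, factor):
    """Derive counts at a coarser granularity from the level below it.

    `counts` maps period-start (epoch seconds) -> count at the finer
    level; `factor` is the coarser period length in seconds.
    """
    out = {}
    for ts, count in counts.items():
        bucket = ts - ts % factor
        out[bucket] = out.get(bucket, 0) + count
    return out

# Build minutes from seconds, then hours from minutes, and so on up.
seconds = {0: 2, 59: 1, 60: 4, 3600: 7}
minutes = next_granularity(seconds, 60)    # {0: 3, 60: 4, 3600: 7}
hours = next_granularity(minutes, 3600)    # {0: 7, 3600: 7}
```

Each level also shrinks dramatically (minutes hold 1/60th as many values as seconds, and so on), which is why storing every granularity from second to decade costs less than twice as much as the per-second counts alone.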
So there you have it—a grand overview of our new statistics system, and a bit of a look into how it works. I won’t pretend it’s been easy thus far (in fact, I may go into some of the issues we’ve encountered with the batch side of things next time), but it’s all starting to come together, and produce some really awesome realtime stats.