A short while back, I wrote about our plans for persistence of real-time data in our upcoming Lambda architecture statistics system. Essentially, we were persisting data in memory on our Storm nodes, but because Storm quietly restarts worker processes on any failure, our data was being reset on a fairly regular basis. Our initial plan was to move away from in-memory storage within Storm to storing all our real-time data in Cassandra. You can read more details about how and why we planned to move away in the aforementioned previous blog post.
After giving the change some extensive thought, we came up with a few rather pressing issues with moving real-time storage to Cassandra.
- Retrieval becomes much slower, as data is fetched from disc rather than from memory.
- How the data is stored becomes crucial, and depends on the type of query. Different types of queries might need different table structures for fast retrieval.
- DRPC actually does some really handy summing across time periods, which would have to be reimplemented if we were to move away from Storm for querying.
While any single one of these points might not have swung the vote against full-blown Cassandra persistence, the combination meant that the work was looking to be complex, and would quite probably lead to reduced query performance.
keep it simple
After agonising over this for a while, we came up with a compromise—keep what we had in terms of in-memory storage, but just dump the whole lot to disc (to be specific—a single Cassandra row) at regular intervals. Then, when the Storm process comes back up, restore from that persisted state, rather than starting again from zero. This solution, while perhaps inelegant, is simple, preserves almost all of our existing in-memory persistence code, and requires little dev work. It’s a quick way to make us resilient (in terms of data, at least) to those inevitable errors that creep through.
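The shape of the idea can be sketched in a few lines of Java. This is a minimal, illustrative version only: the class and method names are hypothetical, and the `store` map stands in for the single Cassandra row we actually write to. The core moves are the same, though—serialise the whole in-memory map as one blob on a timer, and reload it on construction so a restarted worker picks up where the old one left off.

```java
import java.io.*;
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical sketch: periodically snapshot in-memory counters to a single
// persisted blob (standing in for one Cassandra row), and restore on startup.
class SnapshotState {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
    private final Map<String, byte[]> store; // stand-in for the Cassandra row

    SnapshotState(Map<String, byte[]> store) {
        this.store = store;
        restore(); // on worker start, reload the last snapshot instead of starting from zero
    }

    void increment(String key, long by) {
        counts.merge(key, by, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    // Serialise the whole map and write it as one blob.
    void snapshot() {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new java.util.HashMap<>(counts));
            oos.flush();
            store.put("state", bos.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    private void restore() {
        byte[] blob = store.get("state");
        if (blob == null) return; // first ever start: nothing persisted yet
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            counts.putAll((Map<String, Long>) ois.readObject());
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Snapshot at a fixed interval; a crash then loses at most one interval of data.
    ScheduledFuture<?> scheduleSnapshots(ScheduledExecutorService ses, long periodSeconds) {
        return ses.scheduleAtFixedRate(this::snapshot, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```

The snapshot interval is the knob that trades write load against how much data a crash can cost you—ours is a couple of minutes.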
partition me, bro
The only slight hiccup to this plan is Storm’s habit of multi-threading everything. In our case, we were using a custom-written State, which Storm then parallelised across various worker processes. This complicates things somewhat, as instead of a single monolithic lump of data to save, there are n, where n is the number of threads that Storm has given that part of your data stream. Fortunately, Storm can be encouraged to use a consistent number of threads (via the parallelismHint method), and the method it uses for assigning data to threads is consistent, so as long as you keep your persistence parallelism constant, everything should be dandy. We’re also looking into detection of parallelism changes, so that we don’t immediately lose data if someone accidentally changes this parameter without realising the implications.
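Why constant parallelism matters, and how a change could be detected, can be shown with a small illustration. This is not Storm’s actual routing code—just a hypothetical helper demonstrating the usual hash-modulo scheme, where a key’s partition is stable only while the partition count stays the same; store the parallelism alongside the snapshots and refuse to restore under a different value.

```java
import java.util.Objects;

// Hypothetical helper (not the Storm API): keys are routed to partitions by
// hash modulo the parallelism, so the key-to-partition mapping only holds
// while the parallelism stays constant.
class PartitionCheck {
    static int partitionFor(String key, int parallelism) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Objects.hashCode(key), parallelism);
    }

    // Persist the parallelism with each snapshot; on restore, refuse to load
    // state written under a different partition count rather than silently
    // attributing data to the wrong partitions.
    static void checkParallelism(int persisted, int current) {
        if (persisted != current) {
            throw new IllegalStateException(
                "snapshot written with parallelism " + persisted
                + " but topology now uses " + current
                + "; keys would route to different partitions");
        }
    }
}
```

With a guard like this, an accidental change to the parallelism hint fails loudly at restore time instead of quietly scattering restored counts across the wrong threads.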
the proof is in the pudding
Or, in our case, how long our data is persisted for. From having our Storm data reset every two hours, we are now at a point where we can keep up to 3 days’ worth of data in Storm (or more if we so choose), and if Storm restarts a process, we lose a maximum of a couple of minutes of data. Not bad for a quick fix!
So here we are—older, wiser, and with a distinctly more resilient Storm real-time persistence layer than before. What have we learnt? Sometimes, the best answer is the simplest. When something isn’t working as planned, don’t leap to ditch all the existing code, but see if there’s a quick way to work around the problem. The chances are, it’s not as bad as you initially thought.