So, after working with Storm a fair bit recently, I thought I’d write about types in Storm, and why they’re good.
types: who needs ’em?
Recently, when adding a new data source to our Storm topology, I stumbled across a very simple bug. Each source of data produces tuples with a certain set of fields for each event, then all of these tuples are merged and persisted in an in-memory store for later querying. Everything ran fine, and I could see the tuples being persisted, but when I queried the store, there appeared to be no data. Much scratching of heads ensued, but no obvious solution was forthcoming.
To cut a long story short, the bug turned out to be related to Enums. Basically, the key to the data in the store included the value of an Enum. On the input side, this was passed into a tuple as an Enum. However, when querying, because the query was through Storm’s DRPC querying (which only allows String queries), the key was being assembled with a String. Naturally, the Enum object was not equal to its String representation, so the store didn’t match and return the data.
How did this arise? Surely you’d know when you’re comparing a String and an Enum together? Right? Well, not so much. You see, Storm cares not for your types. For example, it provides an in-memory store with a key of type List<Object>. This means you store stuff against a List<Object> and retrieve with a List<Object>. That’s hideous in so many ways.
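To make the failure concrete, here’s a minimal sketch in plain Java, with no Storm involved. A HashMap keyed on List<Object> stands in for the store, and the Source enum and "user-1" key are made up for illustration. The compiler is perfectly happy with both lookups; only one of them finds anything.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EnumKeyMismatch {
    // Hypothetical enum standing in for the one that formed part of our key.
    enum Source { WEB, MOBILE }

    // Populate the store the way the input side did: the key holds the Enum itself.
    static Map<List<Object>, Integer> buildStore() {
        Map<List<Object>, Integer> store = new HashMap<>();
        store.put(Arrays.<Object>asList("user-1", Source.WEB), 42);
        return store;
    }

    public static void main(String[] args) {
        Map<List<Object>, Integer> store = buildStore();

        // Query side: the key was assembled from Strings, so the Enum is now "WEB".
        List<Object> stringKey = Arrays.<Object>asList("user-1", "WEB");
        System.out.println(store.get(stringKey)); // prints null

        // Converting the String back to the Enum before the lookup finds the data.
        List<Object> enumKey = Arrays.<Object>asList("user-1", Source.valueOf("WEB"));
        System.out.println(store.get(enumKey)); // prints 42
    }
}
```

List.equals compares element by element, and a String is never equal to an Enum constant, so the first lookup quietly misses rather than failing.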
every problem has a solution
So, Storm doesn’t really bother with type safety. That doesn’t mean it’s OK to join in. You must resist! Type everything you can, and make sure you know what type every element in your tuple is. Don’t let yourself pass around instances of Object. Bad things will happen. If you need to use a map with a key of List<Object>, consider subclassing it, and creating something more specific to your needs. By all means use the logic that Storm provides, but be sure to wrap it in a layer of type-safe loveliness.
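One way to add that type-safe layer might look like the sketch below. It wraps a List<Object>-keyed map rather than subclassing it, and a plain HashMap stands in for Storm’s store. EventKey, EventStore and Source are hypothetical names invented for this example, not anything Storm provides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class TypedKeyExample {
    // Hypothetical enum that forms part of the key.
    enum Source { WEB, MOBILE }

    // Hypothetical typed key: it can only be built from a String and a Source.
    static final class EventKey {
        private final String userId;
        private final Source source;

        EventKey(String userId, Source source) {
            this.userId = Objects.requireNonNull(userId);
            this.source = Objects.requireNonNull(source);
        }

        // The only place where the key is flattened into its untyped form.
        List<Object> toList() {
            return Arrays.<Object>asList(userId, source);
        }
    }

    // Thin typed wrapper around a List<Object>-keyed map.
    static final class EventStore<V> {
        private final Map<List<Object>, V> delegate = new HashMap<>();

        void put(EventKey key, V value) {
            delegate.put(key.toList(), value);
        }

        V get(EventKey key) {
            return delegate.get(key.toList());
        }
    }

    public static void main(String[] args) {
        EventStore<Integer> store = new EventStore<>();
        store.put(new EventKey("user-1", Source.WEB), 42);

        // A String from the query side has to be converted before a key can exist.
        System.out.println(store.get(new EventKey("user-1", Source.valueOf("WEB")))); // prints 42

        // store.get(new EventKey("user-1", "WEB")); // would not compile
    }
}
```

With this in place, the String-versus-Enum mix-up becomes a compile error instead of an empty query result.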
Another area we’re working on improving is the DRPC query. The initial solution we came up with to pass parameters to a DRPC query was to use a series of Strings joined by some separator. While OK for a first pass, it’s limiting, and it makes passing in non-String parameters a pain. We decided to create our own DRPCQuery type. By ensuring that the type knows how to serialise and deserialise itself to and from a String, we have a simple and easy-to-use method to pass in parameters with correct types. Like most of our Storm code, it’s very much work-in-progress, but it should make Storm a little less of a handful.
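As a rough illustration of the idea (this is not our actual implementation), a query type that round-trips through a String might look something like this. The fields, the "|" separator and the Source enum are all assumptions made up for the sketch.

```java
import java.util.Objects;
import java.util.regex.Pattern;

class DRPCQueryExample {
    // Hypothetical enum parameter, to show a non-String value surviving the trip.
    enum Source { WEB, MOBILE }

    static final class DRPCQuery {
        // Assumed wire format: userId|source|limit
        private static final String SEPARATOR = "|";

        final String userId;
        final Source source;
        final int limit;

        DRPCQuery(String userId, Source source, int limit) {
            if (userId.contains(SEPARATOR)) {
                throw new IllegalArgumentException("userId must not contain " + SEPARATOR);
            }
            this.userId = userId;
            this.source = Objects.requireNonNull(source);
            this.limit = limit;
        }

        // Flatten to the single String that a DRPC call accepts.
        String serialise() {
            return userId + SEPARATOR + source.name() + SEPARATOR + limit;
        }

        // Rebuild the typed query; bad input fails here, not deep inside the topology.
        static DRPCQuery deserialise(String raw) {
            String[] parts = raw.split(Pattern.quote(SEPARATOR), -1);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields but got " + parts.length);
            }
            return new DRPCQuery(parts[0], Source.valueOf(parts[1]), Integer.parseInt(parts[2]));
        }
    }

    public static void main(String[] args) {
        String wire = new DRPCQuery("user-1", Source.WEB, 10).serialise();
        System.out.println(wire); // prints user-1|WEB|10

        DRPCQuery parsed = DRPCQuery.deserialise(wire);
        System.out.println(parsed.source == Source.WEB); // prints true
    }
}
```

The String still exists on the wire, but only serialise and deserialise ever look at it; everything else works with real types.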
nothing’s perfect, however…
Tuples in Storm are an interesting bag. When you use a Trident Function in a Topology, you tell Storm the names of the fields you’re passing in, and what you want the output fields to be called. It’s then up to the function what it does with the input, and what it outputs. It could entirely ignore the input. It could try and cast it to anything at all. It could produce more or fewer fields than were specified where the function is used. None of this will cause any complaints until you run your Topology, at which point some nasty runtime exception will crop up. If you’re lucky, it’ll tell you why it failed. If you’re less lucky, you’ll just get an NPE.
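The casting problem is easy to reproduce without Storm at all. In the sketch below, a List<Object> stands in for a tuple and readCount is a hypothetical function body that assumes its first field is an Integer. It compiles whatever the tuple really holds, and only complains when it runs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class UncheckedTupleExample {
    // Stand-in for a function body: nothing stops it assuming field 0 is an Integer.
    static int readCount(List<Object> tuple) {
        return (Integer) tuple.get(0);
    }

    // Run readCount and report how it went, to show the three possible outcomes.
    static String outcome(List<Object> tuple) {
        try {
            readCount(tuple);
            return "ok";
        } catch (ClassCastException e) {
            return "ClassCastException";
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        // The field really is an Integer: all good.
        System.out.println(outcome(Arrays.<Object>asList(42))); // prints ok

        // The field is a String: the cast fails, and at least the exception says why.
        System.out.println(outcome(Arrays.<Object>asList("42"))); // prints ClassCastException

        // The field is missing: the cast succeeds, then unboxing null blows up.
        System.out.println(outcome(Collections.<Object>singletonList(null))); // prints NullPointerException
    }
}
```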
We’ve yet to find an ideal solution to this. Do you have any clever ideas? Please let us know in the comments below!