we write about the things we build and the things we consume

enhancing cassandra powers with secondary indexes

As you may recall from our previous post about migrating the storage infrastructure of the next-generation Atlas Platform, along with all existing data, from MongoDB to Cassandra, one of the challenges was implementing the same indexing and querying capabilities on top of Cassandra.

While we implemented the most complex and rich ones on top of ElasticSearch, we were left with two "basic" but very important ones: one-to-one and many-to-many secondary indexes.

One-to-one secondary indexes are currently used in our data set to implement unique constraints, i.e. when we're interested in retrieving a piece of content by some unique attribute other than its primary key:

Content:

content -> primary key, unique attribute

Index:

unique attribute -> primary key              
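To make the mapping concrete, here is a minimal in-memory sketch of a one-to-one index. It only models the idea with plain Java maps: the class name DirectIndexModel, the build method and the sample keys are made up for illustration and are not part of our CassandraIndex class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a one-to-one index; not the CassandraIndex API.
class DirectIndexModel {

    // Builds "unique attribute -> primary key" from "primary key -> unique attribute".
    // Fails if two pieces of content share the same attribute value.
    static Map<String, String> build(Map<String, String> uniqueAttributeByPrimaryKey) {
        Map<String, String> index = new HashMap<>();
        for (Map.Entry<String, String> entry : uniqueAttributeByPrimaryKey.entrySet()) {
            String previous = index.put(entry.getValue(), entry.getKey());
            if (previous != null) {
                throw new IllegalStateException("attribute is not unique: " + entry.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> content = new HashMap<>();
        content.put("content-1", "uri-a");
        content.put("content-2", "uri-b");

        Map<String, String> index = build(content);
        System.out.println(index.get("uri-b")); // prints content-2
    }
}
```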

Many-to-many secondary indexes, on the other hand, are currently used to implement a kind of inverted index: we often have parent-child relationships in which a child can have more than one parent, and we want to find all the parents of a given child:

Content:

parent1 -> child1, child2
parent2 -> child2

Index:

child1 -> parent1
child2 -> parent1, parent2        
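The inversion shown above can be sketched in a few lines of plain Java. This is just an in-memory illustration of the data shape, assuming string keys; InvertedIndexModel and invert are hypothetical names and have nothing to do with how the data is laid out in Cassandra.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of a many-to-many (inverted) index.
class InvertedIndexModel {

    // Turns "parent -> children" into "child -> parents".
    static Map<String, Set<String>> invert(Map<String, List<String>> childrenByParent) {
        Map<String, Set<String>> parentsByChild = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : childrenByParent.entrySet()) {
            for (String child : entry.getValue()) {
                parentsByChild.computeIfAbsent(child, key -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return parentsByChild;
    }

    public static void main(String[] args) {
        Map<String, List<String>> content = new LinkedHashMap<>();
        content.put("parent1", Arrays.asList("child1", "child2"));
        content.put("parent2", Arrays.asList("child2"));

        // prints {child1=[parent1], child2=[parent1, parent2]}
        System.out.println(invert(content));
    }
}
```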

We could have implemented this on top of ElasticSearch too, but we didn't want to go through a fully fledged search index for simple key-based secondary indexes. Alternatively, we could have implemented it on top of native Cassandra secondary indexes, but they are not designed to index frequently changing data whose indexed values have high cardinality, which is exactly our case, so this ruled them out too.

So we wrote a simple utility class to implement the secondary indexes we needed on top of standard Cassandra column families, and decided to share it with the community and possibly contribute it back later.

Let the fun begin.

at a glance

For the impatient and the code-hungry, here's what it looks like to create a direct (unique constraint) index:

index.
    direct(keyspace, columnFamily, consistencyLevel).
    from(uniqueValue).
    to(primaryKey).
    index().
    execute(1, TimeUnit.MINUTES);

Our Cassandra index is a simple DSL built on top of the Astyanax Cassandra client, and that's the only additional dependency we need.

As you can see above, the first thing you do is choose either a direct index or an inverted one:

CassandraIndex index = new CassandraIndex();
// Direct:
index.direct(keyspace, columnFamily, consistencyLevel);
// Inverted:
index.inverted(keyspace, columnFamily, consistencyLevel);

You have to specify the Cassandra keyspace, column family and consistency level you want to work with. The Astyanax client mentioned above provides easy APIs and good documentation on how to create such objects, so I will not go into further detail.

After that, different method calls will follow depending on the type of index and the operation you want to perform.

Let's start with the direct one.

be direct

Creating a direct index is as easy as additionally specifying the unique attribute value you want to index from, and the actual primary key value you want to refer to:

index.
    direct(keyspace, columnFamily, consistencyLevel).
    from(uniqueValue).
    to(primaryKey).
    index().
    execute(1, TimeUnit.MINUTES);

The unique value can really be anything in your data model: it doesn't need to be a separate column in your Cassandra model, nor is uniqueness enforced by Cassandra itself. It is the programmer's job to make sure the value is actually meaningful and actually unique. The primary key, instead, has to be the row key of the indexed data, as this is what the lookup action will rely on later.

Speaking of which, here's how to look up from a direct index:

String primaryKey = index.
    direct(keyspace, columnFamily, consistencyLevel).
    from(uniqueValue).
    lookup().
    execute(1, TimeUnit.MINUTES);

It is pretty similar to the previous index call, except that you only provide the unique value you want to look up: the execute call returns the previously indexed primary key, which you can use as a row key to get the full data.
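The two-step read described here (index lookup first, then a fetch by row key) can be sketched as follows. Both the index and the content store are modelled as plain maps, and TwoStepRead and findByUniqueValue are hypothetical names used only for this illustration. The sketch also covers the case where an index entry points to a row that no longer exists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of "look up the primary key, then fetch the row".
class TwoStepRead {

    // Resolves a unique value to the full content, going through the index.
    // Returns empty if the value is not indexed or the index entry is stale.
    static Optional<String> findByUniqueValue(Map<String, String> index,
                                              Map<String, String> content,
                                              String uniqueValue) {
        return Optional.ofNullable(index.get(uniqueValue)).map(content::get);
    }

    public static void main(String[] args) {
        Map<String, String> content = new HashMap<>();
        content.put("content-1", "title=A");

        Map<String, String> index = new HashMap<>();
        index.put("uri-a", "content-1");
        index.put("uri-stale", "content-9"); // points to a row that is gone

        System.out.println(findByUniqueValue(index, content, "uri-a").orElse("not found"));     // prints title=A
        System.out.println(findByUniqueValue(index, content, "uri-stale").orElse("not found")); // prints not found
    }
}
```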

Finally, you can also easily delete indexes as follows:

index.
    direct(keyspace, columnFamily, consistencyLevel).
    from(uniqueValue).
    delete().
    execute(1, TimeUnit.MINUTES);

This is almost identical to the lookup call, except that this time you ask to delete the unique value (and its associated primary key).
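Since uniqueness is left to the programmer, one possible approach is to look up the value before indexing it and refuse to overwrite an entry that points to a different primary key. The sketch below models that check against an in-memory map; UniqueIndexGuard is a hypothetical class, not something our code provides. Keep in mind that against a real cluster this read-then-write sequence is not atomic, so it reduces the chance of duplicates without ruling them out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard around a one-to-one index, backed by an in-memory map.
class UniqueIndexGuard {

    // Stand-in for the index column family: unique value -> primary key.
    private final Map<String, String> rows = new HashMap<>();

    // Indexes the pair unless the unique value already points to another primary key.
    void index(String uniqueValue, String primaryKey) {
        String existing = rows.get(uniqueValue);
        if (existing != null && !existing.equals(primaryKey)) {
            throw new IllegalStateException(
                uniqueValue + " is already indexed for " + existing);
        }
        rows.put(uniqueValue, primaryKey);
    }

    String lookup(String uniqueValue) {
        return rows.get(uniqueValue);
    }

    void delete(String uniqueValue) {
        rows.remove(uniqueValue);
    }

    public static void main(String[] args) {
        UniqueIndexGuard guard = new UniqueIndexGuard();
        guard.index("uri-a", "content-1");
        guard.index("uri-a", "content-1"); // same pair again: allowed
        try {
            guard.index("uri-a", "content-2"); // different primary key: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints uri-a is already indexed for content-1
        }
        guard.delete("uri-a");
        guard.index("uri-a", "content-2"); // allowed once the old entry is deleted
        System.out.println(guard.lookup("uri-a")); // prints content-2
    }
}
```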

or indirect

Inverted indexes are created instead by providing the primary key of the source data you want to index from, and the values you actually want to index:

index.
    inverted(keyspace, columnFamily, consistencyLevel).
    from(primaryKey).
    index(value1, value2).
    execute(1, TimeUnit.MINUTES);

Then you can look up a value and get all the primary keys pointing to it:

Collection primaryKeys = index.
    inverted(keyspace, columnFamily, consistencyLevel).
    lookup(value1).
    execute(1, TimeUnit.MINUTES);

And, obviously, you can delete indexed values from a primary key:

index.
    inverted(keyspace, columnFamily, consistencyLevel).
    from(primaryKey).
    delete(value1).
    execute(1, TimeUnit.MINUTES);
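To clarify what the three inverted operations are expected to do, here is a small in-memory model of their semantics. It assumes that each indexed value maps to the set of primary keys pointing to it; InvertedIndexOps is a hypothetical class that merely mirrors the method names of the DSL, and it says nothing about how the rows and columns are actually stored.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of index / lookup / delete on an inverted index.
class InvertedIndexOps {

    // Indexed value -> primary keys pointing to it.
    private final Map<String, Set<String>> primaryKeysByValue = new HashMap<>();

    void index(String primaryKey, String... values) {
        for (String value : values) {
            primaryKeysByValue.computeIfAbsent(value, key -> new TreeSet<>()).add(primaryKey);
        }
    }

    Collection<String> lookup(String value) {
        Set<String> keys = primaryKeysByValue.getOrDefault(value, Collections.<String>emptySet());
        return new ArrayList<>(keys);
    }

    // Removes only the given primary key from each value; other primary keys stay indexed.
    void delete(String primaryKey, String... values) {
        for (String value : values) {
            Set<String> keys = primaryKeysByValue.get(value);
            if (keys != null) {
                keys.remove(primaryKey);
                if (keys.isEmpty()) {
                    primaryKeysByValue.remove(value);
                }
            }
        }
    }

    public static void main(String[] args) {
        InvertedIndexOps index = new InvertedIndexOps();
        index.index("parent1", "child1", "child2");
        index.index("parent2", "child2");
        System.out.println(index.lookup("child2")); // prints [parent1, parent2]

        index.delete("parent1", "child2");
        System.out.println(index.lookup("child2")); // prints [parent2]
        System.out.println(index.lookup("child1")); // prints [parent1]
    }
}
```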

That's really all, and we hope you've found it as easy and useful as we did.

where can I look it up?

Our CassandraIndex class is part of our Atlas open source effort, so you can have a look at it right now. By the way, we plan to polish it a bit and contribute it as an Astyanax recipe: if you have any suggestions or any other kind of feedback, please get in touch, as always, via Twitter or the blog comments!
