we write about the things we build and the things we consume
let cassandra identify your data
Well, it happens we also used Mongo to generate identifiers for newly inserted content into our Atlas platform, hence another problem suddenly came up: how do we generate those identifiers now that Mongo is gone?
Let the journey begin.
what and why
Functionally speaking, we have two strong requirements:
- Identifiers should be only numbers, human-friendly and as short as possible.
- We should support already existent identifiers (we have millions and they all adhere to the first requirement above).
Our identifiers are directly faced by our users, who use them for searching and obviously getting handles to contents: we like users to easily type and track them, so the first requirement comes as consequence, while the second one is obviously driven by the need to be backward-compatible, as we cannot break all existent user applications.
Non-functionally speaking instead, we have one single strong requirement: reliability. We should always be able to generate new identifiers whenever needed, while performance and scalability are a minor concern, as our data access patterns are dominated by reads and updates, rather than insertion of new content.
So candidates rapidly queued up: let's have a look at them.
First candidate coming in our mind was obviously the Snowflake id generation service from Twitter: open source, industry-proven, brilliant algorithm, nice name. Unfortunately, it doesn't suit our requirements, as Snowflake-generated identifiers tend to be quite long. Also, Snowflake requires a running Zookeeper cluster: not much of a burden, except it would require more boxes and more moving parts in our infrastructure, while we prefer self-contained, easy-to-manage, things, possibly keeping infrastructure management overhead at minimum.
We already have a running Cassandra cluster, so next candidate was Cassandra counters: built-in, already in-house, no additional moving parts. Unfortunately this doesn't work at all, as Cassandra counters cannot be atomically increased and retrieved, meaning that different clients may end up retrieving the same identifier after updating.
By the way, we really loved the idea of using our running Cassandra cluster: so we started investigating how to generate distributed identifiers on top of that.
astyanax & cassandra
Unless you go with the uncoordinated approach similar to Snowflake, the first thing to do for generating identifiers in a distributed environment is to coordinate for atomic set-and-get operations, where the "set" operation here is the generation of the next unique identifier and the "get" is its retrieval: the Astyanax client library comes to help here, as it provides a distributed locking implementation on top of Cassandra.
Without going into much detail, as you can get them out of Astyanax docs, here's how its locking mechanism work:
- Each lock is represented by a row keyed after a user-specified lock "name".
- Upon lock acquisition attempt, a uniquely identified column is put into the row, and that means lock has been acquired by a given client (under the hood, it is more complex than this, but let's simplify for the sake of this post).
- While the lock is held, the client usually reads and prepares data to update on the same "locked" row: this is important as operations on the same row are atomic and isolated.
- Finally, client updates are executed upon lock release.
Devising an id generation algorithm is as easy as:
- Acquiring the lock.
- Reading the current id from the locked row and updating it by one.
- Storing the updated id upon lock release.
Unfortunately, the naive algorithm above has two major problems: it is terribly slow, as it does a few roundtrips to Cassandra just to generate a single identifier, and is prone to severe contention on lock acquisition.
Let's talk about how we overcame them.
The first and most important problem is about performance: doing a few roundtrips to Cassandra just to generate a single identifier is terribly inefficient, so we had to implement batching. This means we "generate" a given number of identifiers out of Cassandra, then we keep them locally by the client and avoid roundtripping to Cassandra until we exhaust the batch. This turns the id generation algorithm into the following:
- If we have a batch of identifiers to serve locally:
- Return the next identifier from the batch.
- Acquire the lock.
- Read the current id from the locked row and update it by the batch length.
- Store the updated id upon lock release.
- Update local batch and return next identifier.
So, at the risk of "losing" some identifier in case of client-side failures, we greatly improved performance by amortizing the cost of Cassandra calls over the batch length: but, we're still prone to contention on lock acquisition, in particular when batch length is similar in all clients, and they all consume identifiers at same pace.
Let's stripe over them.
Lock striping is a fundamental technique to reduce lock contention, by basically splitting a single lock into several different locks, each one guarding an independent data subset or operation.
How does it translate into our use case? We basically want to generate more identifiers in parallel, striping over not just the acquired locks, but also over the sequence of generated identifiers to avoid collisions, hence here's the striping strategy:
- Define a target parallelism level, which will be the number of stripes we want to allow for.
- Assign each client a stripe identifier, which will also be the first identifier generated for that specific stripe.
- Identify locks not just by their name, but by the name plus the stripe identifier.
The id generation algorithm is exactly the same as before, with the difference that each stripe will hold a different lock and generate (in parallel) a different sequence of identifiers. Here's an example with three stripes:
- Sequence for stripe id 1: 1, 4, 7, 10 ...
- Sequence for stripe id 2: 2, 5, 8, 11 ...
- Sequence for stripe id 3: 3, 6, 9, 12 ...
So everything is algorithmically beautiful, but as we often say, devil is in the details...
finally back it off
Unfortunately, Astyanax distributed lock implementation has a quite inefficient conflict resolution algorithm, which makes conflicts a real problems even with batching and striping in the mix. That's because it is exclusively based on a back-off-and-retry algorithm, and while the correct solution would be to replace it with something smarter, we decided to go with the fastest solution (at least for now) and improve the back-off algorithm instead.
Astyanax provides the following two back-off algorithms: constant and exponential time. But, when clients proceed with lock requests at the same pace, both of them are equally bad as clients will either end up applying the same back-off time sequence, follow the same retry pattern and keep conflicting again and again, or quickly approaching a very high back-off time which will make for a very bad lock latency. So we implemented a random back-off algorithm as follows:
- Initial back-off time is generated randomly.
- Subsequent ones are calculated by multiplying times the number of attemps.
- If back-off time exceeds a given timeout, retries stop and lock acquisition fails.
The timeout case above never happened in our tests, and lock latency was actually much lower, confirming so our algorithm was good to go.
We tested on a single-node Cassandra instance, with 10 concurrent clients, and got the following numbers:
- Non-striped, 100 batch length: ~30000 identifiers per second.
- 5 stripes (so with some contention overhead still), 100 batch length: ~100000 identifiers per second.
Obviously, numbers will be different when moving to a proper cluster with larger quorum, but our target is way lower than that, so we're quite happy with those numbers.
enough talking now, show me some code!
Plain, non-striped generators are created as follows:
CassandraId.plainGenerator(context, lockCF, targetCF, firstId, batchLength)
- context is the Astyanax context to connect to Cassandra.
- lockCF is the name of the column family where the lock row will be put.
- targetCF is the name of the column family the lock refers to, let's say the "locked" column family (but could really be everything).
- firstId is the first identifier to generate in the sequence.
- batchLength is how many identifiers to generate in a single Cassandra call.
Striped generators are created instead as follows:
CassandraId.stripedGenerator(context, lockCF, targetCF, stripeId, stripes, batchLength)
- stripeId is the stripe identifier of this client and first identifier in the sequence.
- stripes is the max number of stripes and desired parallelism level.
- batchLength is how many identifiers to generate in a single Cassandra call.
Finally, actually generating identifiers is as easy as:
Everything we discussed so far is open source as usual: so feel free to have a look at it, it's beta quality and more improvements will come, so don't miss to get back with feedback !