Fred van den Driessche

d is for denormalization

Continuing the Atlas A-Z series, this article is brought to you by the letter D. D is for Denormalization.

to normalize or denormalize

Database normalization and denormalization refer to how stored data is organized and how it relates to itself. Normalized data has low redundancy and inter-dependency; it’s often divided into a large number of tables. Normalization makes it easier to insert and modify data since each change needs to be made in very few places. Retrieving data is relatively harder, as many tables may need to be joined to fetch all the relevant attributes.

Denormalization, unsurprisingly, is essentially the opposite process. Rather than concentrating on isolating data points, denormalization increases redundancy and duplication of data. This can greatly improve read performance since fewer tables need to be accessed. The trade-off is the increased effort of maintaining the consistency and integrity of the duplicated data at write time.
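
To make the contrast concrete, here’s a minimal sketch in Python using sqlite3; the tables and fields are invented for illustration, not taken from any real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Normalized: each fact lives in exactly one place.
    CREATE TABLE item    (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE version (id INTEGER PRIMARY KEY, item_id INTEGER, duration INTEGER);
    INSERT INTO item    VALUES (1, 'Doctor Who');
    INSERT INTO version VALUES (10, 1, 3600);
""")

# Reads pay the cost: a join is needed to reassemble the full record.
row = db.execute("""
    SELECT item.title, version.duration
    FROM item JOIN version ON version.item_id = item.id
    WHERE item.id = 1
""").fetchone()

# Denormalized: the same data held redundantly in one structure, so a read
# is a single lookup -- but every write must keep the copies in sync.
denormalized = {"id": 1, "title": "Doctor Who",
                "versions": [{"id": 10, "duration": 3600}]}
```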

swooping owl

So where does Atlas stand when it comes to denormalization? Well, it’s currently backed by MongoDB which, being the “leading NoSQL database”, pushes you in the direction of denormalization: integrity constraints between collections must be managed by the application, and the document model lends itself to nesting sub-resources.

The Atlas model also lends itself to denormalization. If we go back to last time, the Item type is a good example: Items have many Versions, Versions have many Encodings and Encodings have many Locations, all one-to-many relationships. In a normalized model there would be a separate table for each type, so to find all the Locations for an Item you’d have to join across all four tables. Instead, each list is nested in the document above it, so all the Locations for an Item are stored directly in that Item’s document, making them quick to access.
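
Here’s a sketch of what that nesting looks like with pymongo, assuming a local MongoDB; the field names are illustrative rather than Atlas’s exact schema.

```python
from pymongo import MongoClient

items = MongoClient()["atlas"]["items"]

# An Item document with its Versions, Encodings and Locations nested inside.
items.insert_one({
    "_id": "item1",
    "versions": [{
        "encodings": [{
            "locations": [
                {"uri": "http://example.com/item1.mp4"},
            ],
        }],
    }],
})

# One document fetch yields every Location; a normalized schema would need
# joins across item, version, encoding and location tables.
doc = items.find_one({"_id": "item1"})
locations = [loc
             for version in doc["versions"]
             for encoding in version["encodings"]
             for loc in encoding["locations"]]
```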

Relations between top-level types are slightly more involved as they are generally many-to-many relationships: a piece of Content can have many Topics and a Topic can be referred to by many pieces of Content. In a normalized model there would be three tables: one each for Content and Topics, and a third linking the two. In the document model, the relationship is denormalized in one or both directions, with the references stored directly on the Content or Topic documents. Since collections can’t be joined, two queries are needed to find the Topics for a Content document. A further denormalization would store the Topics themselves, rather than references to them, in the Content collection, meaning only one request is required. The feasibility of such denormalizations depends on the cardinality of the relationship and how maintainable the duplicated data is.
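
A sketch of the two approaches, again with pymongo; the collection and field names (topic_refs, topics) are hypothetical:

```python
from pymongo import MongoClient

db = MongoClient()["atlas"]

# Reference-based: fetch the Content document, then the Topics it points at.
content = db["content"].find_one({"_id": "item1"})
topics = list(db["topics"].find({"_id": {"$in": content["topic_refs"]}}))

# Further denormalized: the Topic documents themselves are embedded, so one
# query suffices -- but every referring Content document must be rewritten
# whenever a Topic changes.
content = db["content"].find_one({"_id": "item1"})
topics = content["topics"]
```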

gestating deer

As we develop deer, the next version of Atlas, we’re really concentrating on how we use the database, which in deer’s case is Cassandra. We’re aiming to minimize the number of rows fetched to furnish a request. Take, for example, schedules: currently they are index-based, with the schedule divided into hour-long buckets that contain references to the items broadcast during that hour. To resolve a schedule, all the relevant hour buckets need to be fetched and then all the items they reference in a separate secondary query. That means for an average day on BBC One you’re fetching 24 schedule buckets and then about 35 items. In deer, we’ve reduced that to one or two rows: a row contains all the items for a single day of a channel according to a source. Since we’ve duplicated item data from the Content column family, whenever an item changes we have to be sure to update it in all the places it appears in a schedule.
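
Here’s one plausible shape for such a day-per-row table in CQL, via the DataStax Python driver; the table layout and names are a rough sketch, not deer’s actual schema:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("atlas")

session.execute("""
    CREATE TABLE IF NOT EXISTS schedule (
        source  text,
        channel text,
        day     text,
        items   text,  -- broadcast items, duplicated from the Content column family
        PRIMARY KEY ((source, channel, day))
    )
""")

# One partition now holds a channel's whole day, so a schedule request is a
# single row read rather than 24 bucket reads plus ~35 item reads.
row = session.execute(
    "SELECT items FROM schedule WHERE source=%s AND channel=%s AND day=%s",
    ("a-source", "bbcone", "2014-06-01"),
).one()
```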

Another area we’re focusing on is Equivalence. As with schedules, Equivalence relationships are stored as references, so all the resources that make up an Equivalence set need to be resolved from their own rows, which doesn’t perform well for large sets. To reduce the number of rows, in deer each equivalence set is written into its own row. Again, whenever a resource or equivalence set changes, the relevant rows need to be updated.
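
In the same spirit, a sketch of a per-set row; the schema is again invented for illustration:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("atlas")

session.execute("""
    CREATE TABLE IF NOT EXISTS equivalence_set (
        set_id    bigint PRIMARY KEY,
        resources text  -- serialized copies of every resource in the set
    )
""")

# Reading a whole set is now a single row fetch, however large the set...
row = session.execute(
    "SELECT resources FROM equivalence_set WHERE set_id = %s", (1234,)).one()

# ...but when a member resource changes, or the set's membership does, the
# row must be rewritten to keep the duplicated copies consistent.
```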

So what about schedules for Equivalent Content? We hope to be able to solve that problem in a similar way. It all relies on change messages being propagated to the relevant stores to trigger updates to the appropriate rows. These denormalized stores don’t replace the current stores: the Equivalent Content store will sit alongside the Content store as a sort of materialized view.
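
A sketch of how that propagation might hang together; the message shape and store interface here are assumptions for illustration, not deer’s actual code:

```python
class EquivalentContentStore:
    """A materialized view that sits alongside the Content store."""

    def update(self, resource_id, resource):
        # Rewrite every equivalence-set row containing this resource;
        # the primary Content store is untouched.
        ...

class ScheduleStore:
    def update(self, resource_id, resource):
        # Rewrite every schedule day-row in which this item appears.
        ...

def on_change_message(message, stores):
    # Fan one change message out to each denormalized store in turn.
    for store in stores:
        store.update(message["resource_id"], message["resource"])
```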

The main aim of these denormalized stores is to massively simplify reading data for requests and to make those reads much faster. Of course, they complicate writing and maintenance, but that’s the right trade-off to make: data in Atlas, as with almost any web service, is read many orders of magnitude more often than it is written.
