or a little on equivalence in atlas
Atlas pulls in metadata from an ever-increasing number of sources. Some sources describe when a programme was broadcast and on what channel, some list on-demand locations where you can catch up with a show, while others might simply provide the topics covered by a particular episode. Often Atlas will ingest data about a given show from a number of different sources, each giving a slightly different view of the content.
Having all this data is great on its own, but it becomes massively more useful if Atlas knows that two pieces of data actually describe the same content. For example, the BBC’s data on Doctor Who might show an episode was broadcast on BBC One at 7pm on Saturday, whereas iTunes’ view would provide links to where you could buy that episode on the iTunes Store.
To find these links between data from discrete sources, Atlas has an internal background task known as Equivalence. Once these links have been found, the equivalent data is composed on-the-fly for each request to the API.
the 5 stages of equivalence
A high-level overview of the Equivalence job is as follows:
    for each piece of content:
      for every other piece of content:
        if the two are equivalent: // this is the hard part
          record that equivalence
You can imagine it filling a table with a row and column for each piece of content, marking the shared cells of equivalent content. Of course, the described process takes time proportional to the square of the number of pieces of content, i.e. it’s O(n²), so even for a modest amount of data it’s going to take a long time. To speed up and refine this process the Equivalence job goes through 5 stages for each piece of content.
Generation: This is the process of finding candidate equivalents. The accuracy and performance of this stage is critical: a generator that misses candidates is bad, but one that finds too many is also undesirable. Generators usually rely on an external index of the data; for example, the schedule index is a very effective basis: if two episodes are on at the same time on the same channel, they’re likely to be the same.
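To make the schedule-index idea concrete, here’s a minimal sketch in Python. The names and data shapes are illustrative, not Atlas’s actual API: an index keyed on channel and start time turns candidate generation into a lookup rather than a pairwise scan.

```python
from collections import defaultdict

def build_schedule_index(broadcasts):
    """Map (channel, start time) -> ids of content broadcast in that slot."""
    index = defaultdict(list)
    for content_id, channel, start in broadcasts:
        index[(channel, start)].append(content_id)
    return index

def generate_candidates(index, content_id, channel, start):
    """Candidates are other items broadcast in the same slot."""
    return [c for c in index[(channel, start)] if c != content_id]

# Illustrative ids: the same Doctor Who episode seen by two sources.
broadcasts = [
    ("bbc:doctor-who-1", "bbc-one", "2013-04-06T19:00"),
    ("pa:doctor-who-1",  "bbc-one", "2013-04-06T19:00"),
    ("pa:casualty-1",    "bbc-one", "2013-04-06T20:00"),
]
index = build_schedule_index(broadcasts)
print(generate_candidates(index, "bbc:doctor-who-1", "bbc-one", "2013-04-06T19:00"))
# -> ['pa:doctor-who-1']
```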
Scoring: Once candidate equivalents have been proposed, it’s this stage’s responsibility to objectively determine the quality of the match. It inspects a number of different properties of each candidate and assigns each property a score between -1 and 1. For instance, a perfect title match would score 1, while totally different titles might score -1.
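As a toy illustration of property scoring (the real scorers are more sophisticated than this), a title scorer might map word overlap into the -1 to 1 range, so an exact match scores 1 and no overlap scores -1:

```python
def score_title(a, b):
    """Score two titles between -1 and 1 based on shared words."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return 1.0  # perfect match
    shared = len(set(a.split()) & set(b.split()))
    total = len(set(a.split()) | set(b.split()))
    # Scale the overlap ratio from [0, 1] into [-1, 1].
    return 2.0 * shared / total - 1.0

print(score_title("Doctor Who", "doctor who"))  # 1.0
print(score_title("Doctor Who", "Casualty"))    # -1.0
```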
Combination: This stage creates an overall final score for each candidate equivalent. It gives the opportunity to modulate the scores of some properties and, in general, defines how the separate property scores are reduced to a single score.
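One plausible shape for a combiner, sketched here with made-up weights rather than Atlas’s actual configuration, is a weighted average of the per-property scores, so that some properties (say, broadcast time) count for more than others:

```python
def combine(property_scores, weights):
    """Reduce per-property scores to a single weighted-average score."""
    total_weight = sum(weights[p] for p in property_scores)
    return sum(weights[p] * s for p, s in property_scores.items()) / total_weight

# Illustrative weights and scores for one candidate.
weights = {"title": 1.0, "broadcast": 2.0, "description": 0.5}
scores = {"title": 1.0, "broadcast": 1.0, "description": -0.2}
print(round(combine(scores, weights), 3))  # 0.829
```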
Filtration: The nature of the Equivalence process means that sanity checks are always a good idea. If there are combinations of content that are known never to be equivalent, then this stage provides a final backstop to ensure they don’t get through.
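A minimal backstop of the kind described might look like this; the score threshold and the type rule are illustrative assumptions, not Atlas’s actual checks:

```python
def filter_candidates(candidates, threshold=0.2):
    """candidates: list of (candidate_id, combined_score, candidate_type, subject_type)."""
    survivors = []
    for candidate, score, cand_type, subject_type in candidates:
        if score < threshold:
            continue  # too weak a match to keep
        if cand_type != subject_type:
            continue  # e.g. an episode can never equal a brand
        survivors.append(candidate)
    return survivors

candidates = [
    ("pa:doctor-who-1", 0.83, "episode", "episode"),
    ("pa:casualty-1",   0.05, "episode", "episode"),
    ("pa:doctor-who",   0.60, "brand",   "episode"),
]
print(filter_candidates(candidates))  # ['pa:doctor-who-1']
```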
Extraction: By this stage there will be a number of surviving candidates, some of which may be from the same external source. To keep things simple, each piece of content can have only one direct equivalent per source, so this stage narrows down the candidates to one per source. This is done by taking the highest-scoring candidate and determining whether its match is strong enough relative to all other candidates from that source. If it is, it’s recorded as a direct equivalent; otherwise no equivalent is recorded for that source.
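The narrowing-down can be sketched as follows; the margin rule is a guess at the general shape of “strong enough relative to all other candidates”, not the exact criterion Atlas uses:

```python
from collections import defaultdict

def extract(scored, margin=0.2):
    """scored: list of (source, candidate, score) -> {source: chosen candidate}."""
    by_source = defaultdict(list)
    for source, candidate, score in scored:
        by_source[source].append((score, candidate))
    result = {}
    for source, entries in by_source.items():
        entries.sort(reverse=True)
        best_score, best = entries[0]
        runner_up = entries[1][0] if len(entries) > 1 else float("-inf")
        if best_score - runner_up >= margin:
            result[source] = best  # clearly better than the rest of its source
        # otherwise: too close to call, so no equivalent for this source
    return result

scored = [
    ("pa",     "pa:doctor-who-1", 0.9),
    ("pa",     "pa:casualty-1",   0.3),
    ("itunes", "itunes:ep-1",     0.55),
    ("itunes", "itunes:ep-2",     0.50),  # too close to the winner
]
print(extract(scored))  # {'pa': 'pa:doctor-who-1'}
```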
but wait, there’s more!
The Equivalence task computes direct equivalences between content, but that’s not quite the end of the story. Equivalence in Atlas is a true equivalence relation: it is reflexive, symmetric and, most importantly, transitive. This means long chains of direct equivalences can be combined into equivalence partitions. These sets of transitively equivalent content are computed and recorded at write-time. At read-time, the requested members of the set are merged together to produce a single content document with data from all available sources.
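Computing those partitions from direct equivalences is a transitive-closure problem, which a union-find (disjoint-set) structure handles neatly; this sketch shows the idea rather than Atlas’s actual write-time implementation:

```python
def partitions(direct_equivalences):
    """Group ids linked by direct equivalences into transitive partitions."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in direct_equivalences:
        parent[find(a)] = find(b)  # union the two partitions

    sets = {}
    for x in parent:
        sets.setdefault(find(x), set()).add(x)
    return list(sets.values())

# bbc<->pa and pa<->itunes chain transitively into one partition of three.
links = [("bbc:ep-1", "pa:ep-1"), ("pa:ep-1", "itunes:ep-1")]
print(partitions(links))
```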
Depending on how you count it, this is the 3rd generation of the Equivalence algorithm in Atlas and we are still learning and tweaking it to improve our content matching. There’s lots to be done, particularly for off-schedule content, and we’re investigating transforming the task to use MapReduce and possibly even Storm. If you want to get involved or have any comments or suggestions, please get in touch on the Atlas mailing list or on github.