One of Atlas’ strongest features is its ability to match and merge data about the same content from different sources into a single representation of that content. Since an ID is assigned to each source’s content data as it’s ingested by Atlas, each merged dataset actually has a set of connected IDs. We’re changing how Atlas resolves and displays the IDs of equivalent content so that the IDs for an equivalence set are consistent. Here’s how.
This section is a brief recap of how content resolution and equivalence matching work in Atlas. If you know how it all works, feel free to skip to the changes in the sections below.
Atlas ingests data from multiple sources about the same content and matches them together transitively into equivalence sets. Each separate segment of data from a source is assigned an identifier. An example is shown, in graph form, in Fig. 1. Each node represents a piece of data with its identifier and the colour represents its source. The edges show how the data have been matched together into an equivalence set.
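To make the transitive matching concrete, here is a minimal sketch that groups identifiers into equivalence sets with a union-find structure. It is an illustration only, not Atlas’ actual implementation: the function name `equivalence_sets` and the IDs are made up for the example and do not correspond to the figures.

```python
from collections import defaultdict


def equivalence_sets(ids, matches):
    """Group IDs into equivalence sets by following matches transitively."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root of x's tree, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        # Joining the two roots merges the sets that a and b belong to.
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return list(groups.values())


# Hypothetical IDs: a1 matches b1 and b1 matches c1, so a1 and c1 end up
# in the same set even though they were never matched directly.
ids = ["a1", "a2", "b1", "c1", "c2"]
matches = [("a1", "b1"), ("b1", "c1")]
sets = equivalence_sets(ids, matches)
```

Unmatched IDs such as `a2` and `c2` stay in equivalence sets of their own.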
Often the configuration of an API key means that it won’t have access to all sources. In this case Atlas will simply not show data from inaccessible sources in the final merged result. For example, in Fig. 2 the key only has access to the blue, green and yellow sources, but the links through cf2 are still used to merge the remaining five pieces together.
API keys can also be configured with a total order of precedence over their active sources. In this case the value for a field in the merged result is taken from the most precedent source which provides data for that field.
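The source filtering and precedence rules described above can be sketched as follows. This is a simplified model under assumed names: the `merge` function, the record layout and the field names are hypothetical and are not part of the Atlas API.

```python
def merge(records, precedence):
    """Merge the records of one equivalence set for a given API key.

    records: dicts with "id", "source" and "fields" (hypothetical layout)
    precedence: the key's active sources, most precedent first
    """
    rank = {source: i for i, source in enumerate(precedence)}
    # Data from sources the key cannot access is left out of the result.
    active = [r for r in records if r["source"] in rank]
    active.sort(key=lambda r: rank[r["source"]])

    merged = {}
    for record in active:
        for field, value in record["fields"].items():
            # The first value seen comes from the most precedent source
            # that provides this field, so later values are ignored.
            merged.setdefault(field, value)
    return merged


records = [
    {"id": "x1", "source": "blue", "fields": {"title": "Blue title", "year": 2012}},
    {"id": "x2", "source": "yellow", "fields": {"title": "Yellow title"}},
    {"id": "x3", "source": "red", "fields": {"title": "Red title", "genre": "drama"}},
]
# The key has access to yellow and blue only, with yellow most precedent.
result = merge(records, precedence=["yellow", "blue"])
```

Here the title comes from the yellow source, the year falls back to blue because yellow does not provide one, and nothing from the inaccessible red source appears.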
Now, on to the changes
1. earliest identifier wins
Previously, when requesting by ID, the merged result would have the ID of the data from the most precedent source active on the API key. Looking at Fig. 2, if the yellow source is most precedent, a request for cyp would return a result with ID
Now, the merged result for a request will have the ID of the first-seen data in the complete equivalence set, including data from sources which are not active. For example, the result in a response to a request for cyp will have ID cf2, even though its source is not active.
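The difference between the old and new rules can be sketched like this. The example assumes each record carries a `seen` counter recording ingest order; that field, the function names and the IDs are all hypothetical, and how Atlas actually tracks which data was seen first is not described here.

```python
def id_before(records, precedence):
    """Old rule: ID of the data from the most precedent active source."""
    rank = {source: i for i, source in enumerate(precedence)}
    active = [r for r in records if r["source"] in rank]
    return min(active, key=lambda r: rank[r["source"]])["id"]


def id_now(records):
    """New rule: ID of the first-seen data in the complete set."""
    # Every record counts, including those from inactive sources.
    return min(records, key=lambda r: r["seen"])["id"]


# Hypothetical equivalence set; "seen" is an assumed ingest counter.
records = [
    {"id": "p1", "source": "red", "seen": 1},
    {"id": "p2", "source": "yellow", "seen": 2},
    {"id": "p3", "source": "blue", "seen": 3},
]
precedence = ["yellow", "blue"]  # red is not active on this key

old = id_before(records, precedence)
new = id_now(records)
```

Under the old rule the answer depends on the key’s configuration, so two keys could see different IDs for the same content. The new rule ignores the key entirely, which is what makes the ID stable.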
2. query on any identifier
Previously, if you queried Atlas with an ID from an inactive source you’d receive an empty response. Continuing with Fig. 2, requesting cf2 would result in nothing being returned.
Now, querying with an ID from an inactive source will return active data for that equivalence set. So, a query for cf2, from an inactive source, will behave in the same way as a query for an ID from an active source, such as
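A minimal sketch of the new lookup behaviour, assuming an index from every ID to its equivalence set. The `build_index` and `query` helpers, the source names and the IDs are hypothetical and only illustrate the idea.

```python
def build_index(equivalence_sets):
    """Map every ID, active or not, to the equivalence set containing it."""
    return {r["id"]: records for records in equivalence_sets for r in records}


def query(index, requested_id, active_sources):
    """Return the IDs of the active data in the requested ID's set."""
    records = index.get(requested_id, [])
    # The requested ID only selects the set; the key's active sources
    # decide which data is returned.
    return [r["id"] for r in records if r["source"] in active_sources]


# Hypothetical equivalence set in which red is not active on the key.
records = [
    {"id": "q1", "source": "red"},
    {"id": "q2", "source": "blue"},
    {"id": "q3", "source": "green"},
]
index = build_index([records])
active = {"blue", "green"}

via_inactive = query(index, "q1", active)
via_active = query(index, "q2", active)
```

Both queries resolve to the same set and therefore return the same active data.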
3. complete equivalence sets
Previously, the equivalence sets represented in the equivalents fields of the output contained only references to data from active sources for that API key. So, with the IDs in Fig. 2, the result for cmnc74 would contain only IDs from the green and blue sources.
Now, the equivalents fields are populated with all IDs from the complete equivalence set, so all 10 IDs in the case of Fig. 2. N.B. Since equivalence is reflexive, the ID for the merged result will be an element in these fields: cf2 is equivalent to
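The change to the equivalents fields can be sketched as below. The function names, record layout and IDs are hypothetical, and the real output format of the equivalents fields is not reproduced here.

```python
def equivalents_before(records, active_sources):
    """Old behaviour: only IDs of data from the key's active sources."""
    return sorted(r["id"] for r in records if r["source"] in active_sources)


def equivalents_now(records):
    """New behaviour: every ID in the complete equivalence set."""
    return sorted(r["id"] for r in records)


# Hypothetical equivalence set; red is not active on this key, and e1
# is taken to be the ID of the merged result.
records = [
    {"id": "e1", "source": "red"},
    {"id": "e2", "source": "green"},
    {"id": "e3", "source": "blue"},
]
merged_id = "e1"

before = equivalents_before(records, {"green", "blue"})
now = equivalents_now(records)
```

Because the complete set is listed, the merged result’s own ID (`e1` here) appears among its equivalents.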
These changes provide a consistent, stable ID for an equivalence set of content, making it easier to work with Atlas’ data. If you have any comments or queries, please get in touch either in the comments below or on the Atlas mailing list.