Fred van den Driessche

atlas: changes to id resolution and output

One of Atlas’ strongest features is its ability to match and merge data from different sources about the same content into a single representation of that content. Since an ID is assigned to each source’s content data as it’s ingested by Atlas, each merged dataset actually has a set of connected IDs. We’re changing how Atlas resolves and displays the IDs of equivalent content so that the IDs for an equivalence set are consistent. Here’s how.

recap

This section is a brief recap of how content resolution and equivalence matching work in Atlas. If you know how it all works, feel free to skip to the changes in the sections below.

Atlas ingests data from multiple sources about the same content and matches them together transitively into equivalence sets. Each separate segment of data from a source is assigned an identifier. An example is shown, in graph form, in Fig. 1. Each node represents a piece of data with its identifier and the colour represents its source. The edges show how the data have been matched together into an equivalence set.
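Transitive matching can be pictured as finding the connected components of the match graph. The following is a minimal sketch of that idea, not Atlas’ actual implementation: the function name, the edges and the unrelated ID "zzz" are all made up for illustration, and only the four IDs mentioned in this post are borrowed from Fig. 2.

```python
from collections import defaultdict


def equivalence_sets(ids, matches):
    """Group IDs into equivalence sets, i.e. connected components of the match graph."""
    neighbours = defaultdict(set)
    for a, b in matches:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, sets = set(), []
    for start in ids:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        sets.append(component)
    return sets


# Hypothetical edges: cyp and cgfnw4 are never matched directly,
# but end up in the same set through cf2 and cmnc74.
ids = ["cf2", "cyp", "cmnc74", "cgfnw4", "zzz"]
matches = [("cyp", "cf2"), ("cf2", "cmnc74"), ("cgfnw4", "cmnc74")]
sets = equivalence_sets(ids, matches)
```

Here cyp and cgfnw4 share an equivalence set even though no edge joins them directly, which is what "transitively" means above.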

Figure 1: Graph representing an equivalence set of content in Atlas and how data from different sources are matched together.

Figure 2: Graph representing an equivalence set of content for an API key where not all sources are active. Data from active sources is linked through data from inactive sources.

Often the configuration of an API key means that it won’t have access to all sources. In this case Atlas will simply not show data from inaccessible sources in the final merged result. For example, in Fig. 2 the key only has access to the blue, green and yellow sources, but the links through cf2 are still used to merge the remaining five pieces together.

API keys can also be configured with a total order of precedence over their active sources. In this case the value for a field in the merged result is taken from the most precedent source which provides data for that field.
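As a rough illustration of precedence-based merging, here is a small sketch. The record layout, the field names and the "red" source are assumptions made for the example, as is treating None as "no data for that field"; Atlas’ real data model may well differ.

```python
def merge_fields(records, precedence):
    """Merge records field by field, preferring the most precedent source.

    `precedence` lists the key's active sources, most precedent first.
    Records from sources not in that list are ignored (inactive for this key).
    """
    rank = {source: i for i, source in enumerate(precedence)}
    active = [r for r in records if r["source"] in rank]

    merged = {}
    for record in sorted(active, key=lambda r: rank[r["source"]]):
        for field, value in record["fields"].items():
            if value is not None:
                # Keep the first value seen, i.e. the most precedent one.
                merged.setdefault(field, value)
    return merged


# Hypothetical data for one equivalence set.
records = [
    {"source": "blue", "fields": {"title": "Blue title", "description": "Blue description"}},
    {"source": "yellow", "fields": {"title": "Yellow title", "description": None}},
    {"source": "red", "fields": {"title": "Red title", "genre": "drama"}},
]
merged = merge_fields(records, ["yellow", "blue", "green"])
```

With yellow most precedent, the title comes from yellow, the description falls through to blue, and nothing from the inactive red source appears in the result.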

Now, on to the changes

1. earliest identifier wins

Previously, when requesting by ID, the merged result would have the ID of the data from the most precedent source active on the API key. Looking at Fig. 2, if the yellow source is most precedent, a request for cyp would return a result with ID cmnc74.

Now, the merged result in a response will have the ID of the first-seen data in the complete equivalence set, including data from sources which are not active. For example, the result in a response to a request for cyp will have ID cf2, even though cf2’s source is not active.
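The difference between the two rules can be sketched as below. This is only an illustration under stated assumptions: the source colours for cyp, cf2 and cgfnw4, the "red" and "grey" source names, and the first-seen ordering numbers are invented for the example (the post only tells us that cmnc74 is yellow and that cf2 is the first-seen ID).

```python
def old_result_id(equivalence_set, source_of, precedence):
    """Old rule: ID of the data from the most precedent active source."""
    rank = {source: i for i, source in enumerate(precedence)}
    # sorted() only makes tie-breaking deterministic in this sketch.
    active = [i for i in sorted(equivalence_set) if source_of[i] in rank]
    return min(active, key=lambda i: rank[source_of[i]])


def new_result_id(equivalence_set, first_seen):
    """New rule: ID of the first-seen data, whether its source is active or not."""
    return min(equivalence_set, key=lambda i: first_seen[i])


equivalence_set = {"cf2", "cyp", "cmnc74", "cgfnw4"}
# Assumed sources: "red" and "grey" stand in for sources inactive on this key.
source_of = {"cf2": "red", "cyp": "blue", "cmnc74": "yellow", "cgfnw4": "grey"}
# Assumed ingestion order: smaller means seen earlier.
first_seen = {"cf2": 1, "cyp": 2, "cmnc74": 3, "cgfnw4": 4}

before = old_result_id(equivalence_set, source_of, ["yellow", "blue", "green"])
after = new_result_id(equivalence_set, first_seen)
```

Note that the new rule takes no API key configuration as input at all, which is why every key sees the same ID for the same equivalence set.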

2. query on any identifier

Previously, if you queried Atlas with an ID from an inactive source you’d receive an empty response. Continuing with Fig. 2, requesting cgfnw4 or cf2 would result in nothing being returned.

Now, querying with an ID from an inactive source will return the active data for that equivalence set. So a query for cf2, from an inactive source, will behave in the same way as a query for an ID from an active source, such as cmnc74.
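In lookup terms, the new behaviour amounts to finding the equivalence set first and filtering by active sources afterwards. A minimal sketch, assuming a hypothetical in-memory list of equivalence sets and invented source assignments for every ID except cmnc74:

```python
def resolve(requested_id, equivalence_sets, source_of, active_sources):
    """Return the IDs of active data equivalent to `requested_id`.

    The requested ID itself may belong to an inactive source; it is only
    used to locate the equivalence set.
    """
    for equivalence_set in equivalence_sets:
        if requested_id in equivalence_set:
            return sorted(i for i in equivalence_set if source_of[i] in active_sources)
    return []  # unknown ID


# Hypothetical data: "red" and "grey" are inactive for this key.
all_sets = [{"cf2", "cyp", "cmnc74", "cgfnw4"}]
source_of = {"cf2": "red", "cyp": "blue", "cmnc74": "yellow", "cgfnw4": "grey"}
active_sources = {"blue", "green", "yellow"}

via_inactive = resolve("cf2", all_sets, source_of, active_sources)
via_active = resolve("cmnc74", all_sets, source_of, active_sources)
```

Both calls locate the same set and so yield the same active data, regardless of which ID was used to ask.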

3. complete equivalence sets

Previously, the equivalence sets represented in the same_as and equivalents fields of the output contained only references to data from active sources for that API key. So, with the IDs in Fig. 2, the result for cmnc74 would contain only IDs from the green and blue sources.

Now, the same_as and equivalents fields are populated with all IDs from the complete equivalence set, so all 10 IDs in the case of Fig. 2. N.B. Since equivalence is reflexive, the ID for the merged result will be an element in these fields: cf2 is equivalent to cf2.
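To make the reflexivity point concrete, here is a tiny sketch of how the ID and same_as parts of a result relate under the new rules. The dictionary shape is an assumption for illustration, not Atlas’ actual output format, and only four of the ten IDs from Fig. 2 are used.

```python
def identity_fields(result_id, equivalence_set):
    """Build the ID-related part of a merged result (hypothetical shape)."""
    if result_id not in equivalence_set:
        raise ValueError("result ID must belong to its own equivalence set")
    # same_as lists the complete set, including the result's own ID.
    return {"id": result_id, "same_as": sorted(equivalence_set)}


fields = identity_fields("cf2", {"cf2", "cyp", "cmnc74", "cgfnw4"})
```

The result’s own ID, cf2, appears in its own same_as list alongside IDs from sources the API key cannot otherwise see.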

questions?

These changes provide a consistent, stable ID for an equivalence set of content, making it easier to work with Atlas’ data. If you have any comments or queries, please get in touch either in the comments below or on the Atlas mailing list.
