Data flows through three layers on its way to becoming front-end pages: tabular, linked data as statements, and linked data as documents.
The first of these data layers is simply the data as we get it from a source system. In the case of museum collections data, it usually arrives at the pipeline as CSV exports or via a JSON API. For archival materials we ingest EAD3/XML documents, and for library materials we ingest MARCXML.
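As a rough sketch, a tabular ingest might read one of those CSV exports into plain records before any linked data work begins. The file name and helper below are hypothetical, not the pipeline's actual code:

```python
import csv
from pathlib import Path

def read_tabular_source(path: Path) -> list[dict]:
    """Read a CSV export from a source system into plain dict records."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Hypothetical file name; real exports land in the pipeline's data/ directory.
records = read_tabular_source(Path("data/museum_objects.csv"))
```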
There are currently no "fetchable" data sources (APIs that scripts can access directly). If any were added, each would get its own pipeline command, and the fetched contents would be deposited into the data/ directory of the data pipeline and treated from that point forward like any other data source.
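Were such sources added, a fetch step might look something like this sketch. The endpoint URL, file path, and helper name are all hypothetical; the only point illustrated is that fetched content lands in data/ and is then handled like any other source:

```python
import json
import urllib.request
from pathlib import Path

def fetch_source(url: str, dest: Path) -> None:
    """Fetch a JSON API response and deposit it into the data/ directory."""
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(json.dumps(payload, indent=2), encoding="utf-8")

# Hypothetical endpoint and path; from here the file would be treated
# like any other data source.
fetch_source("https://example.org/api/objects", Path("data/objects.json"))
```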
In some cases source data needs special handling, such as branch selection of archival hierarchies or escaping embedded HTML. When this is needed, the pipeline generates an intermediate representation that is then passed to the linked data transforms.
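As one concrete example of that special handling, escaping embedded HTML before a record reaches the linked data transforms could look like this minimal sketch, assuming records are plain dicts:

```python
import html

def escape_embedded_html(record: dict) -> dict:
    """Escape embedded HTML in string-valued fields, leaving other values alone."""
    return {
        key: html.escape(value) if isinstance(value, str) else value
        for key, value in record.items()
    }

record = {"title": "A <b>bold</b> label", "accession_no": "1999.32.1"}
clean = escape_embedded_html(record)
# {'title': 'A &lt;b&gt;bold&lt;/b&gt; label', 'accession_no': '1999.32.1'}
```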
Every data source is passed through a linked data transform.
At the end of the transformation process, linked data is loaded into the triplestore. More details on transformations can be found in the collections app repository.
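The real transforms live in the collections app repository; purely to illustrate the shape of this step, here is a minimal sketch that maps one record to RDF triples with rdflib, using a made-up base URI and vocabulary:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical base URI and class; the actual transforms use their own model.
BASE = Namespace("https://example.org/object/")

def record_to_graph(record: dict) -> Graph:
    """Transform one source record into a small RDF graph."""
    g = Graph()
    subject = BASE[record["accession_no"]]
    g.add((subject, RDF.type, BASE["Object"]))
    g.add((subject, RDFS.label, Literal(record["title"])))
    return g

g = record_to_graph({"accession_no": "1999.32.1", "title": "Vase"})
print(g.serialize(format="turtle"))  # output like this is loaded into the triplestore
```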
At this stage we begin enhancing the graph we've loaded into the database.
Then we take the graph full of triples, query it with SPARQL CONSTRUCT queries for entities and their underlying properties, and frame the results into JSON-LD documents that present each object with its identifiers, descriptive cataloguing, participants, etc.
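A minimal sketch of the CONSTRUCT-then-frame step, using rdflib for the query and pyld for framing. The tiny stand-in graph, the query, the class URI, and the frame are all assumptions, not the pipeline's actual ones:

```python
import json
from pyld import jsonld
from rdflib import Graph

# A tiny stand-in for the enriched graph built in the previous stages.
g = Graph()
g.parse(data="""
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<https://example.org/object/1> a <https://example.org/object/Object> ;
    rdfs:label "Vase" .
""", format="turtle")

# CONSTRUCT pulls out each entity and its underlying properties.
construct = """
CONSTRUCT { ?s ?p ?o }
WHERE { ?s a <https://example.org/object/Object> ; ?p ?o }
"""
entity_graph = g.query(construct).graph

# Framing turns the flat triples into a rooted JSON-LD document per object.
expanded = json.loads(entity_graph.serialize(format="json-ld"))
frame = {"@type": "https://example.org/object/Object"}  # hypothetical frame
framed = jsonld.frame(expanded, frame)
print(json.dumps(framed, indent=2))
```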
At this stage we also create a data release with a versioned tag that indicates the date it was produced.
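A date-versioned release of that kind might be produced like this; the tag format and directory layout are assumptions:

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical tag format and layout; the real release naming may differ.
tag = f"release-{date.today().isoformat()}"   # e.g. "release-2025-01-15"
out_dir = Path("releases") / tag
out_dir.mkdir(parents=True, exist_ok=True)

documents = {"@graph": []}  # the framed JSON-LD documents from the previous step
out_dir.joinpath("objects.jsonld").write_text(
    json.dumps(documents, indent=2), encoding="utf-8"
)
```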
Finally, we take the JSON-LD documents and produce simplified versions that are used by the page builder to produce HTML documents for the site, and by the search index for the browse and search pages.
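A simplification step of that kind might look like the sketch below; the output field names are hypothetical:

```python
def simplify(doc: dict) -> dict:
    """Flatten a framed JSON-LD document into the plain shape consumed by the
    page builder and the search index. Field names here are hypothetical."""
    return {
        "id": doc.get("@id"),
        "type": doc.get("@type"),
        "title": doc.get("label"),
    }

simple = simplify({"@id": "https://example.org/object/1",
                   "@type": "Object", "label": "Vase"})
# -> {'id': 'https://example.org/object/1', 'type': 'Object', 'title': 'Vase'}
```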