Common language: integration and reconciliation

Overview

Reconciliation is the process of linking instances in a dataset to entries in a common vocabulary (essentially a dictionary of terms, people, things, places, or concepts). By making all references to "Georgia O'Keeffe" point to the same person record, we ensure that data expressing her roles in an object's history can be accessed along with other relevant data about her. Likewise, reconciling object classifications lets us move efficiently across art and archival systems.

Vocabularies used

The O'Keeffe Museum Collections Site uses four primary vocabularies to unify handling of entities:

  • AAT – Getty Art & Architecture Thesaurus
  • ULAN – Getty Union List of Artist Names
  • LCNAF – Library of Congress Name Authority File
  • Wikidata

Basic identifier patterns

URIs from the AAT are used directly in entity classifications (such as classifying objects as 'paintings' or identifiers as 'primary') and in role technique classifications, e.g.:

{
    "id": "http://data.okeeffemuseum.org/object/998",
    "classified_as": [
        {
            "id": "aat:300133025",
            "label": "works of art",
            "type": "Type"
        },
        {
            "id": "aat:300033618",
            "label": "paintings",
            "type": "Type"
        }
    ]
}
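As an illustration of consuming this pattern, the sketch below expands the compact aat: identifiers into full Getty URIs and lists each classification. The mapping of the aat: prefix to http://vocab.getty.edu/aat/ is an assumption about the site's JSON-LD context, and the helper names expand_aat and classifications are hypothetical.

```python
import json

# Assumed expansion for the compact "aat:" prefix; the site's actual
# JSON-LD context is not shown in this document.
AAT_BASE = "http://vocab.getty.edu/aat/"


def expand_aat(compact_id: str) -> str:
    """Expand 'aat:300033618' to a full URI; leave other ids unchanged."""
    prefix = "aat:"
    if compact_id.startswith(prefix):
        return AAT_BASE + compact_id[len(prefix):]
    return compact_id


def classifications(record: dict) -> list[tuple[str, str]]:
    """Return (full URI, label) pairs for every entry in classified_as."""
    return [
        (expand_aat(entry["id"]), entry.get("label", ""))
        for entry in record.get("classified_as", [])
    ]


# The object record from the example above.
record = json.loads("""
{
    "id": "http://data.okeeffemuseum.org/object/998",
    "classified_as": [
        {"id": "aat:300133025", "label": "works of art", "type": "Type"},
        {"id": "aat:300033618", "label": "paintings", "type": "Type"}
    ]
}
""")

for uri, label in classifications(record):
    print(uri, label)
```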

People and organizations are connected to their vocabulary terms using the same exact_match pattern (the SKOS property skos:exactMatch) that linked.art uses:

{
    "id": "http://data.okeeffemuseum.org/person/1459",
    "exact_match": [ "http://id.loc.gov/authorities/names/n82220933" ]
}
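A consumer of these records may want to know which vocabulary each exact_match URI belongs to. The sketch below groups the URIs by namespace prefix. The prefixes are the public namespaces of each vocabulary; whether the site distinguishes them this way is an assumption, and matches_by_vocabulary is a hypothetical helper.

```python
# Public URI namespaces for the person and organization vocabularies.
# How the site itself tells them apart is an assumption.
VOCAB_PREFIXES = {
    "LCNAF": "http://id.loc.gov/authorities/names/",
    "ULAN": "http://vocab.getty.edu/ulan/",
    "Wikidata": "http://www.wikidata.org/entity/",
}


def matches_by_vocabulary(record: dict) -> dict[str, list[str]]:
    """Group a record's exact_match URIs by vocabulary, keeping local ids."""
    grouped: dict[str, list[str]] = {}
    for uri in record.get("exact_match", []):
        for vocab, prefix in VOCAB_PREFIXES.items():
            if uri.startswith(prefix):
                grouped.setdefault(vocab, []).append(uri[len(prefix):])
                break
        else:
            # No known namespace matched; keep the full URI for review.
            grouped.setdefault("other", []).append(uri)
    return grouped


# The person record from the example above.
person = {
    "id": "http://data.okeeffemuseum.org/person/1459",
    "exact_match": ["http://id.loc.gov/authorities/names/n82220933"],
}
print(matches_by_vocabulary(person))
```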

AAT labels

Preferred AAT labels are re-fetched from the Getty Vocabulary Program (GVP) on each data refresh and used throughout the application. They are retrieved with this SPARQL query:

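# Prefixes are declared explicitly so the query does not rely on endpoint defaults.
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX xl:  <http://www.w3.org/2008/05/skos-xl#>

SELECT ?entity_uri ?pref_label ?label WHERE {
  # Every concept and its GVP-preferred label node
  ?entity_uri a gvp:Concept ;
    gvp:prefLabelGVP ?pref_label .

  ?pref_label a xl:Label ;
    gvp:term ?label_with_lang .

  # Keep untagged and English terms, then drop the language tag
  FILTER(LANG(?label_with_lang) = "" || LANGMATCHES(LANG(?label_with_lang), "en"))
  BIND(STR(?label_with_lang) AS ?label)
}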

The results are cached for use when the endpoint is unreachable. In cases where the GVP expresses multiple preferred labels for a concept, we use the shortest one.
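The label selection and cache fallback described here can be sketched as follows. This is a minimal illustration, not the application's code: the function names, the JSON cache file, and the alphabetical tie-break between equally short labels are all assumptions.

```python
import json
from pathlib import Path


def shortest_pref_labels(rows):
    """Map each entity URI to its shortest label.

    `rows` is an iterable of (entity_uri, label) pairs, for example the
    ?entity_uri and ?label columns of the query results.
    """
    best = {}
    for uri, label in rows:
        current = best.get(uri)
        # Ties on length fall back to alphabetical order (an assumption,
        # chosen only to make the result deterministic).
        if current is None or (len(label), label) < (len(current), current):
            best[uri] = label
    return best


def load_labels(fetch, cache_path: Path):
    """Fetch fresh labels, falling back to the cached copy on failure.

    `fetch` is a hypothetical callable that runs the query and returns rows.
    """
    try:
        labels = shortest_pref_labels(fetch())
    except OSError:
        # Network failures such as urllib.error.URLError subclass OSError.
        # If no cache file exists yet, read_text raises FileNotFoundError.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    cache_path.write_text(json.dumps(labels), encoding="utf-8")
    return labels
```

A caller would pass a function that queries the endpoint as `fetch`, so the cache is refreshed on every successful run and read only when the request fails.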

People and organization names

Person vocabularies tend to be opinionated and highly specialized, with editorial standards varying widely for things like name shortening, birth names vs. given names, and name language or kind. As a result, the application's "preferred" labels come from a report spreadsheet maintained by the website managers.

To generate the reconciliation candidates report, we use Wikidata to walk between possible names in ULAN, LCNAF, and Wikidata. Along with relevant metadata, the report includes candidate biographies and Wikipedia links (in case that data needs to be added to source systems).
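One way such a walk could be expressed is a query against the Wikidata Query Service that starts from an LCNAF identifier and looks up the matching item, its ULAN identifier, and its English Wikipedia article. The sketch below only builds the query text. It relies on the Wikidata properties P244 (Library of Congress authority ID) and P245 (ULAN ID) and on the prefixes predefined by the Wikidata Query Service. The function name and the choice to start from LCNAF are assumptions, and the actual report may be generated differently.

```python
def build_crosswalk_query(lcnaf_id: str) -> str:
    """Build a Wikidata SPARQL query that walks from an LCNAF id to ULAN.

    Assumes the wdt:, schema:, wikibase:, and bd: prefixes predefined by
    the Wikidata Query Service.
    """
    # Reject anything that is not a plain alphanumeric identifier so the
    # value can be embedded safely in the query string.
    if not lcnaf_id.isalnum():
        raise ValueError(f"unexpected LCNAF id: {lcnaf_id!r}")
    return f"""
SELECT ?item ?itemLabel ?ulan ?article WHERE {{
  ?item wdt:P244 "{lcnaf_id}" .            # Library of Congress authority ID
  OPTIONAL {{ ?item wdt:P245 ?ulan . }}     # ULAN ID, if recorded
  OPTIONAL {{
    ?article schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


print(build_crosswalk_query("n82220933"))
```

Each result row then carries the candidate Wikidata item, its label, and any ULAN identifier or Wikipedia link, which is the kind of data the report needs for manual review.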

References