Design Patterns: The Triple Provenance Pattern

April 17, 2020 Data & AI, MarkLogic

MarkLogic design patterns are reusable solutions for many of the commonly occurring problems encountered when designing MarkLogic applications. These patterns may be unique to applications on MarkLogic or may be industry patterns that have MarkLogic-specific considerations. Unlike recipes, MarkLogic design patterns are generally more abstract and applicable in multiple scenarios.

Triple Provenance with Document Annotations Design Pattern

Intent

Semantics applications often need to capture provenance information at the triple level.

Using the Envelope Pattern, annotate JSON/XML serialization of triples with provenance details.

Motivation

When building applications that leverage data from disparate sources, especially in a semantics context, it is common to want to capture provenance information, such as source and last updated time. With RDF alone, reification (that is, statements about statements; see Reification on the Semantic Web) can be used, but it significantly increases the number of triples needed and can greatly complicate SPARQL queries.
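To make the cost of reification concrete, here is a minimal sketch in plain JavaScript (no MarkLogic APIs) that lists the triples needed to attach two provenance details to a single statement. The `reify` helper, the statement IRI, and the `http://example.org/prov/` predicates are assumptions made for illustration; only the `rdf:type`, `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms come from the standard RDF vocabulary.

```javascript
// Namespace of the standard RDF reification vocabulary.
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

// Plain [subject, predicate, object] arrays stand in for triples here.
// statementIri and the provenance predicate prefix are hypothetical.
function reify(statementIri, [subject, predicate, object], provenance) {
  const triples = [
    [statementIri, RDF + "type", RDF + "Statement"],
    [statementIri, RDF + "subject", subject],
    [statementIri, RDF + "predicate", predicate],
    [statementIri, RDF + "object", object],
  ];
  // Each provenance detail becomes one more triple about the statement.
  for (const [key, value] of Object.entries(provenance)) {
    triples.push([statementIri, "http://example.org/prov/" + key, value]);
  }
  return triples;
}

const reified = reify(
  "http://example.org/statements/1",
  ["http://example.org/subjects/ABC", "http://example.org/predicates/systemOwner", "Joe Smith"],
  { source: "DB1", updateTime: "2017-05-17T14:17:38.786Z" }
);
console.log(reified.length); // 6 triples to describe and annotate one statement
```

Every query that needs the source of a statement would then have to join across these extra triples, which is the complexity this pattern tries to avoid.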

A solution that can provide provenance details for a triple without the added complexity of reification would be ideal. Fortunately, triples stored on documents in MarkLogic can take advantage of their serialization as JSON and XML to provide an additional level of context. This is achieved through additional metadata on the document, specifically on the triple objects.

Applicability

This pattern is applicable when you need to capture triple-level provenance details. It requires that triples be persisted on documents rather than as MarkLogic Managed Triples. It is suitable for cases where the provenance details do not need to be returned directly as part of a SPARQL query, and it is acceptable to retrieve them from the document instead.

Participants

The participants involved in implementing this pattern are as follows:

  • Update code for annotating triples
  • Retrieval code for getting provenance details for a triple

Examples of each can be found under Sample Code below.

Collaborations

The retrieval code must be aware of how the update code has persisted the provenance details.

Consequences

This pattern enables the persistence of provenance details for a given triple by storing annotations on triples serialized in JSON or XML. Retrieval of provenance details is facilitated through use of JavaScript or XQuery to path into documents, identify the matching target triple, and return the annotations.

To take advantage of this pattern, you cannot use Managed Triples, and you need to add provenance annotations during document insertion or update, or prior to ingestion. You must also be able to identify the document where the triple resides.

This can be achieved through use of an identity triple that links the subject IRI to the URI of the document:

const subject = sem.iri("http://marklogic.com/resources/myEntity");
const uri = sem.iri("/content/myEntity.json");
sem.triple(subject,sem.iri("http://www.w3.org/2000/01/rdf-schema#isDefinedBy"), uri);
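Retrieval code can follow that identity triple to find the document before pathing into it. The sketch below, in plain JavaScript, only builds the SPARQL text for that lookup. `buildDocumentLookupQuery` is a hypothetical helper, and it assumes the resulting query would then be run inside MarkLogic (for example with `sem.sparql`), which is not shown here.

```javascript
const RDFS_IS_DEFINED_BY = "http://www.w3.org/2000/01/rdf-schema#isDefinedBy";

// Build a SPARQL query that returns the document URI linked to a subject IRI.
// Hypothetical helper: the name and the ?uri variable are illustrative choices.
function buildDocumentLookupQuery(subjectIri) {
  // Rough check: reject characters that may not appear inside <...> in SPARQL,
  // so a caller-supplied value cannot break out of the IRI reference.
  if (/[<>"{}|^`\\\s]/.test(subjectIri)) {
    throw new Error(`Not a usable IRI: ${subjectIri}`);
  }
  return `SELECT ?uri WHERE { <${subjectIri}> <${RDFS_IS_DEFINED_BY}> ?uri }`;
}

const lookupQuery = buildDocumentLookupQuery("http://marklogic.com/resources/myEntity");
console.log(lookupQuery);
```

The value bound to `?uri` would be the IRI created from the document URI, so its string form can be handed to the retrieval code as the document to open.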

A trade-off of using this pattern is that you cannot use pure SPARQL to get to the provenance details.

Implementation

If you are implementing this pattern, it is important that there is a consistent process for adding and retrieving provenance details.

If your application uses Template Driven Extraction (TDE), you can wrap elements/properties you would like to annotate with provenance details like this:

{
    "metadata": [
        {
            "propertyWrapper": {
                "systemOwner": "Joe Smith",
                "source": "DB1",
                "updateTime": "2017-05-17T14:17:38.786Z"
            }
        },
        {
            "propertyWrapper": {
                "id": "ABC",
                "source": "DB1",
                "updateTime": "2017-05-17T14:17:38.786Z"
            }
        }
    ]
}
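As a rough illustration of the update side, the following plain JavaScript sketch produces that wrapped shape from a flat record before ingestion. `wrapWithProvenance` is a hypothetical helper name, and it assumes every property in the record shares the same source and update time.

```javascript
// Wrap each property of a flat record in a "propertyWrapper" object that also
// carries provenance fields. The output shape follows the sample document above.
function wrapWithProvenance(record, source, updateTime) {
  return {
    metadata: Object.entries(record).map(([name, value]) => ({
      propertyWrapper: { [name]: value, source, updateTime },
    })),
  };
}

const wrapped = wrapWithProvenance(
  { systemOwner: "Joe Smith", id: "ABC" },
  "DB1",
  "2017-05-17T14:17:38.786Z"
);
console.log(JSON.stringify(wrapped, null, 4));
```

Note that a record property named `source` or `updateTime` would be overwritten by the provenance fields in this sketch, so a real implementation would need to reserve or namespace those names.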

Here’s a sample TDE template:

{
    "template": {
        "context": "/metadata",
        "vars": [
            {
                "name": "prefix-subjects",
                "val": "'http://example.org/subjects'"
            },
            {
                "name": "prefix-predicates",
                "val": "'http://example.org/predicates'"
            },
            {
                "name": "doc-id",
                "val": "sem:iri($prefix-subjects || '/' || fn:root(.)/metadata/propertyWrapper/id/fn:string())"
            }
        ],
        "triples": [
            {
                "subject": { "val": "$doc-id" },
                "predicate": { "val": "sem:iri($prefix-predicates || '/id')" },
                "object": { "val": "propertyWrapper/id/fn:string()", "invalidValues": "ignore" }
            },
            {
                "subject": { "val": "$doc-id" },
                "predicate": { "val": "sem:iri($prefix-predicates || '/systemOwner')" },
                "object": { "val": "propertyWrapper/systemOwner/fn:string()", "invalidValues": "ignore" }
            }
        ]
    }
}

Sample Code

On XML documents this can be most easily achieved by using attributes on the triples:

declare namespace prov = "http://marklogic.com/designPatterns/prov";
declare function prov:add-provenance($triple as sem:triple, $source as xs:string, $timestamp as xs:dateTime) {
  element sem:triple {
    attribute source {$source},
    attribute updatedTime {$timestamp},
    document {  $triple  }/element()/node()
  }
};
let $triple := sem:triple(sem:iri("myIri"),sem:iri("myProperty"),"a value")
return
prov:add-provenance($triple,"DB", fn:current-dateTime())

Here is the same approach in JSON; instead of using attributes, we add properties to the triple object:

function addProvenance(triple, source, timestamp) {
 const t = xdmp.toJSON(triple).toObject();
 t.triple.source = source;
 t.triple.timestamp = timestamp;
 return t;
}
const t = sem.triple(sem.iri("JSON Annotation"), sem.iri("testProp"), "value");
addProvenance(t, "myDB", new Date().toJSON());
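The retrieval example in this article paths into `/triples/triple`, which implies that annotated triples sit in a top-level `triples` array on the document. The plain JavaScript sketch below (no MarkLogic APIs) shows one way such a document could be assembled. The `annotate` and `buildDocument` helpers, the sample IRIs, and the exact serialized triple shape are assumptions for illustration.

```javascript
// Assumed shape of one serialized triple: { triple: { subject, predicate, object } },
// which is the object the provenance properties are added to.
function annotate(serializedTriple, source, timestamp) {
  // Copy instead of mutating, so the caller's object is left untouched.
  return { triple: { ...serializedTriple.triple, source, timestamp } };
}

// Envelope-style document: the original content plus a "triples" array.
// The "triples" property name is an assumption taken from the retrieval path.
function buildDocument(content, annotatedTriples) {
  return { ...content, triples: annotatedTriples };
}

const plainTriple = {
  triple: {
    subject: "http://marklogic.com/resources/myEntity",
    predicate: "http://example.org/predicates/systemOwner",
    object: { datatype: "http://www.w3.org/2001/XMLSchema#string", value: "Joe Smith" },
  },
};

const envelope = buildDocument(
  { systemOwner: "Joe Smith" },
  [annotate(plainTriple, "myDB", new Date().toJSON())]
);
console.log(JSON.stringify(envelope, null, 2));
```

Keeping the annotations inside the `triple` object means the subject, predicate and object are still in their usual place, while the extra properties travel with the triple they describe.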

Here is an example of how you might retrieve the provenance details:

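function getProvenance(uri, predicate, value) {
 // predicate and value are interpolated into the XPath, so they must not contain single quotes
 const triple = fn.head(cts.doc(uri).xpath(`/triples/triple[predicate = '${predicate}' and object/value eq '${value}']`));
 // No triple on this document matches the predicate and value
 if (!triple) return null;
 const result = {};
 result.source = triple.source;
 result.timestamp = triple.timestamp;
 return result;
}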
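The same lookup can be expressed against an already parsed document without building an XPath string, which sidesteps quoting problems in the predicate or value. This is a plain JavaScript sketch (no MarkLogic APIs); `findProvenance` and the sample document are hypothetical and assume the `triples` array layout implied by the path above.

```javascript
// Find the first triple with the given predicate and object value,
// and return its provenance annotations (or null if nothing matches).
function findProvenance(doc, predicate, value) {
  const match = (doc.triples || [])
    .map((entry) => entry && entry.triple)
    .filter(Boolean)
    .find((t) => {
      // Assumption: literals are stored as { value, datatype }, IRIs as plain strings.
      const isObject = typeof t.object === "object" && t.object !== null;
      const objectValue = isObject ? t.object.value : t.object;
      return t.predicate === predicate && objectValue === value;
    });
  return match ? { source: match.source, timestamp: match.timestamp } : null;
}

const sampleDoc = {
  triples: [
    {
      triple: {
        subject: "http://example.org/subjects/ABC",
        predicate: "http://example.org/predicates/systemOwner",
        object: { datatype: "http://www.w3.org/2001/XMLSchema#string", value: "Joe Smith" },
        source: "DB1",
        timestamp: "2017-05-17T14:17:38.786Z",
      },
    },
  ],
};

console.log(findProvenance(sampleDoc, "http://example.org/predicates/systemOwner", "Joe Smith"));
```

Returning `null` for a missing triple lets the caller distinguish "no such triple" from "triple found but not annotated", where the result object would have undefined fields.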

Related Patterns

Envelope Pattern

Conclusion

With triples alone, it can be challenging to capture provenance details without introducing complexity that negatively impacts the usability of your triples and query performance. Through use of MarkLogic’s multi-model support, we are able to take advantage of embedding triples on documents with annotations that provide additional context and can be retrieved easily using a small amount of JavaScript or XQuery code.

For more information on additional ways to take advantage of embedding triples on documents, see the Semantics guide and the chapter on Unmanaged Triples.

Tom Ternquist