Managed vs Unmanaged Triples

February 18, 2016 Data & AI, MarkLogic

Managed triples are triples whose storage MarkLogic takes care of automatically: you hand it the triples, and MarkLogic wraps them in documents and inserts those into the database for you. Unmanaged triples, on the other hand, are triples you embed in documents that you insert into the database yourself. But when does it make sense to use unmanaged triples?

Triple Store vs Document Store

To put it simply:

  • Managed triples are for using MarkLogic as a triple store: you load large numbers of triples, and have MarkLogic figure out how to store them.
  • Unmanaged triples are for using MarkLogic as a document store, while embedding triples in those documents.

That doesn’t really tell the full story though. To start, let’s dive into what a triple index really is.

Triple Index

MarkLogic introduced its triple index and SPARQL support with the release of version 7. Currently, SPARQL 1.1 is supported (pretty much entirely), which includes SPARQL Update. SPARQL code is primarily evaluated against this triple index. The index is not very different from the other indexes: it looks for certain constructs in documents, and puts those in an index. There are no additional settings to configure; you simply enable the triple index or leave it off.
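For instance, here is a minimal sketch of enabling the triple index programmatically with the Admin API (you can also flip the setting in the Admin UI; the database name "Documents" is just an example):

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Enable the triple index on the Documents database :)
let $config := admin:get-configuration()
let $config := admin:database-set-triple-index($config, xdmp:database("Documents"), fn:true())
return admin:save-configuration($config)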

Once enabled, it will look within any fragment for anything that matches a triple construct. The indexer currently supports triple constructs in XML format and JSON format. Below you will find an example of each:

Triple expressed as XML:

<sem:triple
xmlns:sem="http://marklogic.com/semantics">
  <sem:subject>subject</sem:subject>
  <sem:predicate>predicate</sem:predicate>
  <sem:object>object</sem:object>
</sem:triple>

Triple expressed as JSON:

{
  "triple": {
    "subject": {
      "value": "subject"
    },
    "predicate": {
      "value": "predicate"
    },
    "object": {
      "value": "object"
    }
  }
}

Loading Triples

You can directly load triples stored in most of the common RDF serializations. MarkLogic can parse the following RDF formats out-of-the-box:

  • RDF/XML
  • Turtle
  • RDF/JSON
  • N3
  • N-Triples
  • N-Quads
  • TriG

All these get parsed into internal sem:triple objects, which can be persisted in a MarkLogic database in either XML or JSON.
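For example, sem:rdf-parse turns serialized RDF into in-memory sem:triple values, and sem:rdf-insert persists them. A minimal sketch (the Turtle snippet is made up for illustration):

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Parse a small Turtle snippet into in-memory sem:triple values :)
let $turtle := '
  @prefix dc: <http://purl.org/dc/elements/1.1/> .
  <http://example/book1/> dc:title "My favorite book" .
'
let $triples := sem:rdf-parse($turtle, "turtle")
(: Persist them as managed triples; returns the URIs of the created documents :)
return sem:rdf-insert($triples)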

The support for all those formats opens a multitude of ways to get hold of RDF data. There are a lot of Linked (Open) Data sources on the web (think of DBPedia, Geonames, and Open Calais), but also governmental sites like data.gov.uk, and many more datasets. Most of these datasets have exports available for download, but some also allow running ad hoc SPARQL queries against them to retrieve specific data.

There are many other tools you can use to enrich your data and return information as RDF. Semantic tools are particularly useful for this. For example, the Open Calais API offers a semantic enrichment service with a public endpoint that can be used for free (with some limitations). You can post any piece of text to it, and in return you get RDF/XML describing the semantic enrichments found by Open Calais.
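As a rough sketch of what such a call could look like from XQuery (the endpoint URL, header names, and API key below are placeholders, not the real Open Calais values; check the Open Calais documentation for those), you can use xdmp:http-post and feed the returned RDF/XML straight into sem:rdf-parse:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Placeholder endpoint and headers; consult the Open Calais docs for the real values :)
let $response := xdmp:http-post(
  "https://api.opencalais.example/enrich",
  <options xmlns="xdmp:http">
    <headers>
      <x-ag-access-token>YOUR-API-KEY</x-ag-access-token>
      <content-type>text/raw</content-type>
      <outputformat>xml/rdf</outputformat>
    </headers>
  </options>,
  text { "George Clooney produced the film Argo in 2012." })
(: The first item is the response metadata, the second is the body; parse that into triples :)
return sem:rdf-parse($response[2], "rdfxml")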

Managed Triples

With managed triples, you load triples yourself but leave it to MarkLogic to wrap them in documents. MarkLogic inserts XML documents into the target database, where each document contains about 100 triples. You can load triples with the function sem:rdf-load, for instance like this:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" 
  at "/MarkLogic/semantics.xqy";
sem:rdf-load("http://dbpedia.org/data/George_Clooney.n3")

If you look into the database after running the above code, you will find XML documents with database URIs starting with /triplestore/, and a collection of “http://marklogic.com/semantics#default-graph”. The content looks something like:

<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <sem:triple>
    <sem:subject>http://dbpedia.org/resource/Argo_(2012_film)</sem:subject>
    <sem:predicate>http://dbpedia.org/ontology/producer</sem:predicate>
    <sem:object>http://dbpedia.org/resource/George_Clooney</sem:object>
  </sem:triple>
  <sem:triple>
    <sem:subject>http://dbpedia.org/resource/Argo_(2012_film)</sem:subject>
    <sem:predicate>http://dbpedia.org/property/producer</sem:predicate>
    <sem:object>http://dbpedia.org/resource/George_Clooney</sem:object>
  </sem:triple>
  ...
</sem:triples>

You can also use SPARQL Update (as of MarkLogic 8):

PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA {
  <http://example/book1/> dc:title "My favorite book" 
}

This will result in an XML document being written to the target database that is similar to the previous example.
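From XQuery, the same update can be issued with the sem:sparql-update function (also available as of MarkLogic 8); a minimal sketch:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Run the same INSERT DATA statement through the SPARQL Update API :)
sem:sparql-update('
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  INSERT DATA {
    <http://example/book1/> dc:title "My favorite book"
  }
')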

Unmanaged Triples

Working with unmanaged triples is very simple: if you insert a triple into any XML or JSON document, it will get indexed. Even XML triples inside document properties will get indexed. Remember, a triple embedded in a document or property is an unmanaged triple, as it is not managed by MarkLogic automatically.

It doesn’t matter what kind of document or property you insert them into. It could be a large book file, with triple data embedded inline or at the end. It could be a record-style document produced by loading delimited text with MLCP, with some triple data added to it. It could be a small document property containing just one triple, or a large one containing many triples. It makes no difference to the triple index.
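For example, here is a minimal sketch of embedding a triple in an ordinary XML document (the document URI and content are made up); that alone is enough to make it show up in the triple index:

xquery version "1.0-ml";
(: Insert an ordinary document with one embedded (unmanaged) triple :)
xdmp:document-insert(
  "/articles/argo.xml",
  <article>
    <title>Argo</title>
    <sem:triple xmlns:sem="http://marklogic.com/semantics">
      <sem:subject>http://dbpedia.org/resource/Argo_(2012_film)</sem:subject>
      <sem:predicate>http://dbpedia.org/ontology/producer</sem:predicate>
      <sem:object>http://dbpedia.org/resource/George_Clooney</sem:object>
    </sem:triple>
  </article>)
;
xquery version "1.0-ml";
(: In a follow-up statement, the triple is visible in the triple index :)
cts:triples(sem:iri("http://dbpedia.org/resource/Argo_(2012_film)"), (), ())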

Document collections are what’s important here. MarkLogic collections are used to represent the notion of graphs in SPARQL. Graphs are useful in addressing subsets of triples. For instance, you could use graphs to distinguish triples from different sources, or triples about different topics, or triples with different quality measures. These are just a few of the many ways in which you could use graphs. Document collections have the very same purpose, but for documents. Since all triples are persisted in documents in the database, using document collections for graphs makes a lot of sense.
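As an illustration (a sketch; the graph name is made up), if documents with embedded triples are placed in a collection, that collection can be addressed as a named graph in SPARQL:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Only consider triples whose documents are in the collection/graph below :)
sem:sparql('
  SELECT ?film
  WHERE {
    GRAPH <http://example.org/graphs/dbpedia> {
      ?film <http://dbpedia.org/ontology/producer> <http://dbpedia.org/resource/George_Clooney>
    }
  }
')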

If you don’t use graphs in your SPARQL queries, however, then you don’t need to worry about document collections (nor graphs); MarkLogic will simply evaluate against all triples by default, managed or unmanaged, and in any graph or collection.

Embedded Triples

Now that we have reviewed what managed and unmanaged triples are, we should also define embedded triples: triples embedded in documents that don’t have sem:triples as their root element. It is possible to manually insert documents with sem:triples as root, for instance as part of migrating triple data. However, we recommend against constructing such documents yourself, or touching those that have been created automatically. Using the semantic functions to create triple XML is less error-prone.

Additionally, MarkLogic treats any document with sem:triples as root as if it contains managed triples. This matters in particular for SPARQL Update, which only affects managed triples. Any custom changes inside sem:triples documents can get lost when MarkLogic touches those triples via SPARQL Update.

If you use MarkLogic as a triple store, triples get loaded as managed triples, and can therefore be updated using SPARQL Update. On the other hand, if you are embedding triples inside documents, you wouldn’t expect your documents to be changed by SPARQL Update, and MarkLogic will not allow that. In that case, use the document update APIs.

In summary, don’t create or touch sem:triples documents yourself. Effectively, the terms embedded and unmanaged triples are synonyms.

Managed or Unmanaged

Now that we have learned what managed and unmanaged really means, we come to the key question: how do you store the RDF data? As triples of course — but the question is, managed or unmanaged?

Keeping information close together if it belongs together seems like the most logical move. Take, for instance, triples with semantic enrichment information about a particular document in your MarkLogic database. For such triples, it makes a lot of sense to embed them either inside the document itself, or in its properties — in other words, store them as unmanaged triples. Doing so also makes it very easy to maintain the information. If you delete the document, the triples will get deleted along with it automatically, so you don’t need to worry about deleting the triples as well.

For RDF data that comes from an entirely different source than your other data, and stands on its own, it makes a lot of sense to store it separately using managed triples.

If you use MarkLogic as a pure triple store, you would probably use managed triples only and have the full capabilities of SPARQL at your disposal. If you use MarkLogic as a ‘pure’ document store, you would embed triples in your documents, and not use SPARQL (or use it only in a very limited way).

This distinction, however, isn’t always as clear cut as you might want it to be. The RDF data could be a mixture of generic information and document-specific information, particularly if it comes from one source. In this case, it’s best to embed only the document-specific triples and store the other triples separately (probably as managed triples).

But there may be a lot to gain by deliberately mixing the two worlds! MarkLogic is perfectly happy with having plain documents, documents with embedded triples, and managed triples all sitting next to each other, and running queries across all of them.

Combination Queries

When you have both documents (with or without embedded triples) and managed triples living next to each other within MarkLogic, you could run a search or lookup against one of the two, and use the outcome as input for a search or lookup in the second set. That is how you would perform joins in MarkLogic with plain documents as well.

This is perfectly fine. If tuned properly, each search takes less than 1/100th of a second, so doing several searches and lookups to perform some joins would hardly be noticed by end users, provided you execute all of them in one request on the server side.

However, you can also combine triple and SPARQL queries with document queries in a single combination query. The REST API endpoint for running SPARQL (/v1/graphs/sparql), as well as the built-in functions for running SPARQL (sem:sparql and related), all take extra parameters to constrain the SPARQL code to documents (with triples) matching those queries.

The SPARQL engine simply ignores the documents that don’t match the document queries, and only uses the triples from the documents that are left. This builds on top of how MarkLogic already combines query terms, so it requires very little overhead, which makes it ideal for embedded triples.
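Here is a hedged sketch of what this can look like in XQuery (MarkLogic 8 syntax, where the document query is wrapped in a sem:store; the word query is just an example constraint):

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Only use triples from documents that also match the cts:query :)
let $store := sem:store((), cts:word-query("Argo"))
return sem:sparql('
  SELECT ?film
  WHERE { ?film <http://dbpedia.org/ontology/producer> <http://dbpedia.org/resource/George_Clooney> }
', (), (), $store)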

You can also do it the other way around, and include a so-called cts:triple-range-query within a more traditional search across documents. Note that this query only filters on individual triples; it does not, for instance, take a full SPARQL statement to filter search results. It will also not apply inference rules, and will only match materialized triples.
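For instance, a minimal sketch of restricting an ordinary document search to documents that also contain a particular triple (the word query is again just an example constraint):

xquery version "1.0-ml";

(: Find documents that mention "Oscar" and contain a triple with this predicate and object :)
cts:search(fn:doc(),
  cts:and-query((
    cts:word-query("Oscar"),
    cts:triple-range-query(
      (),                                                  (: any subject :)
      sem:iri("http://dbpedia.org/ontology/producer"),     (: predicate :)
      sem:iri("http://dbpedia.org/resource/George_Clooney"))  (: object :)
  )))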

Also worth mentioning, though less efficient: you can use cts:contains within the FILTER part of SPARQL, which basically allows you to do full-text searching inside SPARQL with the full power of MarkLogic’s search capabilities.
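A sketch of what this might look like (assuming the common convention of binding the cts prefix to http://marklogic.com/cts#; see the Semantics Developer’s Guide for the exact syntax supported by your version):

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Full-text match on the object value from within SPARQL :)
sem:sparql('
  PREFIX cts: <http://marklogic.com/cts#>
  SELECT ?book ?title
  WHERE {
    ?book <http://purl.org/dc/elements/1.1/title> ?title .
    FILTER cts:contains(?title, cts:word-query("favorite"))
  }
')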

Best of Both Worlds

Such combination queries can take you beyond what is possible with only one kind of query at a time. They also allow for a much more efficient calculation of search and query results.

Imagine RDF data with a time angle: “tell me what we knew about the MH17 plane crash a year ago”, a perfect case for bi-temporal triples.

Or what about RDF data curated for quality: “show me all data about Barack Obama from LOD sources, but validated by approved curators”, a good case for triples annotated with curation details.

Or documents with semantic enrichments as triples, with supplementary information as (potentially) managed triples: “search across all documents mentioning a US president born between 1900 and 2000”.

Less obvious, but very powerful, is the fact that you can apply document permissions to triples. For managed triples you do that via graphs. Access to unmanaged triples is controlled via the document permissions on the document in which they are embedded.
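A minimal sketch of both approaches (the role name and document URI are examples; the document URI refers to the embedded-triple sketch shown earlier):

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Managed triples: control access via the graph :)
sem:graph-add-permissions(
  sem:iri("http://marklogic.com/semantics#default-graph"),
  (xdmp:permission("analyst", "read"), xdmp:permission("analyst", "update")))
;
xquery version "1.0-ml";

(: Unmanaged triples: permissions simply go on the document that embeds them :)
xdmp:document-add-permissions(
  "/articles/argo.xml",
  (xdmp:permission("analyst", "read"), xdmp:permission("analyst", "update")))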

More examples and details on embedding triples can be found in the Semantics Developer’s Guide.

Faceted Search

You can even consider building faceted search from triples. MarkLogic comes with built-in functionality that can return top values with frequency counts very fast. This leans on the document approach, however, and works best with denormalized data.

The idea is that you select a set of documents: your search result. For that search result, MarkLogic can pull up values sorted on frequency directly from range indexes that you define on elements, properties, paths, etc.

With the same kind of effort, MarkLogic can also pull up value combinations, also known as co-occurrences or value tuples. For this to work well, it is important that data that belongs together lives together in one document (or, more accurately, in one fragment).
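As a sketch (assuming element range indexes exist on year and producer elements; both element names are made up for illustration):

xquery version "1.0-ml";

(: Co-occurring (year, producer) value pairs within a search result, most frequent first :)
for $tuple in cts:value-tuples(
    (cts:element-reference(xs:QName("year")),
     cts:element-reference(xs:QName("producer"))),
    "frequency-order",
    cts:word-query("Oscar"))
return
  <tuple frequency="{cts:frequency($tuple)}">{ json:array-values($tuple) }</tuple>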

Unfortunately, with managed triples you are never sure which document a triple will end up in (document boundaries carry no particular meaning for managed triples), nor whether it will be stored together with triples about the same topic. So that approach won’t work well. That is the benefit of embedding triples: you have the opportunity to keep related triples together, and to embed them in the same fragment as the other data they relate to.

It is possible to build facets on managed triples by leveraging the triple index with a custom facet. Inside a custom facet you could run SPARQL code, or do counts on cts:triples calls. With MarkLogic 8 and beyond, you can even use SPARQL aggregate functions like COUNT. Keep in mind, though, that the triple index and SPARQL are about triples, not documents, whereas facets are focused around documents. What meaning would selecting such a facet value have for your search result? With triples embedded inside documents, functions like sem:database-nodes have a much clearer meaning. Also keep in mind that generating facet information using SPARQL will likely be less performant.
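For example, a sketch of facet-like counts straight from the triple index with a SPARQL aggregate (using the DBPedia-style triples loaded earlier):

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Count films per producer, most prolific first :)
sem:sparql('
  SELECT ?producer (COUNT(?film) AS ?films)
  WHERE { ?film <http://dbpedia.org/ontology/producer> ?producer }
  GROUP BY ?producer
  ORDER BY DESC(?films)
  LIMIT 10
')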

Embedding in Documents versus Properties

The triple index looks for triple constructs in both documents and in properties. The cost of storing triples in properties, however, is that it requires a second database fragment for each document — meaning extra storage overhead. Constraining document searches with a properties-query also takes a slight performance hit, since MarkLogic will need to join between document fragments and properties fragments. Showing results might also mean you have to pull information from two places, which could be more cumbersome than having your triples and document content in one fragment.

The benefit of embedding triples in properties, on the other hand, is that you automatically get a clean separation between document content and triples. And if you are handling binary or plain-text documents, for instance, embedding triples in properties is your only option.
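A minimal sketch of attaching a triple to document properties (useful for binaries; the URIs are made up, and the foaf:depiction predicate is just an example):

xquery version "1.0-ml";

(: Attach an unmanaged triple to the properties fragment of a binary document :)
xdmp:document-add-properties(
  "/images/george-clooney.jpg",
  <sem:triple xmlns:sem="http://marklogic.com/semantics">
    <sem:subject>http://dbpedia.org/resource/George_Clooney</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/depiction</sem:predicate>
    <sem:object>http://example.org/images/george-clooney.jpg</sem:object>
  </sem:triple>)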

Conclusion

As soon as you start embedding your triples inside documents (or properties) you will have unmanaged triples. Unmanaged triples come with a few downsides, like not being able to use SPARQL Update on them, but they open a lot of interesting possibilities that are unique to MarkLogic. No other database allows querying XML, Text, JSON, Binary, and RDF data in a single query statement.

Special thanks to Patrick McElwee, Eric Poilvet, Dave Cassel, John Snelson, and Stephen Buxton for their feedback and contributions!

Geert Josten