Design Patterns: The Envelope Design Pattern

MarkLogic design patterns are reusable solutions for many of the commonly occurring problems encountered when designing MarkLogic applications. These patterns may be unique to applications on MarkLogic or may be industry patterns that have MarkLogic-specific considerations. Unlike recipes, MarkLogic design patterns are generally more abstract and applicable in multiple scenarios.

Envelope Design Pattern

Intent

Separate data intended for consumption by external processes from data intended to make the MarkLogic database system more powerful and flexible. Create an overall envelope parent element or object that contains a “headers” section and an “instance” data section, which are separate within the document. This aligns with how the MarkLogic Data Hub Framework and MarkLogic Entity Services create their envelopes.

Motivation

Additional, Richer Indexing

In MarkLogic, the JSON or XML document that is stored becomes the “interface to the indexes.” This means that when you add an element to an XML document, or add a nested object to a JSON structure, you are also causing that data and its relationship to parents and values to be indexed. Thus, the primary mechanism for adding indexed data is simply adding elements and nested objects.

It is useful to separate out the data that is used to do something in MarkLogic from the data that is stored because your data services API needs it.

Data services may want data that:

Includes human readable dates such as Jan 20, 2005
Conforms to some standard schema that cannot be changed
Uses nested or ambiguous JSON and XML structures that are hard to query efficiently
Preserves original input data without any change for compliance or policy reasons

In contrast, MarkLogic processing and indexing may improve if data:

Uses XML standard data-types including dates, numbers and timestamps
Materializes important derived information such as internal item counts
Materializes missing data to allow faster queries for the absence of an element
Combines a number of alternatives into one uniquely-named element, such as the “primary” address being selected from home, shipping, and business addresses, based on business rules
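
As a rough illustration of the kinds of derived values listed above, the following XQuery sketch builds a headers section from a hypothetical customer record. The element names (customer, order, address, phone), the meta namespace, and the rule for choosing the primary address are all assumptions made for this example only.

```xquery
xquery version "1.0-ml";
declare namespace es = "http://marklogic.com/entity-services";
declare namespace meta = "http://marklogic.com/patternExample/meta";

(: hypothetical input record; every element name here is invented for the sketch :)
let $customer :=
  <customer>
    <order>A-100</order>
    <order>A-101</order>
    <address type="home">12 Elm St</address>
    <address type="shipping">40 Dock Rd</address>
  </customer>
(: assumed business rule: prefer shipping, then home, then business :)
let $primary := (
  $customer/address[@type = "shipping"],
  $customer/address[@type = "home"],
  $customer/address[@type = "business"]
)[1]
return
  <es:headers>
    <!-- derived count, materialized so it can be indexed directly -->
    <meta:orderCount>{fn:count($customer/order)}</meta:orderCount>
    <!-- explicit marker for missing data, so absence can be queried as a value -->
    <meta:hasPhone>{fn:exists($customer/phone)}</meta:hasPhone>
    <!-- one uniquely named element chosen from several alternatives -->
    <meta:primaryAddress>{fn:string($primary)}</meta:primaryAddress>
  </es:headers>
```

The headers element uses a declared prefix rather than a default namespace, so the unprefixed paths inside the curly braces still address the un-namespaced customer elements.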

While the same goal can be achieved using extensive transforms on the data as it is ingested, and then again as it is accessed by the data services, it is much more efficient and clear to have the externally-accessed “core” data stored as-is.

For the sake of simplicity, we will discuss this pattern in XML terms (documents, elements and nodes) going forward, but all concepts also apply to JSON.

In summary, we address two conflicting goals by using the envelope pattern:

Store data in the same format as the desired output format
Define indexed, normalized, and otherwise convenient structures for use within MarkLogic that are in a different format from the input/output formats sent to the database.

Some systems have additional goals as well, such as multiple APIs that consume data in radically different formats (e.g., CSV vs. JSON, XML vs. RDF). In that case, there may be more sections than the headers and instance sections (such as “triples” or “html-preformatted”).

Integrating Heterogeneous Data (“Silo Busting”)

This pattern is often used to quickly integrate large sets of data together into MarkLogic. With this approach, raw or “good enough” data is directly ingested into MarkLogic, and a relatively small number of elements are initially included in the “headers” section to maintain uniform indexing, retrieval, and analysis across many data sets.

All data in the “instance” section can be accessed, rendered using default rendering, exported, and managed, and the system accessing the data can be developed very quickly using the most valuable data first.

When used with the Data Hub Framework—where raw content is initially ingested into a Staging database—the “instance” section would include more uniform or harmonized data.

Preserve System Flexibility

Keeping data used purely by MarkLogic processes separate from data accessed by data services allows developers to add data to the “headers” section as needed without breaking external layers or sub-systems. This can reduce time to analyze, re-code, test, and coordinate on large projects.

Applicability

The envelope design pattern should generally be used in all designs. You should have a specific and compelling reason not to use this pattern before omitting it from your design.

We recommend using this pattern when:

You need to add a facet or chart on a value that is not stored in the documents you have already modeled or inherited, or not in the format you need.
- More generally, you have to index some fact or value for any purpose (not necessarily charting or faceting), but it is not stored in a form suitable for indexing.
You wish to expose a certain SQL view using range indexes, but the raw data is not in the format that the SQL caller needs.
You have to index some value but it is not in the right format.
You are dealing with a schema you cannot change, but want to preserve the ability later to add facts and other data to the stored records for indexing and convenience purposes.
You are storing both document data (XML and JSON) and RDF. Put the RDF in an envelope as a separate section.
You have multiple callers that require substantially different versions of the same data, and a transform from one format to the other at runtime is too much of a performance burden. (Separate documents may also work in this case.)
There is an internal process at your organization or project making it slow to change what is perceived as the “real” data model, making it important to have an index-only section that can be changed based on developer needs, without coordinating with external groups.

Participants

Retrieval code: Code to retrieve data must know that the data has two or more sections, search or filter based on data in the “headers” section, and return the “instance” data section needed by the data services.
Updating code: Code to update or ingest documents must know to keep the “headers” section up-to-date by applying a transform that builds the “headers” section from the instance or input data. Triggers can be used, but they are less efficient than using a structural “All access through a service” pattern (TBD).
Indexes: The “headers” section will often have elements formatted specifically to drive certain indexes, such as range indexes or the RDF-based triple index.

Collaborations

“All access through a service” is a pattern that ensures that all updates add the “headers” section and that all queries remove it. This makes the “headers” section invisible to callers, preserving flexibility within the MarkLogic data layer (within .js and .xqy code inside MarkLogic itself).
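
A minimal sketch of such a service layer is shown below, written as an XQuery library module. The module namespace, the function names svc:put and svc:get, and the idea of passing pre-built header elements as a parameter are hypothetical choices for this illustration, not an established API.

```xquery
xquery version "1.0-ml";
(: hypothetical library module; the namespace and function names are assumptions :)
module namespace svc = "http://example.com/patterns/envelope-service";
declare namespace es = "http://marklogic.com/entity-services";

(: wrap the caller's data in an envelope on the way in :)
declare function svc:put(
  $uri as xs:string,
  $instance as element(),
  $headers as element()*
) as empty-sequence()
{
  xdmp:document-insert($uri,
    <es:envelope>
      <es:headers>{$headers}</es:headers>
      <es:instance>{$instance}</es:instance>
    </es:envelope>)
};

(: strip the envelope on the way out, so callers only see their own format :)
declare function svc:get($uri as xs:string) as element()*
{
  fn:doc($uri)/es:envelope/es:instance/element()
};
```

Because callers only ever go through these two functions, the contents of the headers section can change without any caller noticing.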

Consequences

Adding indexable data is decoupled from the data formats returned to callers. A change to the headers will not be visible to external clients that depend on the “instance” data.

Performance is increased when retrieving a different data format than the one stored in MarkLogic. Typically, retrieval code simply ends with “return $envelope/es:instance/element()” to exclude all the envelope and headers elements.
There is a low to moderate performance cost to create the headers data on insert and to return only the instance section on query.
RESTful data access must use resource extensions to allow XQuery/JavaScript code to intercept and modify data on both insert and extract. Out-of-the-box REST access does not include transforms of data on ingest/extract.
MarkLogic mlcp does not transform data on extract, so XQSync should be used for bulk data movement that does not include the headers data.

Implementation

Consider the following issues when implementing the envelope pattern:

You will have to perform all access through a set of .xqy or .sjs files to ensure the pattern is transparent, or you must add a trigger, which adds some update overhead.
You must use different elements in the headers section vs. the instance section, typically by using a different namespace in XML for the headers section. Be aware that text or word queries will see data in both sections unless you use an enclosing element-query() to restrict searches to the instance or headers section. If this kind of element query is important to you, you will probably want positional indexes turned on so that the queries can be resolved efficiently from the indexes without filtering.
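
The restriction described above can be sketched as follows, assuming the es:envelope structure used elsewhere in this article and a search for the word “fence” chosen purely as an example:

```xquery
xquery version "1.0-ml";
declare namespace es = "http://marklogic.com/entity-services";

(: match the word only when it occurs inside the instance section,
   so that text placed in the headers section cannot produce a hit :)
let $q := cts:element-query(xs:QName("es:instance"), cts:word-query("fence"))
for $envelope in cts:search(/es:envelope, $q)
return $envelope/es:instance/element()
```

Swapping es:instance for es:headers in the element query would restrict the same word search to the headers section instead.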

Sample Code

Article Repository

Consider a set of articles like this one in XML format that need to be stored, searched, and accessed:

    
<article>
  <abstract>
    <para>You can build a fence by deciding the areas to separate, and then making a barrier from wood or metal that sits between them.</para>
  </abstract>
  <para>It is often said that good fences make good neighbors.</para>
  <para>Choosing areas to divide with your fence is the first step. Jim Smith has built a lot of fences, and says that in Paris, France, people divide garden areas from other areas, but in Cleveland, OH, people divide children's play areas from the street most often.</para>
  <articleInfo>
    <title>How to build a fence</title>
    <revision>
      <date>1/15/2002</date>
      <revnumber>1.0</revnumber>
    </revision>
    <author><firstname>Nihal</firstname><surname>Jain</surname></author>
  </articleInfo>
</article>

Figure 1: Sample article XML document

Figure 1 shows a simplified approximation of the DocBook schema. Let’s assume that callers need this data in this exact format or it will be considered invalid. There are two problems to consider if you want to search or facet using a range index on the revision date. First, the desired data is in a non-specific <date> element; therefore, a range index on “date” is likely to also include other dates if <date> is ever used in other contexts. Second, the date is in a format that is not a valid lexical form for an xs:date (or xs:dateTime) in XML Schema. To solve these two issues, we run this transform on ingest:

    
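declare namespace meta = "http://marklogic.com/patternExample/meta";
declare variable $article external;
(: the source date is month/day/year, e.g. 1/15/2002 :)
let $textDate := $article/articleInfo/revision/date/text()
(: parse it with an XSLT-style picture string, then cast to xs:date for a date range index :)
let $xsDate := xs:date(xdmp:parse-dateTime("[M]/[D]/[Y0001]", $textDate))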
let $internalDate := <meta:revisionDate>{$xsDate}</meta:revisionDate>
return
  <envelope
    xmlns="http://marklogic.com/entity-services">
    <headers>
      {$internalDate}
    </headers>
    <instance>
      {$article}
    </instance>
  </envelope>

Figure 2: Transform code that extracts a date and creates a new date element for searching

Code in Figure 2 extracts a transformed/formatted version of the date and creates a more specifically-named element in another namespace, <meta:revisionDate>, which allows for unambiguous indexing and access to the desired xs:date value.

Now, to search for all articles in January of 2002, we would add a date range index to <meta:revisionDate> and query like this:

    
declare namespace es = "http://marklogic.com/entity-services";
declare namespace meta = "http://marklogic.com/patternExample/meta";
(: generic function to query documents, including headers, but return only the instance data :)
declare function es:queryData($q) {
 for $envelope in cts:search(/es:envelope, $q)
 return $envelope/es:instance/element()
};
let $fromQ := cts:element-range-query(xs:QName("meta:revisionDate"),
  ">=", xs:date("2002-01-01"))
let $toQ := cts:element-range-query(xs:QName("meta:revisionDate"),
  "<=", xs:date("2002-01-31"))
let $jan2002Q := cts:and-query(($fromQ, $toQ))
return es:queryData($jan2002Q)

Note that the function es:queryData($q) returns any child element of the <es:instance> element, so it is not specific to articles.
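
The date range index on <meta:revisionDate> that this query relies on could be created with the Admin API. The following is a minimal sketch, assuming a database named “Documents” (an assumption for this example) and a user with sufficient administrative privileges:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "Documents" is an assumed database name; substitute your content database :)
let $dbid := xdmp:database("Documents")
(: scalar type "date"; the collation is left empty for non-string types :)
let $index := admin:database-range-element-index(
  "date",
  "http://marklogic.com/patternExample/meta",
  "revisionDate",
  "",
  fn:false())
let $config := admin:database-add-range-element-index($config, $dbid, $index)
return admin:save-configuration($config)
```

Because the index targets an element in the meta namespace, it will not pick up unrelated <date> elements that appear inside the instance data.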

Social Network Relationships

For data representing profiles in a social network, such as LinkedIn or Facebook, we may store a person’s profile as XML, but their relationships as RDF. The RDF may go in the “triples” section.

Here is a hypothetical person profile in a social network application:

              
declare namespace sn = "http://marklogic.com/patterns/example/social-network";
<sn:person>
  <sn:name>Alfred</sn:name>
  <sn:uniqueUserName>Alfred_Jones_1974</sn:uniqueUserName>
  <sn:interests>
    <sn:interest levelofinterest="7">Semantics</sn:interest>
    <sn:interest levelofinterest="10">MarkLogic</sn:interest>
    <sn:interest levelofinterest="3">Polyglot Persistence</sn:interest>
  </sn:interests>
  <sn:friends>
    <sn:friend>Sally2227</sn:friend>
    <sn:friend>MargaretTheProgrammer</sn:friend>
    <sn:friend>Neeraj</sn:friend>
  </sn:friends>
</sn:person>

Figure 3: Sample profile data

Each user is ideally modeled as a document, because it is self-contained and hierarchical. However, the social network itself is a graph, so the relationship data is ideally modeled using RDF triples:

      
Alfred <foaf:knows> Sally
Alfred <foaf:knows> Margaret
Alfred <foaf:knows> Neeraj

To augment the profile in Figure 3 with semantic triple information about the social network “Alfred” is part of, run this code when each document is inserted or updated:


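declare namespace es = "http://marklogic.com/entity-services";
declare namespace sn = "http://marklogic.com/patterns/example/social-network";
(: the profile being inserted or updated, as in Figure 3 :)
declare variable $newPerson external;
let $thisPersonName := $newPerson/sn:uniqueUserName/text()
(: one foaf:knows triple per friend listed in the profile :)
let $knowsGraph :=
  for $friendName in $newPerson/sn:friends/sn:friend/text()
  return sem:triple(
    sem:iri($thisPersonName),
    sem:iri("http://xmlns.com/foaf/0.1/knows"),
    sem:iri($friendName) )
let $envelope :=
  <es:envelope xmlns:es="http://marklogic.com/entity-services">
    <es:triples>
      {$knowsGraph}
    </es:triples>
    <es:instance>
      {$newPerson}
    </es:instance>
  </es:envelope>
return $envelope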

Figure 4: Transform code that adds triples to the envelope

Running the code in Figure 4 results in the structure we want: the “person” record is left as-is, bundled into an envelope with semantic triples that describe the social network derived from this profile:

    
<es:envelope xmlns:es="http://marklogic.com/entity-services">
  <es:triples>
    <sem:triple xmlns:sem="http://marklogic.com/semantics">
      <sem:subject>Alfred_Jones_1974</sem:subject>
      <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
      <sem:object>Sally2227</sem:object>
    </sem:triple>
    <sem:triple xmlns:sem="http://marklogic.com/semantics">
      <sem:subject>Alfred_Jones_1974</sem:subject>
      <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
      <sem:object>MargaretTheProgrammer</sem:object>
    </sem:triple>
    <sem:triple xmlns:sem="http://marklogic.com/semantics">
      <sem:subject>Alfred_Jones_1974</sem:subject>
      <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
      <sem:object>Neeraj</sem:object>
    </sem:triple>
  </es:triples>
  <es:instance>
    <sn:person xmlns:sn="http://marklogic.com/patterns/example/social-network">
      <sn:name>Alfred</sn:name>
      <sn:uniqueUserName>Alfred_Jones_1974</sn:uniqueUserName>
      <sn:interests>
        <sn:interest levelofinterest="7">Semantics</sn:interest>
        <sn:interest levelofinterest="10">MarkLogic</sn:interest>
        <sn:interest levelofinterest="3">Polyglot Persistence</sn:interest>
      </sn:interests>
      <sn:friends>
        <sn:friend>Sally2227</sn:friend>
        <sn:friend>MargaretTheProgrammer</sn:friend>
        <sn:friend>Neeraj</sn:friend>
      </sn:friends>
    </sn:person>
  </es:instance>
</es:envelope>

This example is slightly different than the article repository example in that we introduce a triples section to highlight its purpose. The instance section is simply the original “person” record.
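
Once envelopes like this are stored, the embedded triples can be read back through the triple index. The following is a minimal sketch, assuming the triple index is enabled on the database and using the subject IRI from the example profile:

```xquery
xquery version "1.0-ml";

(: find everyone Alfred_Jones_1974 knows, using the triples embedded in the envelopes;
   the empty sequence leaves the object position unconstrained :)
for $t in cts:triples(
  sem:iri("Alfred_Jones_1974"),
  sem:iri("http://xmlns.com/foaf/0.1/knows"),
  ())
return sem:triple-object($t)
```

The document data in the instance section is untouched by this lookup; the graph query and the document query each use the section that suits them.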

Related Patterns

Related patterns (TBD) include all patterns to add data outside of the actual documents being inserted and returned. These include patterns to store additional information in the URI scheme, collections, properties fragments, or RDF triples.

Uses

The envelope pattern has become ubiquitous in MarkLogic implementations. The pattern is leveraged heavily in the MarkLogic Data Hub Framework, and is likely found in any MarkLogic implementation that involves data integration.

MarkLogic

Damon Feldman

Damon is a passionate “Mark-Logician,” having been with the company for over 7 years as it has evolved into the company it is today. He has worked on or led some of the largest MarkLogic projects for customers ranging from the US Intelligence Community to HealthCare.gov to private insurance companies.

Prior to joining MarkLogic, Damon held positions spanning product development for multiple startups, founding of one startup, consulting for a semantic technology company, and leading the architecture for the IMSMA humanitarian landmine remediation and tracking system.

He holds a BA in Mathematics from the University of Chicago and a Ph.D. in Computer Science from Tulane University.