Do you ever ask yourself: how can I evaluate an XPath expression to return a lexicon of values, such as cts:uris
, without pulling each document from disk?
This issue often arises from scenarios where you have to process documents in bulk, using tools like corb or MarkLogic’s Task Server to spawn processing across your cluster. When performing bulk operations, you need to ensure you can process documents that meet or do not meet a specific condition. Additionally, you must ensure that if the processing fails you can continue where you left off without reprocessing all documents.
We are going to focus on answering one particular question: how do you find the URIs of documents that do not have a deeply nested structure? Generally, this problem isn’t easily solvable using pure cts:query constructs — that was, until now.
Consider the following code for two documents having similar nested structures, with the second missing the /p:parent/p:outer/p:last
element:
declare namespace p = "p"; let $doc1 := <p:parent> <p:outer> <p:first>pf1</p:first> <p:last>pl1</p:last> </p:outer> <p:child> <p:inner> <p:first>cf1</p:first> <p:last>cl1</p:last> </p:inner> </p:child> </p:parent> let $doc2 := <p:parent> <p:outer> <p:first>pf2</p:first> </p:outer> <p:child> <p:inner> <p:first>cf2</p:first> <p:last>cl2</p:last> </p:inner> </p:child> </p:parent> return ( xdmp:document-insert("doc1",$doc1), xdmp:document-insert("doc2",$doc2) )
Determining which URIs have this structure is very simple using XPath Expression as below:
doc()[p:parent/p:outer/p:last]/xdmp:node-uri(.) [Returns] doc1
The problem with this approach is that the XPath expression runs “filtered”, which requires fragments to be pulled from disk to return the URI. While this works for a small database with a few thousand records, at some point you will hit the dreaded "XDMP-EXPNTREECACHEFULL"
error. This essentially means that you tried to return more documents than can fit into memory for the transaction. So putting on your MarkLogic black belt, you construct a complex cts:query using nested cts:element-query‘s to simulate a path structure such as:
cts:uris((),(), cts:element-query(xs:QName("p:parent"), cts:element-query(xs:QName("p:outer"), cts:element-query(xs:QName("p:last"), cts:and-query(())) )) )
WOW! Not only is that complicated, but it is also incorrect, as it returns both doc1 and doc2. So why did this happen? Well, the short answer is that MarkLogic resolves the cts:query “unfiltered”, relying on indexes in memory and not the fragments themselves. To resolve this correctly, the index must determine that p:parent
is a parent element of p:outer
, that has a child element of p:last
. Sure, you could try to tinker with positions and proximity, but even then may not yield the correct result. So how come we can do this in XPath, but not perform the same thing using cts:query? To answer this question, we will look deeper into a handy function called xdmp:plan
. The documentation for the xdmp:plan function states the following:
xdmp:plan( $expression as item()*, [$maximum as xs:double?] ) as element()
Returns an XML element recording information about how the given expression will be processed by the index. The information is a structured representation of the information provided in the error log when query trace is enabled. The query will be processed up to the point of getting an estimate of the number of fragments returned by the index.
So let’s dig a bit deeper into what exactly is inside the plan by wrapping our XPath Expression with the xdmp:plan
function.
xdmp:plan(/p:parent/p:outer/p:last)
The output is an XML fragment with the following information:
<qry:query-plan >"http://marklogic.com/cts/query"> <qry:info-trace>xdmp:eval("declare namespace p = &quot;p&quot;;&#10;xdmp:plan(/p:parent/p:o...", (), <options ><database>14817900035712326498</database><root>c:usersgvidalw...</options>)</qry:info-trace> <qry:info-trace>Analyzing path: fn:collection()/p:parent/p:outer/p:last</qry:info-trace> <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace> <qry:info-trace>Step 2 is searchable: p:parent</qry:info-trace> <qry:info-trace>Step 3 is searchable: p:outer</qry:info-trace> <qry:info-trace>Step 4 is searchable: p:last</qry:info-trace> <qry:info-trace>Path is fully searchable.</qry:info-trace> <qry:info-trace>Gathering constraints.</qry:info-trace> <qry:info-trace>Executing search.</qry:info-trace> <qry:final-plan> <qry:and-query> <qry:term-query weight="0"> <qry:key>4523426088818201359</qry:key> <qry:annotation>descendant(doc-root(element(p:parent),doc-kind(document)) )</qry:annotation> </qry:term-query> <qry:term-query weight="0"> <qry:key>11698328636857559070</qry:key> <qry:annotation>descendant(element-child(p:parent/p:outer))</qry:annotation> </qry:term-query> <qry:term-query weight="0"> <qry:key>17573168699309579415</qry:key> <qry:annotation>element-child(p:outer/p:last)</qry:annotation> </qry:term-query> </qry:and-query> </qry:final-plan> <qry:info-trace>Selected 1 fragment</qry:info-trace> <qry:result estimate="1"/> </qry:query-plan>
Now, if you noticed from the output above, the query is now fully resolvable from indexes denoted by the following lines:
<qry:info-trace>Analyzing path: fn:collection()/p:parent/p:outer/p:last</qry:info-trace> <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace> <qry:info-trace>Step 2 is searchable: p:parent</qry:info-trace> <qry:info-trace>Step 3 is searchable: p:outer</qry:info-trace> <qry:info-trace>Step 4 is searchable: p:last</qry:info-trace> <qry:info-trace>Path is fully searchable.</qry:info-trace> <qry:info-trace>Gathering constraints.</qry:info-trace> <qry:info-trace>Executing search.</qry:info-trace>
This is good, but how does this resolve our issue? The simple answer is that it doesn’t. But what is returned after does. Once each step in the plan is resolvable, then the end result is the query plan itself. Now if you notice from the excerpt below, the qry:final-plan
expresses a series of qry:term-query elements that define a qry:key.
<qry:final-plan> <qry:and-query> <qry:term-query weight="0"> <qry:key></qry:key> <qry:annotation>descendant(doc-root(element(p:parent),doc-kind(document)) )</qry:annotation> </qry:term-query> <qry:term-query weight="0"> <qry:key></qry:key> <qry:annotation>descendant(element-child(p:parent/p:outer))</qry:annotation> </qry:term-query> <qry:term-query weight="0"> <qry:key></qry:key> <qry:annotation>element-child(p:outer/p:last)</qry:annotation> </qry:term-query> </qry:and-query> </qry:final-plan> 45234260888182013591169832863685755907017573168699309579415
These keys actually resolve to term keys in the Universal Index. Terms within the Universal Index cover both words and structure. Each of the term-query’s annotations describe what each key represents. Notice that the first key descendant(doc-root(element(p:parent), doc-kind(document)))
represents the doc() axis to the p:parent element, and the next key descendant(element-child(p:parent/p:outer))
represents the relationship between the p:parent
and p:outer
element, until you get the final key element-child(p:outer/p:last)
which completes the path step between the p:outer
and p:last
elements.
Okay, this is getting more interesting, but we still have not seen how to resolve the problem. So now we are going to go into undocumented territory and hack the plan FTW.
A little known feature outside of MarkLogic’s walls is a function called cts:term-query(xs:unsignedLong)
, which resolves a query based on a term key. Now if we take the keys from the plan above, we can craft a cts:query to combine all of those term keys into a single composable query. Since the results of the plan are xml this is as simple as the following statement:
cts:uris((),(), cts:and-query( xdmp:plan(/p:parent/p:outer/p:last)//*:key/cts:term-query(.) ) ) [returns] doc1
Whoa!!!! Is that for real? Indeed it is. So if that is true, what other things can we query using this method?
How about finding all URIs for a given root element?
cts:uris((),(), cts:and-query( xdmp:plan(/p:parent)//*:key/cts:term-query(.) ) ) [Returns] doc1 doc2
What about all binary documents?
cts:uris((),(), cts:and-query( xdmp:plan(/binary())//*:key/cts:term-query(.) ) ) [Returns all binary document uris]
What about all documents that don’t have the /p:parent/p:outer/p:last path?
cts:uris((),(), cts:not-query(cts:and-query( xdmp:plan(/p:parent/p:outer/p:last)//*:key/cts:term-query(.) )) ) [Returns] All documents in database not 'doc1'
Why did this not work? We wanted all documents that had p:parent
that did not have the p:outer/p:last
element. The simple answer is by using a not-query you inverted the query to return all documents that did not resolve to each step in the plan,including all p:parent
elements So, head scratching, how can we fix this?
I will get into another neat and unknown feature of a structure called map:map.
Maps are mutable key/value structures that perform extremely fast hash insert/lookup operations. The map:map
structure has been available for quite some time (since MarkLogic 5) and most lexicon functions (cts:uris, cts:element-x-values) support maps as an alternative output to list sequences. But what is unknown about these structures is they support operators such as (+, -, *, div, mod) to mutate and combine maps together. Again, this is merely a brief overview, but you explore a more detailed review of map:map operators functionality in the post Understanding map:map operators, aggregates and use cases.
For the purposes of solving our original problem, we will use maps to compute the difference (map - map
) of two cts:uris
calls. Revisiting our original example, we wanted to return all p:parent
documents who did not have the p:outer/p:last
element. The solution is returned using the following code:
map:keys( cts:uris((),("map"), cts:element-query(xs:QName("p:parent"), cts:and-query(()))) - cts:uris((), ("map"), cts:and-query(xdmp:plan(/p:parent/p:outer/p:last)//qry:term-query/qry:key ! cts:term-query(.)) )) [Returns] doc2
Which translates to:
cts:uris((), ("map"), cts:element-query(xs:QName("p:parent"),cts:and-query(())))
Return all URIs that match (/p:parent
) as a map:map
- (: Notice the minus sign :) cts:uris((), ("map"), cts:and-query(xdmp:plan(/p:parent/p:outer/p:last)//qry:term-query/qry:key ! cts:term-query(.))
Return the difference (-
) of all URIs that have match (/p:parent/p:outer/p:last
) as a map:map
map:keys($map1 - $map2)
The outer map:keys
flattens the map back to a sequence of URI values.
I knwo this was quite a lot to digest, and it may seem like what I’ve discussed is juju or dark magic, but in reality, you can see that these methods unlock powerful tools that allow you to use MarkLogic in ways not previously possible. Good luck and happy coding!
DISCLAIMER: The techniques in this article, including the use of cts:term-query(), may or not be sanctioned by MarkLogic and are subject to change in the product. So use at your own risk!!!. But hey — “no risk, no reward”.