Lazy Evaluation Magic

October 23, 2017 Data & AI, MarkLogic

Laziness is rarely ever viewed as a good quality, but when used appropriately, lazy evaluation can be a really great technique!

Lazy evaluation is the ability to process data only when needed. The example presented here shows how a call to cts:search() can be processed but actually evaluated at a later point when it is directly bound into an expression. This is particularly useful when we want to process a very long sequence (e.g. the search matches millions of documents) and we don’t really want to access all members of the sequence. Lazy Evaluation can be very useful however it is not always desired. For example, if we need to read out the contents of a range index, we may explicitly want to read all the contents of the index into memory. See When to be lazy, and when to be eager, when querying the database for an example of lazy evolution in action.

We recently implemented a common request for one of our customers: to produce a sequence of search results for “child” documents grouped by their “parent” document, and return only the most relevant hit for each parent. To clarify further, let’s imagine that the parent documents are chapters with tables of contents and the child documents are sections with the actual content. We want to search all sections and receive a list of chapters whose children match the query. This list is in descending relevance order. Each chapter with a hit should only appear once in the list.

Child documents can be identified by an element is-chapter containing a Boolean value set to false. Each child document has a parent chapter whose id is stored in the element chapter-id.

To help you visualize this, let’s look at some sample XML.

Here is a sample of a chapter document:

<document id="0001">
    <is-chapter>true</is-chapter>
    <content>...details....</content>
</document>

And here is a sample section document. Note that it has an element chapter-id which identifies the chapter to which it belongs:

<document id="0001-0001">
    <chapter-id>0001</chapter-id>
    <is-chapter>false</is-chapter>
    <content>...details...</content>
</document>

And here is an XQuery implementation of this requirement that makes use of lazy evaluation:

let $qtxt := "foo"
let $query := cts:and-query((
                  cts:element-range-query(xs:QName("is-chapter"),"=", "false"),
                  cts:word-query($qtxt)
                ))
(:extract the parent doc id for every hit:)
let $p := function($result as node()) { $result/chapter-id/fn:string()  }                        
let $options :=  (cts:score-order("descending"), "unfiltered" )                     
return
    fn:distinct-values( fn:map($p, cts:search(/document, $query, $options)  ) )[1 to 10]

In the and-query we are performing a word search in all “non-chapter” docs (is-chapter = false). We then set up the final call of the query which returns a list of parents of the matched children in most relevant order.

Profiling demonstrates how lightning-fast the query can be executed, and how an increased amount of distinct values (the chapter-id's)can be found with remarkably little overhead even though the actual search query could match many documents.

In order to get the first 10 distinct values in our test case, for example, it needed to look at the first 12 search hits (the inline function $p above is called 12 times). To retrieve the first 20 distinct values, it needed to examine the first 31 search hits.

But shouldn’t there be massive performance penalty for retrieving all search results and then extracting the distinct values? Luckily, lazy evaluation and some useful features of the MarkLogic Search API can prevent that.

The call to cts:search returns an iterator and the profiler shows that this function is called just once. The call to fn:map also returns an iterator (although a profiler bug shows this as being invoked many times, it is only invoked once). The function fn:distinct-values takes advantage of these iterators and uses them to fill up the sequence of unique values. Thanks to these iterators, the query does not need to initially retrieve all search results, and in turn is super-efficient!

In summary, lazy evolution can be a very useful technique when selectively processing large result sets!!!

Joe Crean