Working with Ranged Buckets: Custom Constraints

December 13, 2014 Data & AI, MarkLogic

One of my colleagues ran into an interesting problem a while ago where he had data with high and low values for some field and wanted to display bucketed facets on those values. Let’s take a look at how to implement that.

Note: all the code for this post is available at my ranged-bucket GitHub repo, so you’re welcome to clone the repository and follow along.

The Data

To illustrate the point, let’s look at some sample data.

<doc>
  <lo>2</lo>
  <hi>9</hi>
  <id>1154</id>
</doc>

This represents a document whose valid values range from two to nine. Now, suppose we want to get a bucketed facet on these documents, showing how many fall into ranges like 0-4, 5-8, 9-12, etc. This is different from how we usually do facets or buckets. The sample document should be counted for the 5-8 bucket, even though no value from five to eight appears in the document.

Note also that a document may well fall into more than one bucket. The example document will be represented in the three buckets we’ve specified so far.

Generating Data

We need some data to work with, so let’s generate some. The repository has a Query Console workspace that you can import, with a buffer to generate sample data with “lo” values ranging from zero to ten (inclusive) and “hi” values ranging from zero to twenty. The high value is a random number added to the low, ensuring that the high is always greater than the low.

xquery version "1.0-ml";

for $i in (1 to 10000)
return 
  xdmp:document-insert(
    "/" || $i || ".xml",
    <doc>
      {
        let $lo := xdmp:random(10)
        let $hi := $lo + xdmp:random(10)
        return (
          <lo>{$lo}</lo>,
          <hi>{$hi}</hi>
        )
      }
      <id>{$i}</id>
    </doc>
  )

The Code

To implement this, two approaches occurred to me: a custom constraint facet and a User-Defined Function (UDF). This post shows the custom constraint approach; I also discuss working with ranged buckets using UDF.

Custom Constraint

To implement a custom constraint facet, there are three functions we need to know about. The first is used when someone selects a facet value, or otherwise makes use of a constraint, allowing the function to parse the request and turn it into a cts:query; this function is important for any constraint, whether used as a facet or not.

declare function bucket:parse(
  $query-elem as element(),
  $options as element(search:options))
as schema-element(cts:query)
{
  <root>
  {
    let $bucket := fn:tokenize($query-elem/search:text, "-")
    return
      cts:and-query((
        cts:element-range-query(xs:QName("lo"), "<=", xs:int($bucket[2])),
        cts:element-range-query(xs:QName("hi"), ">=", xs:int($bucket[1]))
      ))
  }</root>/*
};

The text part of the incoming request is expected to look like “5-8”, or some other pair of numbers. These are split and used to build an and-query.

To make a custom constraint work as a facet, you need to implement functions to return values and counts. These are split into start-facet and finish-facet functions. The job of the start function is to make the lexicon API calls needed to identify the values; the finish function formats the results as the Search API and REST API expect.

You’re not technically required to implement the start function, since you can make the lexicon calls in the finish function. That’s actually a simple way to get started. You will get some performance improvement if you split the work properly, however. To illustrate this, I implemented both ways. I’ll only show the split code here, but you can see the single-function approach at GitHub.

Here’s the split implementation:

declare function bucket:start(
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*)
as item()*
{
  let $buckets := $constraint/search:custom/search:bucket
  for $bucket in $buckets
  return
    xdmp:estimate(
      cts:search(
        fn:doc(),
        cts:and-query((
          cts:element-range-query(xs:QName("lo"), "<=", xs:int($bucket/@hi)),
          cts:element-range-query(xs:QName("hi"), ">=", xs:int($bucket/@lo)),
          $query
        ))))
};

declare function bucket:finish(
  $start as item()*,
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*)
as element(search:facet)
{
  element search:facet {
    attribute name { $constraint/@name },
    let $buckets := $constraint/search:custom/search:bucket
    for $bucket at $i in $buckets
    return
      element search:facet-value {
        attribute name { $bucket/@lo || "-" || $bucket/@hi },
        attribute count { $start[$i] }
      }
  }
};

You can see the call to xdmp:estimate() in the start function. The values returned end up in the $start parameter to the finish function. Why split them up this way? Because MarkLogic can do some of this work in the background, allowing for a faster overall response.

Sidebar: why estimate and not count?

Note that what you return from the start function is important. In my first attempt, my start function constructed elements with attributes for count, hi, and low, then the finish function pulled out what it needed to make the search:facet-value elements. That was (not surprisingly) slower than just doing everything in the finish function. My revised implementation just returns the results of the xdmp:estimate() calls. The finish function already knows what order they will be in, so it’s able to map those to the correct hi-lo values to construct the search:facet-values.

It’s fair to ask how much difference the one-function versus two-function approaches makes. I generated 100,000 sample documents and ran some simple tests on my MacBook Pro in MarkLogic 7.0-4.1. (I should caveat this by saying I didn’t trouble to shut everything else down, I just wanted to get an idea about the difference.) I threw in a single-value, standard bucket facet for comparison. Each approach was run as a single facet, making calls through the REST API.

ApproachMedian Facet-resolution Time
Two function0.003622 sec
One function0.009369 sec
Single-value buckets0.001049 sec

Take these as rough numbers, but if you’d like to run more precise tests, you can get the full code and configuration from GitHub.

As you can see in Working with Ranged Buckets: User-Defined Functions, the UDF approach provides some advantage over the approach discussed here. Have a look!

Dave Cassel