Working with Ranged Buckets: User-Defined Functions

December 19, 2014 Data & AI, MarkLogic

In my previous post, we looked at one approach to handling ranged buckets, using custom constraints; here we explore a second approach, using user-defined functions (UDFs). To recap the problem, we have data that look like this:

<doc>
  <lo>2</lo>
  <hi>9</hi>
  <id>1154</id>
</doc>

We want to build a facet with buckets like 0-4, 5-8, 9-12, 13-16, and 17-20. The “lo” and “hi” values in the sample document represent a range, so the document should be counted for the 0-4, 5-8, and 9-12 buckets, even though no value from five to eight explicitly appears in the document.
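As a quick sanity check of that rule, here is a standalone snippet (plain C++, no MarkLogic involved) that tests each bucket against the sample document's range. Two inclusive ranges overlap when each one starts at or before the point where the other ends:

#include <cstdio>

// Two inclusive ranges overlap when each starts at or before the
// point where the other ends.
static bool overlaps(int lo, int hi, int bucketLo, int bucketHi)
{
  return lo <= bucketHi && hi >= bucketLo;
}

int main()
{
  const int buckets[5][2] = { {0,4}, {5,8}, {9,12}, {13,16}, {17,20} };
  for (int i = 0; i < 5; ++i)
    if (overlaps(2, 9, buckets[i][0], buckets[i][1]))
      std::printf("%d-%d\n", buckets[i][0], buckets[i][1]);
  return 0; // prints 0-4, 5-8, 9-12
}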

In Working with Ranged Buckets: Custom Constraints, we solved this problem using a normal custom constraint. Here, we use a more involved technique: a User-Defined Function (UDF). Also referred to as "Aggregate User-Defined Functions", UDFs let MarkLogic application developers write C++ code to implement map/reduce jobs. Personally, I haven't written much meaningful C++ (the notable exception being the other UDF I wrote), but I got through it and found some interesting results. (Feel free to clone the code and follow along; suggestions for improving it are welcome.)

Implementation

I’ll refer you to the documentation for the general background on UDFs, but essentially, you need to think about four functions:

start

The start function handles any arguments used to customize this run of the UDF. In my case, I needed to pass in the buckets that I wanted to use. I dynamically allocate an array of buckets that I’ll use throughout the job.
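To make the walk-through concrete, here is a rough sketch of the aggregate's skeleton and its start function, modeled on the sample plugins MarkLogic ships in Samples/NativePlugins. The names (RangedBuckets, Bucket) are mine, the exact signatures live in MarkLogic.h, and I'm assuming the boundaries arrive as a flat sequence of integers (lo1, hi1, lo2, hi2, ...), which may differ from the code in my repository. One deviation from the prose: a std::vector stands in for the dynamically allocated array, which keeps the compiler-generated copy in clone() safe. The snippets in the sections below continue this same source file.

#include "MarkLogic.h"
#include <cstdint>
#include <cstdio>
#include <vector>

using namespace marklogic;

struct Bucket {
  int lo;          // inclusive lower bound, e.g. 0
  int hi;          // inclusive upper bound, e.g. 4
  uint64_t count;  // documents whose range overlaps this bucket
};

class RangedBuckets : public AggregateUDF {
public:
  AggregateUDF* clone() const { return new RangedBuckets(*this); }
  void close() { delete this; }
  void start(Sequence& arg, Reporter& r);
  void map(TupleIterator& values, Reporter& r);
  void reduce(const AggregateUDF* o, Reporter& r);
  void finish(OutputSequence& os, Reporter& r);
  void encode(Encoder& e, Reporter& r);
  void decode(Decoder& d, Reporter& r);
private:
  std::vector<Bucket> buckets;  // stands in for the dynamic array
};

// Read the bucket boundaries passed from XQuery; this sketch assumes a
// flat sequence of integers: lo1, hi1, lo2, hi2, ...
void RangedBuckets::start(Sequence& arg, Reporter& r)
{
  std::vector<int> bounds;
  for (; !arg.done(); arg.next()) {
    int v;
    arg.value(v);
    bounds.push_back(v);
  }
  for (size_t i = 0; i + 1 < bounds.size(); i += 2) {
    Bucket b = { bounds[i], bounds[i + 1], 0 };
    buckets.push_back(b);
  }
}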

map

Two range indexes get passed in, one for the “lo” element and one for the “hi” element. The map function gets called for each forest stand in the database, examining the values in the input range indexes. When two indexes are passed in, the map function sees the values as tuples. For instance, the values in the sample document above show up as the tuple (2, 9). Always check the frequency of that tuple, in case the same pair occurs in multiple documents. Once this function has been called for a stand, we know the counts for each bucket for the values in that particular stand.
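Continuing the sketch, map might look like the following; value(0, ...) and value(1, ...) pull the two halves of each tuple, and frequency() reports how many documents share that pair:

// Called once per stand: walk the (lo, hi) tuples from the two range
// indexes and credit every bucket the range overlaps.
void RangedBuckets::map(TupleIterator& values, Reporter& r)
{
  for (; !values.done(); values.next()) {
    if (values.null(0) || values.null(1))
      continue;                         // skip incomplete tuples
    int lo, hi;
    values.value(0, lo);                // from the "lo" range index
    values.value(1, hi);                // from the "hi" range index
    uint64_t freq = values.frequency(); // same pair may occur in many docs
    for (size_t i = 0; i < buckets.size(); ++i) {
      // Count the document in every bucket its range overlaps.
      if (lo <= buckets[i].hi && hi >= buckets[i].lo)
        buckets[i].count += freq;
    }
  }
}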

reduce

The reduce function combines the per-stand counts, aggregating them until a set of values for the entire database is known. My implementation just needed to add the counts for each bucket.
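Continuing the sketch, reduce is the simplest of the four; both sides were started with the same buckets, so only the counts need adding:

// Fold another instance's per-stand counts into this one.
void RangedBuckets::reduce(const AggregateUDF* o, Reporter& r)
{
  const RangedBuckets* other = static_cast<const RangedBuckets*>(o);
  for (size_t i = 0; i < buckets.size(); ++i)
    buckets[i].count += other->buckets[i].count;
}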

finish

The last step is to organize the results in a way that they can be sent back to XQuery. The finish function builds a map, using “0-4” as the key for the first bucket and the count as the value.
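Continuing the sketch, finish packages the counts using the map-building calls (startMap, writeMapKey, writeValue, endMap) shown in the documentation's examples:

// Return a map with keys like "0-4" and counts as values; XQuery
// receives this as a map:map.
void RangedBuckets::finish(OutputSequence& os, Reporter& r)
{
  os.startMap();
  for (size_t i = 0; i < buckets.size(); ++i) {
    char key[32];
    std::snprintf(key, sizeof(key), "%d-%d", buckets[i].lo, buckets[i].hi);
    os.writeMapKey(key);
    os.writeValue(buckets[i].count);
  }
  os.endMap();
}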

Encoding and Decoding

When working in a cluster, encode and decode functions are important too. I implemented them for my simple tests, but used the UDF on a single MarkLogic instance, so these functions weren’t called.
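For completeness, here is what they look like in this sketch; the one rule is that decode must read fields in exactly the order encode wrote them:

// Serialize enough state to rebuild this object on another host.
void RangedBuckets::encode(Encoder& e, Reporter& r)
{
  int n = (int)buckets.size();
  e.encode(n);
  for (int i = 0; i < n; ++i) {
    e.encode(buckets[i].lo);
    e.encode(buckets[i].hi);
    e.encode(buckets[i].count);
  }
}

void RangedBuckets::decode(Decoder& d, Reporter& r)
{
  int n = 0;
  d.decode(n);
  buckets.resize(n);
  for (int i = 0; i < n; ++i) {
    d.decode(buckets[i].lo);
    d.decode(buckets[i].hi);
    d.decode(buckets[i].count);
  }
}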

Deploying

Building the UDF is pretty simple using the Makefile MarkLogic provides. You'll find it in:

  • /opt/MarkLogic/Samples/NativePlugins/ (Linux)
  • ~/Library/MarkLogic/Samples/NativePlugins/ (Mac)
  • C:\Program Files\MarkLogic\Samples\NativePlugins\ (Windows)

I customized the two places where the name needs to match the filename, but otherwise left the Makefile alone. After compiling, I uploaded the UDF to MarkLogic using Query Console; the exported workspace is on GitHub for your reference.
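One piece of the source worth calling out alongside the build: the shared library needs an entry point that registers the aggregate under the name XQuery will use. This follows the pattern in the sample plugins; "ranged-buckets" is an illustrative name:

// Entry point the server looks for when loading a native plugin.
extern "C" PLUGIN_DLL void marklogicPlugin(Registry& r)
{
  r.version();
  r.registerAggregate<RangedBuckets>("ranged-buckets");
}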

You can call a UDF using the /v1/values endpoint, but I decided to wrap it in a custom constraint to provide a straightforward comparison with the custom constraint built in the previous post. After all, the goal is to provide a facet. A custom constraint consists of XML for the search options plus the XQuery that implements it.

The Results

I figured UDFs would be more interesting with multiple forests, as mapping a job to a single forest with just one stand doesn't gain any parallelism. With that in mind, I bumped my database up to four forests, then to six, and compared my UDF implementation with the two-function approach described in the previous post. I tested with the same 100,000 documents used there.

Median Seconds    4 Forests    6 Forests
UDF               0.002898     0.002858
two-function      0.003909     0.004261

The numbers are the median seconds reported in the facet-resolution-time part of the response to /v1/search?options=udf or /v1/search?options=startfinish. A couple of things jumped out at me. First, the UDF outperformed the two-function XQuery custom facet. Second, the UDF improved very slightly moving from four forests to six; slight enough that we can call it even. The two-function approach's time, however, increased noticeably.

Concluding Thoughts on UDFs

When should you reach for a UDF? When the values you need aren't directly present in your data, it might be worthwhile. For instance, with ranged buckets we can't simply facet on "lo" or "hi", because that wouldn't represent the values in between. Writing a UDF is more complicated and more dangerous than other approaches, but as we saw here, it can bring real performance benefits.

There is usually an alternative. For instance, I could have supplemented the data in our example so that the sample document held all values from two through nine inclusive, allowing a standard facet. That ultimately leads to a tradeoff: do I want to spend a little more time at ingest and take up a little more space, or do I want to compute the values I need dynamically? The answer is certainly application-specific, but UDFs provide a handy (and sharp!) tool for the job.

Dave Cassel