Anchor Dates for Finding Recent Documents

In a previous post, I wrote a recipe for finding documents containing recent dates, using the following as part of the query:

// Server-Side JavaScript
cts.elementRangeQuery(
  fn.QName("", "pubdate"), "<=", fn.currentDateTime(),
  "score-function=reciprocal")

(: XQuery equivalent :)
cts:element-range-query(
  xs:QName("pubdate"), "<=", current-dateTime(),
  "score-function=reciprocal")

One reviewer asked, “Is there a disadvantage to specifying pubdate>=xs:dateTime(xs:date('0001-01-01')) score-function=linear?” It turns out that there is.

When using score-function=reciprocal or score-function=linear, values near the anchor value will be more differentiated (and thus more useful for scoring) than values that are far away.
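
To make the intuition concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not MarkLogic's actual scoring formula: it assumes a reciprocal curve whose result is rounded to an integer, and the function name, scale, and maximum score are all made up for illustration.

```javascript
// Toy model only: this is NOT MarkLogic's real scoring math.
// Assumption: a reciprocal curve, rounded to an integer score.
// toyReciprocalScore, scale, and maxScore are hypothetical.
function toyReciprocalScore(distance, scale = 10, maxScore = 100) {
  return Math.round(maxScore / (1 + distance / scale));
}

// Distances are in arbitrary units (think "months from the anchor").
const near = [1, 2, 3].map(d => toyReciprocalScore(d));
const far = [1001, 1002, 1003].map(d => toyReciprocalScore(d));

console.log(near); // [ 91, 83, 77 ]  each step changes the score
console.log(far);  // [ 1, 1, 1 ]     the same steps no longer register
```

In this model, one unit of distance near the anchor moves the score by several points, while the same unit far from the anchor is lost in rounding.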

To illustrate this, let’s generate some sample data. The code below creates 100 simple documents, each with a pubdate between 1 and 100 months before the current date.

for $i in (1 to 100)
return
  xdmp:document-insert(
    '/content/doc' || $i || ".xml",
    <doc>
      <pubdate>{ fn:current-dateTime() - xs:yearMonthDuration("P" || $i || "M") }</pubdate>
    </doc>
  )

We’re going to use a range query, so add an element range index on pubdate whose scalar type (date or dateTime) matches the values you plan to query.

The first query uses score-function=reciprocal to see how far the dates in the documents are from today:

var jsearch = require('/MarkLogic/jsearch.sjs');
jsearch.documents()
  .where([
    cts.elementRangeQuery(
      fn.QName("", "pubdate"), "<=", fn.currentDate(),
      "score-function=reciprocal")
  ])
  .slice(0, 100)
  .result()

When we run this, the documents come back in the correct order. The results at zero-based positions 15 and 16 show the first score collision, followed by clumps of gradually increasing size. We’re getting reasonable differentiation based on how far back the documents’ dates go; when combined with other relevancy factors, this should produce a good ordering.

Now let’s take a look at the opposite approach: how far away are the documents from an ancient time?

var jsearch = require('/MarkLogic/jsearch.sjs');
jsearch.documents()
  .where([
    cts.elementRangeQuery(
      fn.QName("", "pubdate"), ">=", xs.date("0001-01-01"),
      "score-function=linear")
  ])
  .slice(0, 100)
  .result()

All my documents have dates later than year 0001, and the further they are from that year, the higher the score should be. Sounds good, but the math behind the scenes emphasizes dates close to the anchor. In this case, the dates are far enough away that all documents got the same score. Thus, this score contribution is not useful for ordering recent results.

I also ran the experiment using dateTimes instead of dates, and the results were even more dramatic. Because dateTimes have finer granularity, the equations expect small differences to be significant, so big differences are poorly differentiated.
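
The granularity effect can be sketched the same way. This is again a toy model in plain JavaScript rather than MarkLogic's real math: it assumes a rounded reciprocal curve and assumes that the distance is counted in the value type's own unit (days for dates, seconds for dateTimes). All names and constants are hypothetical.

```javascript
// Toy model only: this is NOT MarkLogic's real scoring math.
// Assumptions: a rounded reciprocal curve, and a distance measured in the
// value type's own unit (days for dates, seconds for dateTimes).
function toyGranularityScore(distanceInUnits, scale = 1000, maxScore = 1000) {
  return Math.round(maxScore / (1 + distanceInUnits / scale));
}

const DAYS_PER_MONTH = 30;      // rough figure, for illustration
const SECONDS_PER_DAY = 86400;

// Documents dated 1, 2, and 3 months from the anchor.
const monthsBack = [1, 2, 3];

const scoredAsDates = monthsBack.map(
  m => toyGranularityScore(m * DAYS_PER_MONTH));
const scoredAsDateTimes = monthsBack.map(
  m => toyGranularityScore(m * DAYS_PER_MONTH * SECONDS_PER_DAY));

console.log(scoredAsDates);     // [ 971, 943, 917 ]  still distinguishable
console.log(scoredAsDateTimes); // [ 0, 0, 0 ]        collapsed to one score
```

With the same curve, counting in seconds makes every month-scale gap look enormous, so the scores fall onto the flat tail and stop separating the documents.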

Conceptually, you might think you can approach distance scoring from either direction. In practice, if there’s an endpoint you care more about, use that as your anchor.


Dave Cassel