Anchor Dates for Finding Recent Documents

by Dave Cassel Posted on July 19, 2017

In a previous post, I wrote a recipe for finding documents containing recent dates and used the following as part of the query, shown first in JavaScript and then in XQuery:

// JavaScript
cts.elementRangeQuery(
  fn.QName("", "pubdate"), "<=", fn.currentDateTime(),
  "score-function=reciprocal")

(: XQuery :)
cts:element-range-query(
  xs:QName("pubdate"), "<=", fn:current-dateTime(),
  "score-function=reciprocal")

One reviewer asked, “is there a disadvantage to specifying pubdate >= xs:dateTime(xs:date('0001-01-01')) with score-function=linear?” It turns out that there is.

When using score-function=reciprocal or score-function=linear, values near the anchor value will be more differentiated (and thus more useful for scoring) than values that are far away.
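
To make that claim concrete, here is a toy model in plain JavaScript. It is not MarkLogic's actual scoring formula: the decay shape, the `MAX_SCORE` ceiling, the `SCALE_DAYS` constant, and the rounding step are all assumptions chosen only to show how a score that flattens with distance, combined with a coarse score resolution, separates nearby values but merges distant ones.

```javascript
// Toy model only: NOT MarkLogic's actual scoring formula.
// Assumption: the score decays as 1 / (1 + distance / scale) and is then
// rounded to an integer, which stands in for a coarse score resolution.
const MAX_SCORE = 1000; // hypothetical score ceiling
const SCALE_DAYS = 365; // hypothetical distance at which the score halves

function toyReciprocalScore(distanceDays) {
  return Math.round(MAX_SCORE / (1 + distanceDays / SCALE_DAYS));
}

// Two pairs of values, each pair 30 days apart:
// one pair close to the anchor, one pair roughly 2,000 years away.
const nearGap = toyReciprocalScore(30) - toyReciprocalScore(60);
const farGap = toyReciprocalScore(730000) - toyReciprocalScore(730030);

console.log({ nearGap, farGap });
```

In this sketch the pair near the anchor ends up with clearly different scores, while the distant pair rounds to the same score even though both pairs are the same 30 days apart.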

To illustrate this, let’s generate some sample data. Using the code below, we can generate 100 simple documents, each containing a date that is between 1 and 100 months before the current date.

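(: Insert 100 documents; document $i has a pubdate $i months before today. :)
(: fn:current-date() yields an xs:date, matching the date range index and queries below. :)
for $i in (1 to 100)
return
  xdmp:document-insert(
    "/content/doc" || $i || ".xml",
    <doc>
      <pubdate>{ fn:current-date() - xs:yearMonthDuration("P" || $i || "M") }</pubdate>
    </doc>
  )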

We’re going to use a range query, so add an element range index of type date on the pubdate element.
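
The index can be added through the Admin UI, or scripted. The following is a minimal sketch using the Admin API from server-side JavaScript; it only runs inside MarkLogic (for example in Query Console) under a user with admin rights. The database name "Documents" is an assumption, so substitute the database that holds the sample documents.

```javascript
// Configuration sketch for MarkLogic server-side JavaScript (not Node.js).
const admin = require('/MarkLogic/admin.xqy');

// Assumption: the sample documents were inserted into the "Documents" database.
const dbId = xdmp.database('Documents');

// Arguments: scalar type, namespace URI, local name, collation, range value positions.
// The collation is empty because it applies only to string indexes.
const rangeSpec = admin.databaseRangeElementIndex('date', '', 'pubdate', '', fn.false());

let config = admin.getConfiguration();
config = admin.databaseAddRangeElementIndex(config, dbId, rangeSpec);
admin.saveConfiguration(config);
```

After saving, wait for any reindexing to finish before running the range queries.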

The first query uses score-function=reciprocal to see how far the dates in the documents are from today:

var jsearch = require('/MarkLogic/jsearch.sjs');
jsearch.documents()
  .where([
    cts.elementRangeQuery(
      fn.QName("", "pubdate"), "<=", fn.currentDate(),
      "score-function=reciprocal")
  ])
  .slice(0, 100)
  .result()

When we run this, documents come back in the correct order. The results at zero-based indexes 15 and 16 show the first score collision, with clumps of tied scores of gradually increasing size after that. We’re getting reasonable differentiation based on how far back the documents’ dates go; when combined with other relevancy factors, this should produce a good ordering.
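
If you want to locate the first collision without scanning the output by eye, a small helper along these lines would do. It is a sketch in plain JavaScript that takes an array of scores; pulling those scores out of the search output (for example from a numeric `score` property on each result) is an assumption about the result shape that you should check against your own output. The helper name and the sample scores are made up for illustration.

```javascript
// Returns the zero-based index of the first item whose score equals the
// score of the item right after it, or -1 if no adjacent scores collide.
function firstCollisionIndex(scores) {
  for (let i = 0; i + 1 < scores.length; i++) {
    if (scores[i] === scores[i + 1]) return i;
  }
  return -1;
}

// Made-up scores for illustration; not output from the query above.
const sampleScores = [90, 85, 81, 78, 78, 76, 76, 76];
console.log(firstCollisionIndex(sampleScores));
```

Because the results are already sorted by score, comparing each item with its neighbor is enough to find ties.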

Now let’s take a look at the opposite approach: how far away are the documents from an ancient time?

var jsearch = require('/MarkLogic/jsearch.sjs');
jsearch.documents()
  .where([
    cts.elementRangeQuery(
      fn.QName("", "pubdate"), ">=", xs.date("0001-01-01"),
      "score-function=linear")
  ])
  .slice(0, 100)
  .result()

All my documents have dates later than year 0001, and the further they are from that year, the higher the score should be. Sounds good, but the math behind the scenes emphasizes dates close to the anchor. In this case, the dates are far enough away that all documents got the same score. Thus, this score contribution is not useful for ordering recent results.
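
Some back-of-the-envelope arithmetic shows why. The sketch below is plain JavaScript and does not reproduce MarkLogic's scoring math; it only compares the spread of the sample dates (99 months between the newest and oldest document) with the distance from each anchor. The year count of 2016 for the ancient anchor is an approximation based on the 2017 posting date.

```javascript
// Rough arithmetic only; this is not MarkLogic's scoring formula.
const SAMPLE_SPREAD_MONTHS = 99;          // newest doc is 1 month old, oldest is 100
const MONTHS_FROM_TODAY = 100;            // distance from today's anchor to the oldest doc
const MONTHS_FROM_YEAR_ONE = 2016 * 12;   // approximate distance from 0001-01-01 to 2017

// Fraction of the anchor distance that the sample's spread represents.
function relativeSpread(spreadMonths, distanceMonths) {
  return spreadMonths / distanceMonths;
}

const todayRatio = relativeSpread(SAMPLE_SPREAD_MONTHS, MONTHS_FROM_TODAY);
const ancientRatio = relativeSpread(SAMPLE_SPREAD_MONTHS, MONTHS_FROM_YEAR_ONE);

console.log({ todayRatio, ancientRatio });
```

Measured from today, the sample covers nearly the whole distance to the oldest document. Measured from year 0001, the same 99 months amount to well under 1% of the distance, so a score with limited resolution has very little to work with.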

I also ran the experiment with dateTimes in place of dates, and the results were even more dramatic. With the finer granularity of dateTimes, the equations expect small differences to be significant. Therefore, big differences are poorly differentiated.

Conceptually, you might think you can approach distance scoring from either direction. In practice, if there’s an endpoint you care more about, use that as your anchor.

Further Reading

Recipe — Sort results to promote recent documents

Documentation — Relevance Scores: Understanding and Customizing

