Grokking the cts API

by Gabo Manuel Posted on March 26, 2012

As of the MarkLogic 10.0-3 release, the total number of built-in cts (“core text search”) functions comes in at 367! That already excludes deprecated functions. Given how central the cts functions are for building applications on MarkLogic, I thought it would help to provide some pointers in navigating this potentially overwhelming API.

But first of all, if you’re just getting started building a standard search application, you should start with the Search API (which uses and provides hooks into the cts API).

Having said that, here you go!

[Image: word cloud of cts function names]

Just kidding. While word clouds can be fun, they’re not always very useful. (I generated the above based on each function’s number of search hits on this website, so I suppose the result is somewhat interesting; just don’t put too much stock in it.)

Let’s take a tour through the cts API, using some categories I’ve chosen. We’ll knock down all functions, without necessarily explaining how they work. You’ll want to refer to the cts API documentation for those details.

The following list summarizes my breakdown by category:

  • Query Execution (3)
  • Query Objects (220)
  • Lexicon Functions (69)
  • Lexicon Reference (21)
  • Geospatial shapes (18)
  • Sorting (7)
  • Search result meta-data (6)
  • Miscellaneous (23)

Now let’s take a quick tour through each one.

Query execution (3)

The most important function of them all is cts:search, which executes cts queries (we’ll get to those next). A related and also important function is cts:contains, which matches a given node sequence against a given cts query, returning true if it matches and false otherwise. cts:walk is used in a similar manner to cts:contains, except that it returns the actual matches instead of just true or false. Three down, 364 to go!
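
As a rough sketch of the three functions side by side (the search word and the in-memory <p> element are made-up inputs for illustration, not taken from any particular database):

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory node used for cts:contains and cts:walk. :)
let $node := <p>The word lexicon lists every unique word.</p>
let $query := cts:word-query("lexicon")
return (
  (: cts:search: the first three database documents matching the query :)
  cts:search(fn:doc(), $query)[1 to 3],

  (: cts:contains: true if the node matches the query, false otherwise :)
  cts:contains($node, $query),

  (: cts:walk: evaluates an expression once per match;
     $cts:text is bound to the matched text :)
  cts:walk($node, $query, $cts:text)
)
```

Note that cts:contains and cts:walk also work on nodes that are not stored in the database, which makes them handy for experimenting.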

Query objects (220)

MarkLogic extends the XPath data model with an object type called “cts:query”, which is the super-type of a number of more specific cts:query sub-types. Queries can be composed together using the cts query constructor functions. They can then be executed by passing them to cts:search(), or passed to other functions such as lexicon calls or functions in other libraries, including search:resolve(), jsearch’s where clause, and many others. Apart from cts:parse, the names of the constructor functions all end in “query”, nearly always as the suffix “-query”. If you see a cts function whose name ends in “-query”, you can be assured that it’s a cts:query constructor.

Query constructors can be categorized into different kinds. I’m going to call them leaf, composite, and “special” (for lack of a better word).

Composite query constructors (12)

The composite query constructors build up new queries from other queries, whether leaf queries or other composite queries. Here they are broken down into a few sub-categories:

Category | Composite query constructors
Logical composition | cts:and-query, cts:and-not-query, cts:or-query, cts:not-query, cts:not-in-query
Element/Property scoping | cts:element-query, cts:json-property-scope-query
Fragment scoping | cts:document-fragment-query, cts:locks-fragment-query, cts:properties-fragment-query
Special queries | cts:boost-query, cts:near-query
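
To illustrate how composite constructors nest, here is a minimal sketch. The collection name and the search words are assumptions chosen for the example:

```xquery
xquery version "1.0-ml";

(: Hypothetical collection name and search words. :)
let $query :=
  cts:and-query((
    cts:collection-query("articles"),
    cts:or-query((
      cts:word-query("lexicon"),
      cts:word-query("index")
    )),
    cts:not-query(cts:word-query("deprecated"))
  ))
return cts:search(fn:doc(), $query)
```

Each composite constructor takes a sequence of queries, so the result can itself be nested inside another composite query.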

Leaf query constructors (35)

The leaf query constructors are for queries that can stand on their own, i.e. can be constructed without the help of another query constructor. The following list breaks them down into several categories, depending on what the query searches for (collection URIs, directories, words, values, etc.). I’ve marked some of the text with bold type to draw attention to the consistent naming conventions.

Object being searched | Leaf query constructors
collection URIs | cts:collection-query
document URIs | cts:document-query
directories | cts:directory-query
words | cts:element-attribute-word-query, cts:element-word-query, cts:field-word-query, cts:json-property-word-query, cts:word-query
values | cts:element-attribute-value-query, cts:element-value-query, cts:field-value-query, cts:json-property-value-query
range index | cts:element-attribute-range-query, cts:element-range-query, cts:field-range-query, cts:json-property-range-query, cts:path-range-query, cts:period-range-query, cts:range-query, cts:triple-range-query
geospatial | cts:element-attribute-pair-geospatial-query, cts:element-child-geospatial-query, cts:element-geospatial-query, cts:element-pair-geospatial-query, cts:geospatial-region-query, cts:json-property-child-geospatial-query, cts:json-property-geospatial-query, cts:json-property-pair-geospatial-query, cts:path-geospatial-query
timestamp | cts:after-query, cts:before-query, cts:lsqt-query, cts:period-compare-query
boolean | cts:false-query, cts:true-query

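Words and values differ in how they compare content against the search: a word query matches individual words, while a value query compares against the entire value. A JSON document containing {“Text”: “some content”} will match cts:word-query(“some”) but not cts:json-property-value-query(“Text”, “some”).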
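
The difference can be tried out on an in-memory document. This is only a sketch: the third expression, which passes the full value, is my own addition to the example in the text.

```xquery
xquery version "1.0-ml";

(: In-memory JSON document mirroring the example in the text. :)
let $doc := xdmp:unquote('{"Text": "some content"}')
return (
  (: word query: looks for the single word "some" anywhere in the document :)
  cts:contains($doc, cts:word-query("some")),

  (: value query with only part of the value :)
  cts:contains($doc, cts:json-property-value-query("Text", "some")),

  (: value query with the property's entire value :)
  cts:contains($doc, cts:json-property-value-query("Text", "some content"))
)
```

Going by the description above, the first and third expressions should match and the second should not.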

Another thing worth noticing about the word, value, and range queries above is that they have consistent ways of scoping queries: by element, by attribute, or by field. So we see a function for each pairing of scope (element, attribute, or field) and object (word, value, or range). We’ll see something similar with the lexicon functions. Stay tuned.

This scoping is enforced only in filtered search: there, an element-***-query returns only XML documents, while a json-property-***-query returns only JSON documents. In an unfiltered search, both element-***-query and json-property-***-query will return any JSON and XML documents that match the query. Of course, this does not apply to element-attribute-***-query, since JSON documents have no equivalent of attributes.

Special query constructors (5)

While the functions below each return a cts:query value, they don’t really fall into the above (leaf vs. composite) categories:

Function | Description
cts:query | constructs a cts:query from its XML representation
cts:registered-query | returns a previously registered query (using cts:register)
cts:reverse-query | returns a reverse query (for finding stored queries given a document, rather than stored documents given a query)
cts:similar-query | returns a query matching nodes similar to the given model nodes
cts:parse | converts a search string to an equivalent cts:query using a defined grammar

Okay, only 312 functions to go. (I promise the pace will pick up soon.)

Query accessors (168)

The query accessor functions aren’t very interesting at all, and there are 168 of them! They’re accessors for the various components of a cts:query value. You can recognize them using this failsafe technique: if you see a cts function whose name includes the string “-query-”, then it’s just an accessor. An example would be cts:word-query and its three accessors: cts:word-query-options, cts:word-query-text, and cts:word-query-weight. See a pattern?
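
A short sketch of the pattern, using cts:word-query and its three accessors (the word, option, and weight are arbitrary example values):

```xquery
xquery version "1.0-ml";

(: Build a query with an explicit option and weight, then take it apart again. :)
let $q := cts:word-query("hello", "case-sensitive", 2.0)
return (
  cts:word-query-text($q),     (: the text passed to the constructor :)
  cts:word-query-options($q),  (: the options in effect for the query :)
  cts:word-query-weight($q)    (: the weight passed to the constructor :)
)
```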

Lexicon functions (69)

Lexicon functions are much more interesting. Whereas cts queries are about efficiently finding documents, lexicon functions are about efficiently retrieving unique values (or words or URIs, etc.) from across a potentially large number of documents. They all require a particular index setting to be enabled. For “search,” think cts:search. For “analytics,” think lexicon functions.

String lexicons (14)

Below are the 14 string lexicon and lexicon wildcard functions, grouped by lexicon type. Note the consistent naming conventions (at the end of the function names).

Aggregate function | Wildcard function | Source
cts:uris | cts:uri-match | URI lexicon
cts:collections | cts:collection-match | Collection lexicon
cts:words | cts:word-match | Word lexicon
cts:element-words | cts:element-word-match | Element word lexicon
cts:element-attribute-words | cts:element-attribute-word-match | Attribute word lexicon
cts:json-property-words | cts:json-property-word-match | Element word lexicon
cts:field-words | cts:field-word-match | Field word lexicon (inside Fields)

Lexicons are typically found at the database configuration page of the Admin UI, except for Field word lexicon as noted above.
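
For instance, the wildcard functions take a pattern instead of returning the whole lexicon. In this sketch the URI pattern and word pattern are made up, and it assumes the URI lexicon and the word lexicon are both enabled on the database:

```xquery
xquery version "1.0-ml";

(: Assumes the URI lexicon and word lexicon are enabled on the database. :)
(
  (: document URIs matching a hypothetical directory and extension :)
  cts:uri-match("/articles/*.xml"),

  (: words in the word lexicon that start with "lex" :)
  cts:word-match("lex*")
)
```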

Scalar type specific lexicons (18)

Aggregate function | Wildcard function | Source
cts:values | cts:value-match | Range index
cts:element-values | cts:element-value-match | Element range index
cts:element-attribute-values | cts:element-attribute-value-match | Attribute range index
cts:field-values | cts:field-value-match | Field range index
cts:value-ranges | - | Range index
cts:element-value-ranges | - | Element range index
cts:element-attribute-value-ranges | - | Attribute range index
cts:field-value-ranges | - | Field range index
cts:value-co-occurrences | - | Range index
cts:element-value-co-occurrences | - | Element range index
cts:element-attribute-value-co-occurrences | - | Attribute range index
cts:field-value-co-occurrences | - | Field range index
cts:value-tuples | - | Range index
cts:triples | - | Triples range index

“Range index” in the table above stands for any of the element, attribute, and field range indexes; it also includes the collection and URI lexicons. Indexes are found on the left-hand side of the Admin UI when you click on a database (Configure >> Databases >> {database name} >> *** Index). These functions can be used to generate aggregate reports.
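
As an example of such a report, the following sketch lists the ten most frequent values of an element along with their counts. The <author> element is hypothetical, and the code assumes a string element range index is configured on it:

```xquery
xquery version "1.0-ml";

(: Assumes a string element range index on a hypothetical <author> element. :)
for $author in cts:element-values(
                 xs:QName("author"),
                 (),                               (: no start value :)
                 ("frequency-order", "limit=10"))  (: most frequent first :)
return fn:concat($author, ": ", cts:frequency($author))
```

cts:frequency, covered under tuple meta-data below, reports how often each returned value occurs.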

Geospatial lexicons (17)

Aggregate function | Wildcard function | Shape
cts:element-geospatial-values | cts:element-geospatial-value-match | Points
cts:element-child-geospatial-values | cts:element-child-geospatial-value-match | Points
cts:element-pair-geospatial-values | cts:element-pair-geospatial-value-match | Points
cts:element-attribute-pair-geospatial-values | cts:element-attribute-pair-geospatial-value-match | Points
cts:geospatial-co-occurrences | - | Point pairs
cts:element-value-geospatial-co-occurrences | - | Point pairs
cts:element-attribute-value-geospatial-co-occurrences | - | Point pairs
cts:geospatial-boxes | - | Boxes
cts:element-geospatial-boxes | - | Boxes
cts:element-pair-geospatial-boxes | - | Boxes
cts:element-child-geospatial-boxes | - | Boxes
cts:element-attribute-pair-geospatial-boxes | - | Boxes
cts:match-regions | - | Polygon

These functions require the corresponding geospatial index (element, element pair, element child, or element attribute pair). Which of these you use depends on how you chose to represent geospatial coordinates in your data.

Math-specific aggregates (19)

These are functions that will perform the mathematical computations for you.

cts:aggregate | cts:linear-model | cts:rank*
cts:correlation | cts:max | cts:stddev
cts:avg-aggregate | cts:median* | cts:stddev-p
cts:covariance | cts:min | cts:sum-aggregate
cts:covariance-p | cts:percent-rank* | cts:variance
cts:count-aggregate | cts:percentile* | cts:variance-p
cts:triple-value-statistics

*These functions take in a sequence (or an array) of values. The rest of the functions require a range index or collation.

Tuple meta-data (1)

This only contains the function cts:frequency.

Lexicon reference functions (21)

Constructors (17)

Reference function | Target
cts:uri-reference | URI lexicon
cts:collection-reference | Collection lexicon
cts:element-reference | Element range index
cts:json-property-reference | Element range index
cts:element-attribute-reference | Attribute range index
cts:field-reference | Field range index
cts:path-reference | Path range index
cts:geospatial-element-reference | Geospatial element point range index
cts:geospatial-json-property-reference | Geospatial element point range index
cts:geospatial-attribute-pair-reference | Geospatial element attribute point range index
cts:geospatial-element-child-reference | Geospatial element child point range index
cts:geospatial-json-property-child-reference | Geospatial element child point range index
cts:geospatial-element-pair-reference | Geospatial element pair point range index
cts:geospatial-json-property-pair-reference | Geospatial element pair point range index
cts:geospatial-path-reference | Geospatial path point range index
cts:geospatial-region-path-reference | Geospatial region range index
cts:reference-parse | Any index represented by the XML to be parsed

These functions are often used with the string and scalar type-specific lexicon functions mentioned in the previous section.
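
A minimal sketch of that pairing follows. The <author> and <year> elements are hypothetical, and the code assumes an element range index exists on each:

```xquery
xquery version "1.0-ml";

(: Assumes element range indexes on hypothetical <author> and <year> elements. :)
let $author := cts:element-reference(xs:QName("author"))
let $year   := cts:element-reference(xs:QName("year"))
return (
  (: all unique values from one index :)
  cts:values($author),

  (: combinations of values from both indexes that occur together :)
  cts:value-tuples(($author, $year))
)
```

Because the reference carries the index details, the same generic functions (cts:values, cts:value-tuples) work for element, attribute, field, and path indexes alike.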

Accessors (4)

cts:reference-collation | cts:reference-nullable
cts:reference-coordinate-system | cts:reference-scalar-type

Geospatial shapes and accessors (18)

Shape | Accessors
cts:point | cts:point-latitude, cts:point-longitude
cts:linestring | cts:linestring-vertices
cts:circle | cts:circle-center, cts:circle-radius
cts:box | cts:box-east, cts:box-north, cts:box-south, cts:box-west
cts:polygon | cts:polygon-vertices
cts:complex-polygon | cts:complex-polygon-inner, cts:complex-polygon-outer

Note that functions like cts:***-intersects and cts:***-contains are now deprecated. Switch to the geo library.

Most commonly, you use these shapes to construct geospatial queries. So first you construct a cts:region (using one or more of the above constructor functions). Then, you construct a geospatial cts:query (using a geospatial query function such as cts:element-geospatial-query), passing it the cts:region(s) you constructed. Finally, you pass the query to cts:search to run a geospatial search, or to a lexicon function to perform some geospatial-related analytics.
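
Those three steps might look like the sketch below. The <location> element and the coordinates are made up, and the code assumes that element holds point data in a form MarkLogic can read as a geospatial point:

```xquery
xquery version "1.0-ml";

(: Step 1: construct a region. Radius 10, in miles unless an option says otherwise. :)
let $region := cts:circle(10, cts:point(37.77, -122.42))

(: Step 2: construct a geospatial query over a hypothetical <location> element. :)
let $query := cts:element-geospatial-query(xs:QName("location"), $region)

(: Step 3: run the search. :)
return cts:search(fn:doc(), $query)
```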

Sorting (7)

These constructors are typically used to specify which document information to use to “pre-sort” the response of cts:search, jsearch, and search:search.

Constructor | Sorted by
cts:index-order | Sort based on range index
cts:document-order | Sort based on the hash of the document URI
cts:quality-order | Sort based on document quality
cts:score-order | Sort based on search score. Affected by document quality and document frequency
cts:fitness-order | Sort based on fitness. Not affected by document quality nor by document frequency
cts:confidence-order | Sort by confidence. Not affected by document quality
cts:unordered | No particular order (#iDon’tCare)
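
As a sketch of how one of these constructors is passed to cts:search (the <published> element and the search word are hypothetical, and an element range index on that element is assumed):

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on a hypothetical <published> element. :)
cts:search(
  fn:doc(),
  cts:word-query("lexicon"),
  (: newest first, according to the range index values :)
  cts:index-order(cts:element-reference(xs:QName("published")), "descending")
)[1 to 10]
```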

Search result meta-data functions (6)

The result of a call to cts:search() is a sequence of nodes that reside in your database. But these node references also contain some special properties (six, to be precise) that extend beyond the XPath data model. They’re very handy for building search applications since they relate to things like search relevance:

Function | Purpose
cts:score | log(term frequency) * (inverse document frequency) + (QualityWeight * Quality)
cts:quality | Document quality
cts:confidence | Score without document frequency
cts:fitness | Confidence without the effect of document quality
cts:relevance-info | Relevance score
cts:remainder | Estimate of the remaining fragments to process
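
These functions are called on the nodes that cts:search returns. A small sketch, with an arbitrary search word and an output format chosen only for readability:

```xquery
xquery version "1.0-ml";

(: Print the URI and relevance figures for the top five results. :)
for $doc in cts:search(fn:doc(), cts:word-query("lexicon"))[1 to 5]
return fn:string-join((
  xdmp:node-uri($doc),
  fn:string(cts:score($doc)),
  fn:string(cts:confidence($doc)),
  fn:string(cts:fitness($doc))
), " | ")
```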

Miscellaneous categories (23)

“Miscellaneous” is a popular category in my family’s monthly budget, but I digress. I’ll try to break down these last remaining functions into some sub-categories:

Category | Functions
Parsing/tokenization | cts:stem, cts:tokenize, cts:part-of-speech, cts:distinctive-terms
Registered query | cts:deregister, cts:register
Classifier | cts:classify, cts:thresholds, cts:train
Temporal | cts:period, cts:period-compare
Clustering | cts:cluster
Entity Services | cts:entity, cts:entity-dictionary, cts:entity-dictionary-parse, cts:entity-highlight
Result node manipulation | cts:element-walk, cts:highlight
XPath validation | cts:valid-document-patch-path, cts:valid-extract-path, cts:valid-index-path, cts:valid-optic-path, cts:valid-tde-context

I’m not going to explain these (or fall on any swords defending their categorization). The important thing is that the cts API looks a lot less overwhelming to you now, right? There’s a hidden wisdom to it all—an underlying logic, a latent brilliance, a method to the madness…sorry, got a little carried away there.

Conclusion

Congratulations, you made it through the whole tour! As a reward, here’s a little code to look at. It’s the query I ran to generate the data for the Wordle shown at the beginning of the article. And, yes, it does use the cts API:

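(: Read every function name from the attribute value lexicon :)
for $func-name in cts:element-attribute-values(xs:QName("function"),
                                               xs:QName("fullname"))
(: Keep only the cts functions :)
where starts-with($func-name,"cts:")
return
  (: Estimate how many documents mention the function name :)
  concat($func-name,":",xdmp:estimate(cts:search(collection(),$func-name)))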

And if you’re thinking to yourself that I must have a range index enabled on my database since I’m calling a value lexicon, you’re right. Well done.

