Grokking the cts API

by Gabo Manuel Posted on March 26, 2012

As of the MarkLogic 10.0-3 release, the total number of built-in cts (“core text search”) functions comes in at 367! That already excludes deprecated functions. Given how central the cts functions are for building applications on MarkLogic, I thought it would help to provide some pointers in navigating this potentially overwhelming API.

But first of all, if you’re just getting started building a standard search application, you should start with the Search API (which uses and provides hooks into the cts API).

Having said that, here you go!

Just kidding. While word clouds can be fun, they’re not always very useful. (I generated the above based on each function’s number of search hits on this website, so I suppose the result is somewhat interesting; just don’t put too much stock in it.)

Let’s take a tour through the cts API, using some categories I’ve chosen. We’ll knock down all functions, without necessarily explaining how they work. You’ll want to refer to the cts API documentation for those details.

The following list summarizes my breakdown by category:

Query Execution (3)
Query Objects (220)
Lexicon Functions (69)
Lexicon Reference (21)
Geospatial shapes (18)
Sorting (7)
Search result meta-data (6)
Miscellaneous (23)

Now let’s take a quick tour through each one.

Query execution (3)

The most important function of them all is cts:search, which executes cts queries (we’ll get to those next). A related and also important function is cts:contains, which matches a given node sequence against a given cts query, returning true if it matches and false otherwise. cts:walk is used in a similar manner to cts:contains, except that it returns the actual matches instead of just true or false. Three down, 364 to go!
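Here is a minimal sketch of the three functions side by side. The in-memory paragraph and the search word are made up for illustration, and the cts:search call assumes your database actually contains some documents.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory node, so cts:contains and cts:walk need no database content :)
let $node := <p>The quick brown fox jumps over the lazy dog.</p>
let $query := cts:word-query("fox")
return (
  (: true if the node matches the query, false otherwise :)
  cts:contains($node, $query),
  (: evaluates the expression once per match; $cts:text holds the matched text :)
  cts:walk($node, $query, $cts:text),
  (: runs the query against the database and keeps the first five matching documents :)
  cts:search(fn:doc(), $query)[1 to 5]
)
```

You can paste this into Query Console as is; only the last expression depends on what is stored in the database.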

Query objects (220)

MarkLogic extends the XPath data model with an object type called “cts:query”, which is the super-type of a number of more specific cts:query sub-types. Queries can be composed together using the cts query constructor functions. They can then be executed by passing them to cts:search() or passed to other functions, such as lexicon calls or functions in other libraries, including search:resolve(), jsearch’s where clause and many others. With one exception (cts:parse, covered below), all of these function names end in “-query”. If you see a cts function whose name ends in “-query”, you can be assured that it’s a cts:query constructor.

Query constructors can be categorized into different kinds. I’m going to call them leaf, composite, and “special” (for lack of a better word).

Composite query constructors (12)

The composite query constructors build up new queries from other queries, whether leaf queries or other composite queries. Here they are broken down into a few sub-categories:

Category	Composite query constructor
Logical composition	cts:and-query cts:and-not-query cts:or-query cts:not-query cts:not-in-query
Element/Property scoping	cts:element-query cts:json-property-scope-query
Fragment scoping	cts:document-fragment-query cts:locks-fragment-query cts:properties-fragment-query
Special queries	cts:boost-query cts:near-query
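To give a feel for how composition works, here is a small sketch that nests a few of the constructors above. The sample document, element names, and search words are all hypothetical; I use cts:contains against an in-memory node so that nothing needs to be loaded first.

```xquery
xquery version "1.0-ml";

(: Hypothetical document used only for illustration :)
let $doc :=
  <article>
    <title>Searching with indexes</title>
    <body>Lexicons complement search.</body>
  </article>
let $query :=
  cts:and-query((
    (: element scoping: the word must occur inside a title element :)
    cts:element-query(xs:QName("title"), cts:word-query("indexes")),
    (: logical composition: either word is enough :)
    cts:or-query((cts:word-query("lexicons"), cts:word-query("facets"))),
    (: exclude anything that mentions this word :)
    cts:not-query(cts:word-query("deprecated"))
  ))
return cts:contains($doc, $query)
```

The same $query value could just as well be passed to cts:search; the composition itself does not depend on where the query is executed.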

Leaf query constructors (35)

The leaf query constructors are for queries that can stand on their own, i.e. can be constructed without the help of another query constructor. The following list breaks them down into several categories, depending on what the query searches for (collection URIs, directories, words, values, etc.). I’ve marked some of the text with bold type to draw attention to the consistent naming conventions.

Object being searched	Leaf query constructors
collection URIs	cts:collection-query
document URIs	cts:document-query
directories	cts:directory-query
words	cts:element-attribute-word-query cts:element-word-query cts:field-word-query cts:json-property-word-query cts:word-query
values	cts:element-attribute-value-query cts:element-value-query cts:field-value-query cts:json-property-value-query
range index	cts:element-attribute-range-query cts:element-range-query cts:field-range-query cts:json-property-range-query cts:path-range-query cts:period-range-query cts:range-query cts:triple-range-query
geospatial	cts:element-attribute-pair-geospatial-query cts:element-child-geospatial-query cts:element-geospatial-query cts:element-pair-geospatial-query cts:geospatial-region-query cts:json-property-child-geospatial-query cts:json-property-geospatial-query cts:json-property-pair-geospatial-query cts:path-geospatial-query
timestamp	cts:after-query cts:before-query cts:lsqt-query cts:period-compare-query
boolean	cts:false-query cts:true-query

Words and values differ in how they compare content against the search. A JSON document containing {"Text": "some content"} will match cts:word-query("some") but not cts:json-property-value-query("Text", "some"), because a value query has to match the property’s entire value.
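A quick way to see the difference for yourself is to test both kinds of query against an in-memory JSON node. This is only a sketch; it assumes the property name is passed as the first argument of cts:json-property-value-query, with the value second.

```xquery
xquery version "1.0-ml";

(: Parse the sample JSON from the paragraph above into an in-memory document :)
let $doc := xdmp:unquote('{"Text": "some content"}')
return (
  (: word query: "some" is one of the words in the value :)
  cts:contains($doc, cts:word-query("some")),
  (: value query with only part of the value :)
  cts:contains($doc, cts:json-property-value-query("Text", "some")),
  (: value query with the complete value :)
  cts:contains($doc, cts:json-property-value-query("Text", "some content"))
)
```

Going by the description above, the first and third checks should match and the second should not.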

Another thing worth noticing about the word, value, and range queries above is that they have consistent ways of scoping queries: by element, by attribute, by JSON property, or by field. So we see a function for each pairing of scope (element, attribute, JSON property, or field) and object (word, value, or range). We’ll see something similar with the lexicon functions. Stay tuned.

This scoping applies to filtered search, i.e. we expect documents for element-***-query to return only XML documents while json-***-query would only return JSON documents. For unfiltered search, element-***-query and json-***-query will return both JSON and XML documents that match the query. Of course this does not apply to element-attribute-***-query since there is no such thing for JSON documents.
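If you want to compare the two modes on your own data, a sketch like the following counts results both ways. The element name and value are hypothetical; "unfiltered" is passed as a cts:search option, and filtered is the default.

```xquery
xquery version "1.0-ml";

(: Hypothetical element name and value; substitute something that exists in your data :)
let $query := cts:element-value-query(xs:QName("title"), "Grokking the cts API")
return (
  (: filtered (the default): candidates from the indexes are checked against the documents :)
  fn:count(cts:search(fn:doc(), $query)),
  (: unfiltered: results come straight from the indexes :)
  fn:count(cts:search(fn:doc(), $query, "unfiltered"))
)
```

If the two counts differ, the extra unfiltered results are the ones the filtering step would have thrown out.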

Special query constructors (5)

While the functions below each return a cts:query value, they don’t really fall into the above (leaf vs. composite) categories:

Function	Description
cts:query	constructs a cts:query from its XML representation
cts:registered-query	returns a previously registered query (using cts:register)
cts:reverse-query	returns a reverse query (for finding stored queries given a document, rather than stored documents given a query)
cts:similar-query	returns a query matching nodes similar to the given model nodes
cts:parse	converts a search string to an equivalent cts:query using a defined grammar.

Okay, only 312 functions to go. (I promise the pace will pick up soon.)
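Two of the special constructors are easy to try out together. The sketch below parses a made-up search string, serializes the resulting query to XML, and then rebuilds it with cts:query. The wrapper element name is arbitrary.

```xquery
xquery version "1.0-ml";

(: cts:parse turns a search string into a cts:query using its default grammar :)
let $parsed := cts:parse("cat AND dog")
(: placing a cts:query inside an element serializes it to its XML representation :)
let $xml := <wrapper>{$parsed}</wrapper>/*
(: cts:query rebuilds a query object from that XML :)
let $rebuilt := cts:query($xml)
return (
  $xml,
  cts:contains(<p>a cat and a dog</p>, $rebuilt)
)
```

The XML form is handy when you need to store a query in a document, which is also what makes reverse queries possible.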

Query accessors (168)

The query accessor functions aren’t very interesting at all, and there are 168 of them! They’re accessors for the various components of a cts:query value. You can recognize them using this failsafe technique: if you see a cts function whose name includes the string “-query-”, then it’s just an accessor. An example would be cts:word-query and its three accessors: cts:word-query-options, cts:word-query-text, and cts:word-query-weight. See a pattern?
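In case the pattern isn’t obvious, here is a tiny sketch that builds a word query and then reads its parts back. The words, option, and weight are arbitrary values chosen for illustration.

```xquery
xquery version "1.0-ml";

(: Build a word query with explicit text, options, and weight :)
let $query := cts:word-query(("fox", "dog"), "case-insensitive", 2)
return (
  cts:word-query-text($query),     (: the words or phrases being searched for :)
  cts:word-query-options($query),  (: the options in effect for this query :)
  cts:word-query-weight($query)    (: the weight given to this query :)
)
```

Every other "-query-" accessor follows the same shape: pass in a query of the matching type and get one of its components back.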

Lexicon functions (69)

Lexicon functions are much more interesting. Whereas cts queries are about efficiently finding documents, lexicon functions are about efficiently retrieving unique values (or words or URIs, etc.) from across a potentially large number of documents. They all require a particular index setting to be enabled. For “search,” think cts:search. For “analytics,” think lexicon functions.

String lexicons (14)

Below are the 14 string lexicon and lexicon wildcard functions, grouped by lexicon type. Note the consistent naming conventions (at the end of the function names).

Aggregate Function	Wildcard function	Source
cts:uris	cts:uri-match	URI lexicon
cts:collections	cts:collection-match	Collection lexicon
cts:words	cts:word-match	Word lexicon
cts:element-words	cts:element-word-match	Element word lexicon
cts:element-attribute-words	cts:element-attribute-word-match	Attribute word lexicon
cts:json-property-words	cts:json-property-word-match	Element word lexicon
cts:field-words	cts:field-word-match	Field word lexicon (inside Fields)

Lexicons are typically found at the database configuration page of the Admin UI, except for Field word lexicon as noted above.

Scalar type specific lexicons (18)

Aggregate Function	Wildcard function	Source
cts:values	cts:value-match	Range index
cts:element-values	cts:element-value-match	Element range index
cts:element-attribute-values	cts:element-attribute-value-match	Attribute range index
cts:field-values	cts:field-value-match	Field range index
cts:value-ranges	(none)	Range index
cts:element-value-ranges	(none)	Element range index
cts:element-attribute-value-ranges	(none)	Attribute range index
cts:field-value-ranges	(none)	Field range index
cts:value-co-occurrences	(none)	Range index
cts:element-value-co-occurrences	(none)	Element range index
cts:element-attribute-value-co-occurrences	(none)	Attribute range index
cts:field-value-co-occurrences	(none)	Field range index
cts:value-tuples	(none)	Range index
cts:triples	(none)	Triples range index

The range index above is a combination of element, attribute and field range index. “Range index” also includes the collection and URI lexicons. Indexes are found on the left-hand side of the Admin UI when you click on a database (Configure >> Databases >> {database name} >> *** Index). These functions can be used to generate aggregate reports.
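As a sketch of what such a report might look like, the following lists the ten most frequent values of an element along with their counts, and then runs a wildcard match over the same lexicon. It assumes a string element range index on a hypothetical "author" element; without that index the calls will raise an error.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on a hypothetical "author" element :)
(
  (: the ten most frequent values, each with the number of fragments it occurs in :)
  for $author in cts:element-values(xs:QName("author"), (),
                                    ("frequency-order", "limit=10"))
  return fn:concat($author, ": ", cts:frequency($author)),

  (: the wildcard counterpart: values in the same lexicon that match a pattern :)
  cts:element-value-match(xs:QName("author"), "M*")
)
```

Note that the counts come from cts:frequency, which is covered under tuple meta-data further down.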

Geospatial lexicons (17)

Aggregate Function	Wildcard function	Shape
cts:element-geospatial-values	cts:element-geospatial-value-match	Points
cts:element-child-geospatial-values	cts:element-child-geospatial-value-match	Points
cts:element-pair-geospatial-values	cts:element-pair-geospatial-value-match	Points
cts:element-attribute-pair-geospatial-values	cts:element-attribute-pair-geospatial-value-match	Points
cts:geospatial-co-occurrences	(none)	Point pairs
cts:element-value-geospatial-co-occurrences	(none)	Point pairs
cts:element-attribute-value-geospatial-co-occurrences	(none)	Point pairs
cts:geospatial-boxes	(none)	Boxes
cts:element-geospatial-boxes	(none)	Boxes
cts:element-pair-geospatial-boxes	(none)	Boxes
cts:element-child-geospatial-boxes	(none)	Boxes
cts:element-attribute-pair-geospatial-boxes	(none)	Boxes
cts:match-regions	(none)	Polygon

Each of these functions requires the corresponding geospatial index (element, element pair, element-child, or element attribute pair). Which of these you use depends on how you chose to represent geospatial coordinates in your data.

Math-specific aggregates (19)

These are functions that will perform the mathematical computations for you.

cts:aggregate	cts:linear-model	cts:rank*
cts:correlation	cts:max	cts:stddev
cts:avg-aggregate	cts:median*	cts:stddev-p
cts:covariance	cts:min	cts:sum-aggregate
cts:covariance-p	cts:percent-rank*	cts:variance
cts:count-aggregate	cts:percentile*	cts:variance-p
cts:triple-value-statistics

*These functions take in a sequence (or an array) of values. The rest of the functions require a range index or collation.
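The two flavors look like this in practice. This is only a sketch: the numbers are made up, and the last call assumes a numeric element range index on a hypothetical "price" element.

```xquery
xquery version "1.0-ml";

(: Made-up sample values for the sequence-based functions :)
let $values := (3, 1, 4, 1, 5, 9, 2, 6)
return (
  (: sequence-based: computed directly on the values passed in :)
  cts:median($values),
  cts:percentile($values, (0.25, 0.75)),
  (: index-based: assumes a numeric range index on a hypothetical "price" element :)
  cts:avg-aggregate(cts:element-reference(xs:QName("price")))
)
```

The sequence-based calls run anywhere, while the index-based call is computed from the range index without touching the documents themselves.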

Tuple meta-data (1)

This only contains the function cts:frequency.

Lexicon reference functions (21)

Constructors (17)

Reference Function	Target
cts:uri-reference	URI lexicon
cts:collection-reference	Collection lexicon
cts:element-reference	Element range index
cts:json-property-reference	Element range index
cts:element-attribute-reference	Attribute range index
cts:field-reference	Field range index
cts:path-reference	Path range index
cts:geospatial-element-reference	Geospatial element point range index
cts:geospatial-json-property-reference	Geospatial element point range index
cts:geospatial-attribute-pair-reference	Geospatial element attribute point range index
cts:geospatial-element-child-reference	Geospatial element child point range index
cts:geospatial-json-property-child-reference	Geospatial element child point range index
cts:geospatial-element-pair-reference	Geospatial element pair point range index
cts:geospatial-json-property-pair-reference	Geospatial element pair point range index
cts:geospatial-path-reference	Geospatial path point range index
cts:geospatial-region-path-reference	Geospatial region range index
cts:reference-parse	Any index represented by the XML to be parsed.

These functions are often used with the string and scalar type-specific lexicon functions mentioned in the previous section.
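For instance, cts:values accepts any of these references, so one function can read from several different lexicons. The sketch below assumes the URI lexicon and collection lexicon are enabled, plus an element range index on a hypothetical "author" element.

```xquery
xquery version "1.0-ml";

(
  (: assumes the URI lexicon is enabled :)
  cts:values(cts:uri-reference(), (), "limit=5"),
  (: assumes the collection lexicon is enabled :)
  cts:values(cts:collection-reference(), (), "limit=5"),
  (: assumes an element range index on a hypothetical "author" element :)
  cts:values(cts:element-reference(xs:QName("author")), (), "limit=5")
)
```

Each call returns at most five values, thanks to the "limit=5" option.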

Accessors (4)

cts:reference-collation	cts:reference-nullable
cts:reference-coordinate-system	cts:reference-scalar-type

Geospatial shapes and accessors (18)

Shape	Accessor
cts:point	cts:point-latitude cts:point-longitude
cts:linestring	cts:linestring-vertices
cts:circle	cts:circle-center cts:circle-radius
cts:box	cts:box-east cts:box-north cts:box-south cts:box-west
cts:polygon	cts:polygon-vertices
cts:complex-polygon	cts:complex-polygon-inner cts:complex-polygon-outer

Note that functions like cts:***-intersects and cts:***-contains are now deprecated. Switch to the geo library.

Most commonly, you use these shapes to construct geospatial queries. So first you construct a cts:region (using one or more of the above constructor functions). Then, you construct a geospatial cts:query (using a geospatial query function such as cts:element-geospatial-query), passing it the cts:region(s) you constructed. Finally, you pass the query to cts:search to run a geospatial search, or to a lexicon function to perform some geospatial-related analytics.
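Those three steps can be sketched as follows. The coordinates, radius, and "location" element name are hypothetical, and I’m assuming the element holds a point as "latitude longitude" text and that the radius is in miles (the default unit, as far as I know).

```xquery
xquery version "1.0-ml";

(: Step 1: construct a region, here a circle of radius 10 around a made-up point :)
let $center := cts:point(37.5, -122.3)
let $region := cts:circle(10, $center)
(: Step 2: wrap the region in a geospatial query on a hypothetical "location" element :)
let $query := cts:element-geospatial-query(xs:QName("location"), $region)
return (
  (: the shape accessors read the components back out :)
  cts:circle-radius($region),
  cts:point-latitude($center),
  (: Step 3: run the geospatial search and keep the first five hits :)
  cts:search(fn:doc(), $query)[1 to 5]
)
```

Swapping in cts:box or cts:polygon at step 1 leaves the other two steps unchanged.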

Sorting (7)

These constructors are typically used to specify which document information to use to “pre-sort” the response of cts:search, jsearch, and search:search.

Constructor	Sorted by
cts:index-order	Sort based on range-index
cts:document-order	Sort based on the hash of the document URI
cts:quality-order	Sort based on document quality
cts:score-order	Sort based on search score. Affected by document quality and document frequency
cts:fitness-order	Sort based on fitness. Not affected by document quality nor by document frequency
cts:confidence-order	Sort by confidence. Not affected by document quality
cts:unordered	#iDon’tCare
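Here is a sketch of how an ordering is passed to cts:search. It assumes an element range index on a hypothetical "published" element; the search word is arbitrary.

```xquery
xquery version "1.0-ml";

cts:search(
  fn:doc(),
  cts:word-query("marklogic"),
  (
    (: primary sort: newest first; assumes a range index on a hypothetical "published" element :)
    cts:index-order(cts:element-reference(xs:QName("published")), "descending"),
    (: secondary sort: higher scores first among equal values :)
    cts:score-order("descending")
  )
)[1 to 10]
```

Because the ordering is resolved from the index, taking the first ten results doesn’t require sorting every matching document in your own code.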

Search result meta-data functions (6)

The result of a call to cts:search() is a sequence of nodes that reside in your database. But these node references also contain some special properties (six, to be precise) that extend beyond the XPath data model. They’re very handy for building search applications since they relate to things like search relevance:

Function	Purpose
cts:score	log(term frequency) * (inverse document frequency) + (QualityWeight * Quality)
cts:quality	Document quality
cts:confidence	Score without document frequency
cts:fitness	Confidence without the effect of document quality
cts:relevance-info	Report on how the relevance score was computed (requires the relevance-trace search option)
cts:remainder	Estimate of the remaining fragments to process.
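A typical use is to report these values alongside each hit. In this sketch the search word and the "hit" element are made up; the functions are called on each node as it comes out of cts:search.

```xquery
xquery version "1.0-ml";

(: Report relevance meta-data for the first five hits of a made-up word search :)
for $hit in cts:search(fn:doc(), cts:word-query("marklogic"))[1 to 5]
return
  <hit uri="{xdmp:node-uri($hit)}"
       score="{cts:score($hit)}"
       confidence="{cts:confidence($hit)}"
       fitness="{cts:fitness($hit)}"
       quality="{cts:quality($hit)}"/>
```

These functions only make sense on nodes that came from a search, since the values are attached to the result rather than stored in the document.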

Miscellaneous categories (23)

“Miscellaneous” is a popular category in my family’s monthly budget, but I digress. I’ll try to break down these last remaining functions into some sub-categories:

Category	Function
Parsing/tokenization	cts:stem cts:tokenize cts:part-of-speech cts:distinctive-terms
Registered query	cts:deregister cts:register
Classifier	cts:classify cts:thresholds cts:train
Temporal	cts:period cts:period-compare
Clustering	cts:cluster
Entity Services	cts:entity cts:entity-dictionary cts:entity-dictionary-parse cts:entity-highlight
Result node manipulation	cts:element-walk cts:highlight
XPath validation	cts:valid-document-patch-path cts:valid-extract-path cts:valid-index-path cts:valid-optic-path cts:valid-tde-context

I’m not going to explain these (or fall on any swords defending their categorization). The important thing is that the cts API looks a lot less overwhelming to you now, right? There’s a hidden wisdom to it all—an underlying logic, a latent brilliance, a method to the madness…sorry, got a little carried away there.
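Still, a quick taste of three of them doesn’t hurt. This sketch uses made-up input strings and an in-memory paragraph, so it should not depend on anything in your database.

```xquery
xquery version "1.0-ml";

(
  (: split a string into word, punctuation, and whitespace tokens :)
  cts:tokenize("Grokking the cts API!"),
  (: return the stem(s) of a word, using the default language :)
  cts:stem("running"),
  (: wrap each match of the query in a b element; $cts:text is the matched text :)
  cts:highlight(<p>the quick brown fox</p>, cts:word-query("fox"), <b>{$cts:text}</b>)
)
```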

Conclusion

Congratulations, you made it through the whole tour! As a reward, here’s a little code to look at. It’s the query I ran to generate the data for the Wordle shown at the beginning of the article. And, yes, it does use the cts API:

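(: Read every function name from the value lexicon and pair it with a hit count :)
for $func-name in cts:element-attribute-values(xs:QName("function"),
                                               xs:QName("fullname"))
where starts-with($func-name, "cts:")
return
  (: xdmp:estimate resolves the count from the indexes, without fetching documents :)
  concat($func-name, ":",
         xdmp:estimate(cts:search(collection(), cts:word-query($func-name))))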

And if you’re thinking to yourself that I must have a range index enabled on my database since I’m calling a value lexicon, you’re right. Well done.
