As of the MarkLogic 10.0-3 release, the total number of built-in cts (“core text search”) functions comes in at 387! That already excludes deprecated functions. Given how central the cts functions are for building applications on MarkLogic, I thought it would help to provide some pointers in navigating this potentially overwhelming API.
But first of all, if you’re just getting started building a standard search application, you should start with the Search API (which uses and provides hooks into the cts API).
Having said that, here you go!
Just kidding. While word clouds can be fun, they’re not always very useful. (I generated the above based on each function’s number of search hits on this website, so I suppose the result is somewhat interesting; just don’t put too much stock in it.)
Let’s take a tour through the cts API, using some categories I’ve chosen. We’ll knock down all functions, without necessarily explaining how they work. You’ll want to refer to the cts API documentation for those details.
The following list summarizes my breakdown by category:
Now let’s take a quick tour through each one.
The most important function of them all is cts:search, which is concerned with executing cts queries (we’ll get to those next). A related and also important function is cts:contains which matches a given node sequence against a given cts query, returning true if it matches and false otherwise. cts:walk is used in a similar manner as cts:contains except it returns the actual match instead of just true or false. Three down, 384 to go!
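To make the difference between these three concrete, here is a minimal sketch. The search term and the in-memory `<p>` element are made up for illustration, and the first expression assumes your database holds documents containing that word.

```xquery
xquery version "1.0-ml";

(: Hypothetical search term; any cts:query would do here. :)
let $query := cts:word-query("marklogic")
(: Hypothetical in-memory node to test against. :)
let $node := <p>Getting started with MarkLogic</p>
return (
  (: cts:search runs the query against the database and returns matching nodes :)
  cts:search(fn:doc(), $query)[1 to 5],
  (: cts:contains only answers yes or no for the nodes you hand it :)
  cts:contains($node, $query),
  (: cts:walk evaluates an expression for each match; $cts:text is the matched text :)
  cts:walk($node, $query, $cts:text)
)
```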
MarkLogic extends the XPath data model with an object type called “cts:query”, which is the super-type of a number of more specific cts:query sub-types. Queries can be composed together using the cts query constructor functions. They can then be executed by passing them to cts:search() or passed to other functions, such as lexicon calls or functions in other libraries, including search:resolve(), jsearch’s where clause and many others. All of these function names end in “-query”. If you see a cts function whose name ends in “-query”, you can be assured that it’s a cts:query constructor.
Query constructors can be categorized into different kinds. I’m going to call them leaf, composite, and “special” (for lack of a better word).
The composite query constructors build up new queries from other queries, whether leaf queries or other composite queries. Here they are broken down into a few sub-categories:
Category | Composite query constructors |
---|---|
Logical composition | cts:and-query, cts:and-not-query, cts:or-query, cts:not-query, cts:not-in-query |
Element/Property scoping | cts:element-query, cts:json-property-scope-query |
Fragment scoping | cts:document-fragment-query, cts:locks-fragment-query, cts:properties-fragment-query |
Special queries | cts:boost-query, cts:near-query |
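As a rough illustration of how composite constructors nest, the sketch below combines a few of them. The collection name, element name, and search terms are all hypothetical.

```xquery
xquery version "1.0-ml";

(: Assumed names: the "articles" collection and the <title> element are hypothetical. :)
let $query :=
  cts:and-query((
    cts:collection-query("articles"),
    (: scope the inner or-query to content inside <title> :)
    cts:element-query(xs:QName("title"),
      cts:or-query((
        cts:word-query("search"),
        cts:word-query("index")
      ))),
    (: exclude documents that mention "deprecated" :)
    cts:not-query(cts:word-query("deprecated"))
  ))
return cts:search(fn:doc(), $query)
```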
The leaf query constructors are for queries that can stand on their own, i.e. can be constructed without the help of another query constructor. The following list breaks them down into several categories, depending on what the query searches for (collection URIs, directories, words, values, etc.). I’ve marked some of the text with bold type to draw attention to the consistent naming conventions.
Object being searched | Leaf query constructors |
---|---|
collection URIs | cts:collection-query |
document URIs | cts:document-query |
directories | cts:directory-query |
words | cts:element-attribute-word-query, cts:element-word-query, cts:field-word-query, cts:json-property-word-query, cts:word-query |
values | cts:element-attribute-value-query, cts:element-value-query, cts:field-value-query, cts:json-property-value-query |
range index | cts:element-attribute-range-query, cts:element-range-query, cts:field-range-query, cts:json-property-range-query, cts:path-range-query, cts:period-range-query, cts:range-query, cts:triple-range-query |
geospatial | cts:element-attribute-pair-geospatial-query, cts:element-child-geospatial-query, cts:element-geospatial-query, cts:element-pair-geospatial-query, cts:geospatial-region-query, cts:json-property-child-geospatial-query, cts:json-property-geospatial-query, cts:json-property-pair-geospatial-query, cts:path-geospatial-query |
timestamp | cts:after-query, cts:before-query, cts:lsqt-query, cts:period-compare-query |
boolean | cts:false-query, cts:true-query |
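Words and values differ in how they compare content against the search: a word query matches individual words within the text, whereas a value query must match the entire value. A JSON document containing {"Text": "some content"} will match cts:word-query("some") but not cts:json-property-value-query("Text", "some").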
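You can try the word-versus-value distinction on an in-memory node with cts:contains. This is a small sketch, not a benchmark; the object node simply mirrors the JSON example above.

```xquery
xquery version "1.0-ml";

(: In-memory JSON node mirroring the example document. :)
let $doc := object-node { "Text": "some content" }
return (
  (: word query: looks for the word anywhere in the text :)
  cts:contains($doc, cts:word-query("some")),
  (: value query on a partial value: not expected to match :)
  cts:contains($doc, cts:json-property-value-query("Text", "some")),
  (: value query on the full value: expected to match :)
  cts:contains($doc, cts:json-property-value-query("Text", "some content"))
)
```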
Another thing worth noticing about the word, value, and range queries above is that they have consistent ways of scoping queries: by element, by attribute, by JSON property, or by field. So we see a function for each pairing of scope (element, attribute, JSON property, or field) and object (word, value, or range). We’ll see something similar with the lexicon functions. Stay tuned.
This scoping applies to filtered search, i.e. we expect element-***-query to return only XML documents, while json-property-***-query returns only JSON documents. For unfiltered search, element-***-query and json-property-***-query will return both JSON and XML documents that match the query. Of course, this does not apply to element-attribute-***-query, since there is no such thing as an attribute in JSON documents.
While the functions below each return a cts:query value, they don’t really fall into the above (leaf vs. composite) categories:
Function | Description |
---|---|
cts:query | constructs a cts:query from its XML representation |
cts:registered-query | returns a previously registered query (using cts:register) |
cts:reverse-query | returns a reverse query (for finding stored queries given a document, rather than stored documents given a query) |
cts:similar-query | returns a query matching nodes similar to the given model nodes |
cts:parse | converts a search string to an equivalent cts:query using a defined grammar. |
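Two of these pair up nicely. The sketch below parses a search string into a query, serializes it to its XML representation, and rebuilds it with cts:query. The search string is made up, and the `<wrapper>` element is just a throwaway container used to get at the XML form.

```xquery
xquery version "1.0-ml";

(: Hypothetical search string using the default grammar. :)
let $parsed := cts:parse("cat AND dog")
(: Placing a cts:query inside an element yields its XML representation. :)
let $as-xml := <wrapper>{$parsed}</wrapper>/*
return (
  $as-xml,
  (: cts:query turns the XML representation back into a cts:query value :)
  cts:query($as-xml)
)
```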
Okay, only 332 functions to go. (I promise the pace will pick up soon.)
The query accessor functions aren’t very interesting at all, and there are 168 of them! They’re accessors for the various components of a cts:query value. You can recognize them using this failsafe technique: if you see a cts function whose name includes the string “-query-”, then it’s just an accessor. An example would be cts:word-query and its three accessors: cts:word-query-options, cts:word-query-text, and cts:word-query-weight. See a pattern?
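For what it’s worth, here is the pattern in action. The text, option, and weight values are arbitrary.

```xquery
xquery version "1.0-ml";

(: Arbitrary example values for text, options, and weight. :)
let $q := cts:word-query("hello", ("case-sensitive"), 2.0)
return (
  cts:word-query-text($q),     (: the text passed to the constructor :)
  cts:word-query-options($q),  (: the options in effect for the query :)
  cts:word-query-weight($q)    (: the weight passed to the constructor :)
)
```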
Lexicon functions are much more interesting. Whereas cts queries are about efficiently finding documents, lexicon functions are about efficiently retrieving unique values (or words or URIs, etc.) from across a potentially large number of documents. They all require a particular index setting to be enabled. For “search,” think cts:search. For “analytics,” think lexicon functions.
Below are the 24 non-geospatial lexicon and lexicon wildcard functions grouped by lexicon type. Note the consistent naming conventions (at the end of the function names).
Aggregate Function | Wildcard function | Source |
---|---|---|
cts:uris | cts:uri-match | URI lexicon |
cts:collections | cts:collection-match | Collection lexicon |
cts:words | cts:word-match | Word lexicon |
cts:element-words | cts:element-word-match | Element word lexicon |
cts:element-attribute-words | cts:element-attribute-word-match | Attribute word lexicon |
cts:json-property-words | cts:json-property-word-match | Element word lexicon |
cts:field-words | cts:field-word-match | Field word lexicon (inside Fields) |
Lexicons are typically found at the database configuration page of the Admin UI, except for Field word lexicon as noted above.
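Here is a small sketch of the aggregate and wildcard flavors side by side. It assumes the URI lexicon and the word lexicon are enabled on the database; the directory and the patterns are hypothetical.

```xquery
xquery version "1.0-ml";

(
  (: first 10 URIs in the URI lexicon, starting from the beginning :)
  cts:uris((), ("limit=10")),
  (: URIs matching a wildcard pattern; "/articles/" is a made-up directory :)
  cts:uri-match("/articles/*.xml", ("limit=10")),
  (: words from the word lexicon, starting at "m" :)
  cts:words("m", ("limit=10")),
  (: words matching a wildcard pattern :)
  cts:word-match("mark*")
)
```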
Aggregate Function | Wildcard function | Source |
---|---|---|
cts:values | cts:value-match | Range index |
cts:element-values | cts:element-value-match | Element range index |
cts:element-attribute-values | cts:element-attribute-value-match | Attribute range index |
cts:field-values | cts:field-value-match | Field range index |
cts:value-ranges | | Range index |
cts:element-value-ranges | | Element range index |
cts:element-attribute-value-ranges | | Attribute range index |
cts:field-value-ranges | | Field range index |
cts:value-co-occurrences | | Range index |
cts:element-value-co-occurrences | | Element range index |
cts:element-attribute-value-co-occurrences | | Attribute range index |
cts:field-value-co-occurrences | | Field range index |
cts:value-tuples | | Range index |
cts:triples | | Triples range index |
The range index above is a combination of element, attribute and field range index. “Range index” also includes the collection and URI lexicons. Indexes are found on the left-hand side of the Admin UI when you click on a database (Configure >> Databases >> {database name} >> *** Index). These functions can be used to generate aggregate reports.
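A typical “aggregate report” might look like the sketch below. It assumes a string element range index on a hypothetical `author` element and a hypothetical `articles` collection.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on the hypothetical <author> element. :)
for $author in cts:element-values(
                 xs:QName("author"),
                 (),                                 (: no start value :)
                 ("frequency-order", "limit=10"),    (: most frequent values first :)
                 cts:collection-query("articles"))   (: restrict to matching documents :)
return fn:concat($author, ": ", cts:frequency($author))
```

Passing a query as the fourth argument is what makes lexicon calls useful for faceting: the values come only from fragments that match the query.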
Aggregate Function | Wildcard function | Shape |
---|---|---|
cts:element-geospatial-values | cts:element-geospatial-value-match | Points |
cts:element-child-geospatial-values | cts:element-child-geospatial-value-match | Points |
cts:element-pair-geospatial-values | cts:element-pair-geospatial-value-match | Points |
cts:element-attribute-pair-geospatial-values | cts:element-attribute-pair-geospatial-value-match | Points |
cts:geospatial-co-occurrences | | Point pairs |
cts:element-value-geospatial-co-occurrences | | Point pairs |
cts:element-attribute-value-geospatial-co-occurrences | | Point pairs |
cts:geospatial-boxes | | Boxes |
cts:element-geospatial-boxes | | Boxes |
cts:element-pair-geospatial-boxes | | Boxes |
cts:element-child-geospatial-boxes | | Boxes |
cts:element-attribute-pair-geospatial-boxes | | Boxes |
cts:match-regions | | Polygon |
These functions require the corresponding geospatial index (element, element pair, element child, or element attribute pair). Which of these you use depends on how you chose to represent geospatial coordinates in your data.
These are functions that will perform the mathematical computations for you.
cts:aggregate | cts:linear-model | cts:rank* |
cts:correlation | cts:max | cts:stddev |
cts:avg-aggregate | cts:median* | cts:stddev-p |
cts:covariance | cts:min | cts:sum-aggregate |
cts:covariance-p | cts:percent-rank* | cts:variance |
cts:count-aggregate | cts:percentile* | cts:variance-p |
cts:triple-value-statistics |
*These functions take in a sequence (or an array) of values. The rest of the functions require a range index or collation.
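The two styles look like this in practice. This is a sketch that assumes a numeric element range index on a hypothetical `price` element; the literal sequence passed to cts:median is arbitrary.

```xquery
xquery version "1.0-ml";

(: Assumes a numeric element range index on the hypothetical <price> element. :)
let $price-ref := cts:element-reference(xs:QName("price"))
return (
  (: index-based aggregates: computed from the range index :)
  cts:avg-aggregate($price-ref),
  cts:max($price-ref),
  (: sequence-based function: works on the values you pass in :)
  cts:median((3, 1, 4, 1, 5))
)
```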
This category contains only one function: cts:frequency.
Reference Function | Target |
---|---|
cts:uri-reference | URI lexicon |
cts:collection-reference | Collection lexicon |
cts:element-reference | Element range index |
cts:json-property-reference | Element range index |
cts:element-attribute-reference | Attribute range index |
cts:field-reference | Field range index |
cts:path-reference | Path range index |
cts:geospatial-element-reference | Geospatial element point range index |
cts:geospatial-json-property-reference | Geospatial element point range index |
cts:geospatial-attribute-pair-reference | Geospatial element attribute point range index |
cts:geospatial-element-child-reference | Geospatial element child point range index |
cts:geospatial-json-property-child-reference | Geospatial element child point range index |
cts:geospatial-element-pair-reference | Geospatial element pair point range index |
cts:geospatial-json-property-pair-reference | Geospatial element pair point range index |
cts:geospatial-path-reference | Geospatial path point range index |
cts:geospatial-region-path-reference | Geospatial region range index |
cts:reference-parse | Any index represented by the XML to be parsed. |
These functions are often used with the String and Scalar type-specific lexicon functions, as mentioned in the previous section.
cts:reference-collation | cts:reference-nullable |
cts:reference-coordinate-system | cts:reference-scalar-type |
Shape | Accessors |
---|---|
cts:point | cts:point-latitude, cts:point-longitude |
cts:linestring | cts:linestring-vertices |
cts:circle | cts:circle-center, cts:circle-radius |
cts:box | cts:box-east, cts:box-north, cts:box-south, cts:box-west |
cts:polygon | cts:polygon-vertices |
cts:complex-polygon | cts:complex-polygon-inner, cts:complex-polygon-outer |
Note that functions like cts:***-intersects and cts:***-contains are now deprecated. Switch to the geo library.
Most commonly, you use these shapes to construct geospatial queries. So first you construct a cts:region (using one or more of the above constructor functions). Then, you construct a geospatial cts:query (using a geospatial query function such as cts:element-geospatial-query), passing it the cts:region(s) you constructed. Finally, you pass the query to cts:search to run a geospatial search, or to a lexicon function to perform some geospatial-related analytics.
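Those three steps can be sketched as follows. The `location` element and the coordinates are hypothetical, and the radius is interpreted in the default units (miles).

```xquery
xquery version "1.0-ml";

(: Step 1: construct a region. Coordinates and radius are made up. :)
let $region := cts:circle(10, cts:point(37.5, -122.3))
(: Step 2: construct a geospatial query; <location> is a hypothetical element holding points. :)
let $query := cts:element-geospatial-query(xs:QName("location"), $region)
(: Step 3: run the search. :)
return cts:search(fn:doc(), $query)
```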
These constructors are typically used to specify which document information to use to “pre-sort” the response of cts:search, jsearch, and search:search.
Constructor | Sorted by |
---|---|
cts:index-order | Sort based on range-index |
cts:document-order | Sort based on the hash of the document URI |
cts:quality-order | Sort based on document quality |
cts:score-order | Sort based on search score. Affected by document quality and document frequency |
cts:fitness-order | Sort based on fitness. Not affected by document quality nor by document frequency |
cts:confidence-order | Sort by confidence. Not affected by document quality |
cts:unordered | #iDon’tCare |
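As a sketch of how an order specification is passed to cts:search, the example below sorts by a range index. It assumes an element range index on a hypothetical `published` element; the search term is arbitrary.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on the hypothetical <published> element. :)
let $order := cts:index-order(
                cts:element-reference(xs:QName("published")),
                "descending")
return
  (: order specifications go in the options argument of cts:search :)
  cts:search(fn:doc(), cts:word-query("marklogic"), $order)[1 to 10]
```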
The result of a call to cts:search() is a sequence of nodes that reside in your database. But these node references also contain some special properties (six, to be precise) that extend beyond the XPath data model. They’re very handy for building search applications since they relate to things like search relevance:
Function | Purpose |
---|---|
cts:score | log(term frequency) * (inverse document frequency) + (QualityWeight * Quality) |
cts:quality | Document quality |
cts:confidence | Score without the effect of document quality |
cts:fitness | Confidence without the effect of document frequency |
cts:relevance-info | Relevance score |
cts:remainder | Estimate of the remaining fragments to process. |
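Here is a quick sketch of reading a few of these properties off search results. The search term is arbitrary and the `<hit>` element is just a made-up wrapper for display.

```xquery
xquery version "1.0-ml";

for $hit in cts:search(fn:doc(), cts:word-query("marklogic"))[1 to 5]
return
  (: <hit> is a hypothetical wrapper element, used here only to show the values :)
  <hit uri="{xdmp:node-uri($hit)}"
       score="{cts:score($hit)}"
       confidence="{cts:confidence($hit)}"
       fitness="{cts:fitness($hit)}"
       quality="{cts:quality($hit)}"/>
```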
“Miscellaneous” is a popular category in my family’s monthly budget, but I digress. I’ll try to break down these last remaining functions into some sub-categories:
Category | Functions |
---|---|
Parsing/tokenization | cts:stem, cts:tokenize, cts:part-of-speech, cts:distinctive-terms |
Registered query | cts:deregister, cts:register |
Classifier | cts:classify, cts:thresholds, cts:train |
Temporal | cts:period, cts:period-compare |
Clustering | cts:cluster |
Entity Services | cts:entity, cts:entity-dictionary, cts:entity-dictionary-parse, cts:entity-highlight |
Result node manipulation | cts:element-walk, cts:highlight |
XPath validation | cts:valid-document-patch-path, cts:valid-extract-path, cts:valid-index-path, cts:valid-optic-path, cts:valid-tde-context |
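A few of these are easy to try on in-memory values. The sketch below uses arbitrary sample text, and the `<b>` wrapper in the highlight call is just one possible choice of markup.

```xquery
xquery version "1.0-ml";

(
  (: split text into word, punctuation, and space tokens :)
  cts:tokenize("Hello, world"),
  (: stem(s) of a word in the default language :)
  cts:stem("running"),
  (: wrap each match in <b>; $cts:text holds the matched text :)
  cts:highlight(<p>MarkLogic search is fun</p>,
                cts:word-query("search"),
                <b>{$cts:text}</b>)
)
```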
I’m not going to explain these (or fall on any swords defending their categorization). The important thing is that the cts API looks a lot less overwhelming to you now, right? There’s a hidden wisdom to it all—an underlying logic, a latent brilliance, a method to the madness…sorry, got a little carried away there.
Conclusion
Congratulations, you made it through the whole tour! As a reward, here’s a little code to look at. It’s the query I ran to generate the data for the Wordle shown at the beginning of the article. And, yes, it does use the cts API:
    for $func-name in cts:element-attribute-values(xs:QName("function"), xs:QName("fullname"))
    where starts-with($func-name, "cts:")
    return concat($func-name, ":", xdmp:estimate(cts:search(collection(), $func-name)))
And if you’re thinking to yourself that I must have a range index enabled on my database since I’m calling a value lexicon, you’re right. Well done.