Semantics, Search, MarkLogic 11 and Beyond

March 14, 2023 Data & AI, MarkLogic

Following the release of MarkLogic 11 in December 2022, I sat down with John Snelson, head of the MarkLogic product architecture team, to get his thoughts on MarkLogic Semantics and Search in MarkLogic 11, and plans for MarkLogic 12.

Q: John, can you kick this off by introducing yourself?

Sure! I’m John Snelson; I lead the Product Architecture Team, which oversees the architecture of all the MarkLogic products. I also manage the MarkLogic Engineering teams for Search, Semantics, and SQL, and I’ve been managing those teams for about 10 years.

Q: So you’ve seen a lot of changes at MarkLogic. How do you see the changes in the last few years, with the change of ownership and the acquisition of Semaphore?

MarkLogic has a long history of introducing groundbreaking technology — we were NoSQL before NoSQL had a name! In the past couple of years, we’ve had an opportunity to re-energize and re-focus the company. Part of that was the acquisition of Semaphore and their market-leading Semaphore software — and of course, now we’re part of Progress.

Q: What was that like for MarkLogic Engineering?

Change can be hard. But the release of MarkLogic 11 this past December shows we’re maintaining a regular cycle of major MarkLogic Server releases.

It’s got some important new features and enhancements, but it’s also heavy on stabilization. We’re already building on that, for a blockbuster MarkLogic 12.

Q: And the Semaphore acquisition — what was that like for Engineering?

Well, we’ve worked with Semaphore, and a handful of other Semantics companies, as partners for many years. We have a lot of customers who use MarkLogic Server and the Semaphore software together, very successfully.

The technology is obviously complementary:

  • MarkLogic Server is all about managing and analyzing multi-model data (documents, triples, tables, geo) — the kind of rich data that’s important to global organizations.
  • The Semaphore software provides a world-class solution for ontology management, which complements MarkLogic’s Semantics and triples capabilities; and for metadata creation via Semantic AI, which strengthens any multi-model application. For example, FACTs extracts facts as triples from free-flowing text in documents.

Now we, in Engineering, get to work even more closely with the Semaphore technologies. We’ve already identified a handful of areas where we can improve integration between the two products. Some of that is happening in MarkLogic 11. For example, in Semaphore 5.6 (released alongside MarkLogic 11), when a document is submitted for classification, the Classification Server checks whether it is in a given MarkLogic database, adds it if it’s not there, and saves or updates the classification metadata. Look for even more integration in the next couple of release cycles.

Q: What about the merging of people — how’s that going?

It’s going very well indeed! Both sides include some highly experienced professionals, recognized experts in their fields. The Semaphore folks are very smart and easy to get along with, and they are merging well with the MarkLogic teams. For example, Matthieu Jonglez of Semaphore has taken over the Product Management team, and he’s doing a great job.

Q: Sounds great! Let’s get down to some of the details of MarkLogic 11.

You can read more about MarkLogic 11 in Matthieu’s blog, Introducing MarkLogic 11, and of course in the MarkLogic 11 release notes.

At a high level, MarkLogic 11 represents a push for more predictable, stable major release cycles. We’re also looking at increasing the certainty around our support coverage. Our customers are governments and global corporations, and they need this kind of predictability and certainty. MarkLogic 11 is also easier to manage, and even more scalable.

As part of a manageability push, the Admin UI has been revamped. This has been in the cards for some time — look for more improvements in coming releases.

MarkLogic 11 also includes storage failure (not just node failure) detection and failover, and better healthcheck monitoring.

Q: So there’s plenty in MarkLogic 11 for admins. What about query writers?

Well, at the heart of MarkLogic’s multi-model capabilities is the ability to not only query each model, but to query across models in the same query, easily and efficiently. So you can ask questions like “show me all employment contracts (documents) that mention people who earn more than $100k (tables) who work for a pharmaceutical company (triples) that’s based within 500 miles of my office (Geo).”

So how do you create that query? You could extend SPARQL to include Geo (GeoSPARQL), or extend SQL to include Search, and so on. But what you really want is a single query language to query across all models in the same breath, as it were. Enter MarkLogic’s Optic query language! Optic was introduced in MarkLogic 9 as a single language that can query across multiple data models, and it’s been extended in MarkLogic 11 to include geo queries.
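To make the cross-model question above concrete, here is a toy sketch in plain Python. It is not Optic and does not use any MarkLogic API; the data, the names (`documents`, `salaries`, `triples`, `locations`, `contracts_near`) and the coordinates are all hypothetical. It only illustrates how one question can touch documents, rows, triples and geo data in a single pass.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical toy data, one structure per data model.
documents = [  # documents: contracts and the people they mention
    {"uri": "/contracts/1.json", "type": "employment-contract", "mentions": ["alice"]},
    {"uri": "/contracts/2.json", "type": "employment-contract", "mentions": ["bob"]},
    {"uri": "/contracts/3.json", "type": "employment-contract", "mentions": ["carol"]},
    {"uri": "/memos/1.json", "type": "memo", "mentions": ["alice"]},
]
salaries = {"alice": 120_000, "bob": 90_000, "carol": 150_000}  # table rows
triples = [  # (subject, predicate, object)
    ("alice", "worksFor", "acme-pharma"),
    ("bob", "worksFor", "acme-pharma"),
    ("carol", "worksFor", "farco-pharma"),
    ("acme-pharma", "industry", "pharmaceutical"),
    ("farco-pharma", "industry", "pharmaceutical"),
]
locations = {"acme-pharma": (40.5, -74.5), "farco-pharma": (34.0, -118.0)}  # geo points


def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))


def contracts_near(office, max_miles=500, min_salary=100_000):
    """Return URIs of employment contracts that satisfy all four conditions."""
    # Triples: which companies are pharmaceutical companies?
    pharma = {s for s, p, o in triples if p == "industry" and o == "pharmaceutical"}
    # Geo: which of those are within range of the office?
    nearby = {c for c in pharma if miles_between(office, locations[c]) <= max_miles}
    # Triples + table: who works for a nearby company and earns enough?
    people = {
        s for s, p, o in triples
        if p == "worksFor" and o in nearby and salaries.get(s, 0) > min_salary
    }
    # Documents: which employment contracts mention one of those people?
    return [
        d["uri"] for d in documents
        if d["type"] == "employment-contract" and people & set(d["mentions"])
    ]
```

Calling `contracts_near((40.0, -75.0))` on this toy data keeps only the contract that mentions a well-paid employee of the nearby company. A real multi-model engine would answer the same question from indexes rather than by scanning lists.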

MarkLogic 11 also includes support for GraphQL, which has grown popular as an alternative to REST as a simple way to send queries to a database. Of course with MarkLogic’s implementation, that can include multi-model data — made possible by Optic.

Q: Great! What else?

As with any major release, there are also some scalability and performance improvements — our customers’ data needs continue to grow every year.

For example, if you want to do a huge sort or join with Optic that can’t possibly fit in memory, it can now overflow to disk.
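As a rough illustration of what "overflow to disk" means for a sort, here is a minimal external merge sort in Python. It is a sketch of the general technique only, not a description of how MarkLogic implements it; the function name `external_sort` and the `max_in_memory` parameter are made up for this example, and it assumes the values are integers.

```python
import heapq
import tempfile


def external_sort(values, max_in_memory=1000):
    """Yield integers in sorted order, buffering at most max_in_memory before spilling."""
    runs = []   # temporary files, each holding one sorted run
    chunk = []  # in-memory buffer for the run being built

    def spill():
        # Sort the buffer, write it to a temporary file, and start a new buffer.
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= max_in_memory:
            spill()
    if chunk:
        spill()

    try:
        # Merge the sorted runs lazily; only one value per run is held at a time.
        yield from heapq.merge(*((int(line) for line in run) for run in runs))
    finally:
        for run in runs:
            run.close()
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3))` spills three runs and merges them back into one sorted sequence.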

We’ve added a negative cache for LDAP security, as well as HTTP compression and chunking for MarkLogic REST endpoints.
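For readers unfamiliar with the term, a negative cache remembers lookups that failed, so that repeated requests for something that does not exist do not hit the backend every time. The sketch below shows the general idea in Python; the class `NegativeCache`, its `ttl` parameter and the `lookup` callable are hypothetical, and nothing here reflects MarkLogic's actual LDAP implementation.

```python
import time


class NegativeCache:
    """Remember failed lookups for ttl seconds so they are not retried immediately."""

    def __init__(self, lookup, ttl=60.0, clock=time.monotonic):
        self._lookup = lookup  # hypothetical slow call, e.g. a directory query
        self._ttl = ttl
        self._clock = clock    # injectable so the behaviour can be tested
        self._misses = {}      # key -> time at which the cached miss expires

    def get(self, key):
        expiry = self._misses.get(key)
        if expiry is not None:
            if self._clock() < expiry:
                return None        # cached miss: skip the backend entirely
            del self._misses[key]  # the cached miss has expired
        result = self._lookup(key)
        if result is None:
            # Record the miss so the next request within ttl is answered locally.
            self._misses[key] = self._clock() + self._ttl
        return result
```

The time-to-live matters: if it is too long, a newly created user stays invisible; if it is too short, the cache saves little work.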

Q: So that’s a pretty substantial release! What’s next?

Looking forward to MarkLogic 12, we’re looking at closer integration with Semaphore technologies; more performance enhancements; and support for more Search and Semantics standards.

Many of the Semaphore integration improvements will be under-the-covers, invisible to users. You’ll just notice that apps are faster, easier, more scalable.

Some will be visible — for example, we’re looking at a very close integration of the SES (Semantic Enhancement Server) and the Semaphore Classification Server with a MarkLogic cluster. It’s too early to talk about how that might work, but watch this space.

On the search side, a couple of our big search customers have asked for support for the BM25 ranking algorithm, so we’re working on that. The Semaphore team brings experience there too — Mike Gatford, one of the authors of the TREC paper that introduced BM25, was a long-time Semaphore employee.
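For context, BM25 scores a document by summing, over the query terms, an inverse-document-frequency weight multiplied by a term-frequency factor that saturates and is normalized by document length. The Python sketch below shows one common formulation; the smoothed IDF variant and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration and say nothing about how MarkLogic will implement it. It expects a non-empty list of tokenized documents.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document; query and each doc are lists of tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms get a larger weight, never a negative one.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With a query of `["search"]`, a short document that repeats the term scores higher than a longer one that mentions it once, and a document without the term scores zero.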

For multi-model, we’re looking at ways to make TDEs easier to deploy, and more scalable. TDE (Template Driven Extraction) lets you project triples and tables out of documents automatically. Currently TDE projections are entirely indexed, but if you could have unindexed columns, or even entire Template projections that don’t involve an index, that would be a big deal. It would be much easier to modify a Template, or to try out a Template over a very large set of data; then you’d create an index when you’ve settled on something useful.

And on the Semantics front, we’re looking at built-in support for a handful of standards that our customers currently use outside of the Server. These include SHACL and RDF-Star.

We’re also considering index support for some of the more popular graph algorithms, such as (weighted) shortest path across a graph.
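As a reminder of what a weighted shortest path computation involves, here is a minimal Dijkstra sketch in Python over an in-memory adjacency map. It assumes non-negative edge weights and uses no index at all; the function name `shortest_path` and the graph layout are illustrative choices, not a MarkLogic API.

```python
import heapq


def shortest_path(graph, source, target):
    """Return (total_weight, path) for the cheapest path, or (inf, []) if unreachable.

    graph maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    best = {source: 0}     # cheapest known cost to reach each node
    previous = {}          # node -> predecessor on the cheapest known path
    queue = [(0, source)]  # min-heap of (cost so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            # Walk the predecessor links back to the source.
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf"), []
```

On a large graph stored as triples, each "expand the neighbors" step becomes a lookup against the store, which is why index support for this kind of traversal is attractive.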

As always, there’s more that we’d like to do than can possibly fit into a single release cycle, so we’re talking to customers, prospects, and industry stars to figure out which of these features would provide the most value. We’re always looking for input, especially on real-world use cases. Readers should contact us with ideas and use cases.

Q: Thanks John! Anything else you’re looking forward to this year?

Yes, I’m also looking forward to the pilot release of our revamped ‘software as a service’ cloud offering that will have both MarkLogic Server and Semaphore services in a managed environment. Look for an announcement on that very soon!

Q: Sounds great! Thanks again for sharing your insights.

Stephen Buxton

Stephen Buxton is the president of BTC, an independent consulting firm. Previously he was the Product Manager for Search and Semantics at MarkLogic, where he had been a member of the Product team since 2005.

Stephen is the co-author of "Querying XML" and a contributor to "Database Design", a book in Morgan Kaufmann's "Know It All" series.

Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.