I’ve been itching to write up a post about the NoSQL (“not only SQL”) category of technologies because there’s such a dearth of practical and specific information on this topic, and because so many people are unclear about how MarkLogic compares to these technologies.
This post is targeted at folks who have already come to the realization that a relational database won’t meet their needs, and who are trying to figure out which of the available alternatives is a plausible option. It’s also meant to be a general tutorial for anyone curious about this emerging space.
The NoSQL “movement” was born from a growing recognition that there are certain types of data-based problems that are quite difficult, or inefficient, or impossible-within-practical-constraints to tackle using relational database technology. The most common factors behind these problems are unstructured data and scalability. Why?
The NoSQL technologies (I don’t include MarkLogic in this category) are not databases “in the traditional sense”, meaning none of them provide both transactional integrity and real-time results, and some of them provide neither.
Each resulted from an effort to alleviate specific limitations in the RDBMS world that were preventing its architects from completing a specific type of task, and they all made trade-offs to get there. Whether the target was “hot row” lock contention, horizontal scale, sparse data performance problems, or the rigidity induced by a single schema, they are much more narrowly focused in purpose than their RDBMS predecessors. They weren’t designed to be enterprise platforms for building ACID-transactional, real-time applications.
Data consistency is the most prominent trade-off NoSQL technologies make in order to achieve horizontal scale, so it’s worth understanding in general terms why that is and what the ramifications are. Cache-based NoSQL options (such as memcached) have an obvious data consistency challenge, so I’ll focus on the issues with persistent stores here.
The concept is relatively straightforward: figure out how to partition your data so that it is distributed as evenly as possible across a cluster of commodity servers. This requires some upfront understanding of the kinds of questions you’ll be asking, so that the results aren’t all clumped together within the cluster. (If you simply rely on the default consistent hash algorithm provided by some of the NoSQL technologies, you won’t be fully optimized, because that technique is not content-aware.) If the questions change, repartitioning might be needed to maintain performance. To avoid “hot spots” (which introduce performance problems based on lock contention), multiple copies of the data exist within the cluster. With some NoSQL options, joins are not possible, so data cannot be normalized, and again you end up with multiple copies of data. Either way, you have a data consistency issue, because when updates occur, some of the nodes in the cluster will have stale data for some window of time. This is known as “eventual consistency”, a model used by Amazon Dynamo and copied by many. In short, the store is treated as “always” writeable, with the complexity of potential conflict resolution left to the read operation. None of the NoSQL options has anything beyond rudimentary conflict resolution functionality, so this is left as an exercise for the application and business-process layers.
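To make the stale-data window concrete, here is a minimal Python sketch of a key-partitioned, replicated store. It is a toy model, not the implementation of Dynamo or of any particular NoSQL product: the `Node` and `Cluster` classes, the simple modulo placement (a stand-in for a real consistent hash ring), the explicit version numbers, and the "highest version wins" rule applied at read time are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """One server in the cluster; stores key -> (version, value)."""
    name: str
    data: dict = field(default_factory=dict)


class Cluster:
    """Toy key-partitioned store with lazy replication (illustrative only)."""

    def __init__(self, node_names, replicas=2):
        self.nodes = [Node(n) for n in node_names]
        self.replicas = replicas
        self.pending = []  # replication messages not yet delivered

    def owners(self, key):
        # Placement depends only on a hash of the key, never on the content,
        # which is why this kind of partitioning is not content-aware.
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        start = h % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def write(self, key, value, version):
        # The write is accepted as soon as the primary copy is updated;
        # the other copies are only queued for a later update.
        primary, *others = self.owners(key)
        primary.data[key] = (version, value)
        for node in others:
            self.pending.append((node, key, version, value))

    def replicate(self):
        # Deliver queued updates; until this runs, replicas hold stale data.
        for node, key, version, value in self.pending:
            current = node.data.get(key)
            if current is None or current[0] < version:
                node.data[key] = (version, value)
        self.pending.clear()

    def read(self, key):
        # Conflict resolution happens at read time: collect every copy
        # and keep the one with the highest version number.
        copies = [n.data[key] for n in self.owners(key) if key in n.data]
        if not copies:
            return None
        return max(copies, key=lambda c: c[0])[1]
```

Writing a key twice and inspecting the replica between the second `write` and the next `replicate` call shows the old value still sitting on one node. A read that happened to hit only that node would see stale data, and anything smarter than "highest version wins" would have to live in the application.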
Obviously, I cannot provide exhaustive descriptions of each NoSQL subcategory here, but since many people expect NoSQL options to be “databases”, I’ve tried to highlight those characteristics that might be least expected.
You’ll notice that full-text search capability isn’t a strong suit of NoSQL technologies. One basic reason is that most of them partition the data across a cluster of commodity servers based on a key used for retrieval, and don’t maintain global indexes. To conduct a full-text search against a cluster, you have to run the search on every node in the cluster. As a result, throughput is limited by the slowest machine in the cluster, not the size of the cluster. Moreover, if the target of your full-text search query is not well aligned with your partitioning, an I/O bottleneck is introduced whenever data needs to be copied to different locations to facilitate computations such as joins.
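The fan-out described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `SearchNode`, `index_document`, and `scatter_gather` are hypothetical names, each node keeps only a local inverted index, documents are placed by a hash of their key, and node response times are simulated numbers rather than measurements.

```python
import hashlib
from collections import defaultdict


class SearchNode:
    """A cluster node holding a local inverted index (no global index)."""

    def __init__(self, name, latency_ms):
        self.name = name
        self.latency_ms = latency_ms    # simulated response time
        self.index = defaultdict(set)   # term -> ids of documents on this node

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return set(self.index.get(term.lower(), set()))


def index_document(nodes, doc_id, text):
    # Placement is driven by the retrieval key, not by the words in the text,
    # so a given term can end up on any node.
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    nodes[h % len(nodes)].add(doc_id, text)


def scatter_gather(nodes, term):
    """Query every node and merge the hits.

    Returns the merged hits and the simulated response time, which is
    that of the slowest node because the caller must wait for all of them.
    """
    hits = set()
    slowest_ms = 0
    for node in nodes:
        hits |= node.search(term)
        slowest_ms = max(slowest_ms, node.latency_ms)
    return hits, slowest_ms
```

Because no node knows which other nodes hold a given term, `scatter_gather` cannot skip any of them. In this model, adding more fast nodes does nothing for a query while one slow node remains in the cluster.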
The NoSQL crowd expects folks to rely on full-text search engines such as Lucene, Solr, or Sphinx. But that’s not ideal either. You may have turned to NoSQL for its horizontal scale capabilities, but having to scale two point solutions at once is not trivial. Full-text search engines have other drawbacks as well:
…to name a few.
The last thing for consideration when surveying alternatives to an RDBMS solution is what I’ll call “enterprise worthiness”.
OK, finally, how does MarkLogic compare? I’ll try to be as brief as possible, since you’ve found your way to the MarkLogic website already, where there are plenty of materials to provide details. The important takeaways are that, unlike NoSQL technologies, MarkLogic is proven to scale horizontally on commodity hardware up to petabyte range:
MarkLogic wasn’t built to solve a specialized problem. It was architected from the ground up to be an enterprise-class platform for building applications at any scale that rely on sophisticated real-time search and retrieval functionality. If you’re looking for something that is as reliable as your trusty RDBMS, but is better suited for unstructured data and horizontal scale, then MarkLogic is the first place to look.
So if MarkLogic is not really suitably grouped with the NoSQL technologies, where does it fit? It’s in a next-generation database class of its own. Here’s how I see it:
E.F. Codd’s vision of the relational database was revolutionary because it separated database organization from the physical storage. Rather than worrying about data structures and retrieval procedures, developers could simply employ declarative methods for specifying data and queries. Oracle capitalized on this model and built a business essentially selling agility.
Christopher Lindblad’s vision of the unstructured database is revolutionary because it separates data ingestion from data organization, and because it combines search and retrieval. Rather than worrying about schema (or data partition) maintenance, developers can simply employ expressive methods for searching and retrieving information in real-time. MarkLogic is building a business essentially selling agility.