We all have an Uncle Mort – that pragmatic, no bull, calls-it-like-he-sees-it type of guy. Today, I’m going to divulge a secret about how to figure out if you need a document database or not. I’ll start with some technical background, and then we will have a sit down with Uncle Mort!
Let’s just review a few things. NoSQL once meant exactly that, “No SQL”: a category of new databases that did not store rows and columns and were not accessed using SQL. Back then, databases meant SQL, so it was quite a shock that the biggest companies (Facebook, Google, Twitter, LinkedIn) were not using SQL databases, because those databases could not scale to handle tons and tons of “unstructured” data. Now NoSQL has been re-branded to mean “Not Only SQL” since MarkLogic and some other vendors have added SQL views on top of their fundamentally non-relational data stores.
Here are a few types of NoSQL databases with very brief descriptions:
- Key/Value Stores: These are simple, fast technologies that allow large chunks of data to be stored and retrieved by key, but they handle data items as black boxes, with no ability to query or see inside the data.
- Triple Stores and Graph Stores: Triple stores store large networks of individual facts rather than larger, more complex documents or “values,” and these facts are often represented using the RDF standard. Facts are easy to combine and easy to feed to AI programs or inference engines to derive more facts. This inference capability is sometimes referred to as semantics. Computers work well with individual facts and with walking the links between them, which is what makes the model effective. The downside is that humans need well-structured, grouped data, so a lot of joining is needed to present data efficiently to the end user. (Note that MarkLogic has a built-in triple store, but augments it with document storage to solve this problem.)
- Document Databases: Document Databases are the big brother to Key/Value stores. They store larger items (documents) that each have a unique key, but also provide the flexibility to query based on the content inside those documents. Document Databases can store shipping invoices, bills, medical records, insurance applications, financial transactions and the like. A Key/Value store can store these things too, but can’t see the contents inside them or query them meaningfully. In a Document Database, the document (the value in a key/value store) is no longer a black box, and can be queried along with all its structure and content, much as you’d expect of a database.
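To make the black-box distinction concrete, here is a minimal in-memory sketch in Python. It is not how MarkLogic or any real product works; the class names (`KeyValueStore`, `DocumentStore`), the `find` method and the sample patient records are all hypothetical, invented purely for illustration. It only shows the difference in what each kind of store lets you ask.

```python
import json

# Hypothetical patient records, invented for illustration only.
records = {
    "visit-001": {"patient": "A. Smith", "type": "visit",
                  "blood_pressure": {"systolic": 150, "diastolic": 95}},
    "visit-002": {"patient": "B. Jones", "type": "visit",
                  "blood_pressure": {"systolic": 118, "diastolic": 76}},
    "rx-001": {"patient": "A. Smith", "type": "prescription",
               "medication": "lisinopril"},
}


class KeyValueStore:
    """Stores opaque bytes by key; it has no way to look inside a value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key) -> bytes:
        return self._data[key]


class DocumentStore:
    """Stores structured documents by key and can query their contents."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc: dict):
        self._docs[key] = doc

    def get(self, key) -> dict:
        return self._docs[key]

    def find(self, predicate):
        """Return the sorted keys of documents for which predicate(doc) is true.

        This sketch scans every document; a real document database would
        normally answer such queries from indexes instead.
        """
        return sorted(k for k, d in self._docs.items() if predicate(d))


kv = KeyValueStore()
docs = DocumentStore()
for key, doc in records.items():
    kv.put(key, json.dumps(doc).encode("utf-8"))  # serialized blob
    docs.put(key, doc)                            # structure preserved

# Key/value store: the only question it can answer is "what is stored at this key?"
blob = kv.get("visit-001")

# Document store: questions about the content inside the documents.
high_bp = docs.find(
    lambda d: d.get("blood_pressure", {}).get("systolic", 0) >= 140
)
smith = docs.find(lambda d: d["patient"] == "A. Smith")

print(high_bp)  # ['visit-001']
print(smith)    # ['rx-001', 'visit-001']
```

With the key/value store, finding every patient with high blood pressure would mean fetching and decoding every blob in application code. The document store can answer the question itself because it knows the structure of what it holds.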
So a Document Database is more powerful than a key/value store, and gives the ability to see into the records and query all the fields and structure, much like a traditional Relational Database. This raises the question: When do you need a Document Database rather than a Relational Database?
Uncle Mort Weighs In on Document Databases
Let’s imagine what may happen if you ask your professor of Computer Science, and then your CTO what kind of data storage system you need: Triple Store, Key/Value Store or Document Database. You may get complicated – even cryptic – answers:
You: “Joan – do I need a triple store to track my patients’ health records? I have all sorts of test results and blood pressure readings, medications, medical problems and visits.”
Joan the Professor of Computer Science: “Oh, yes! Triple stores provide semantic interoperability by categorizing and equating disparate facts according to an OWL ontology. That will allow everyone to know that ‘Blood Pressure’ actually means LOINC 55284‑4.”
You: “Actually, we already know what Blood Pressure means.”
Joan: “Ok, right. Still, it would be nice to federate your SPARQL requests across other enterprise triple stores, and they are great for that.”
You: “Hmmm. Not so much. All my data legally has to be sent as a ‘Continuity of Care Document’ rather than a Federal Triple.”
Joan: “Federated! But I see your point. If you have Continuity of Care Documents and need to store and send them, use a Document Database.”
So your imaginary CS Professor can help you out — but it took a while to get a straightforward answer. Let’s ask your CTO about key/value stores. He may have more of an idea of what you really need.
You: “Pankaj – can I use a key/value store to track my patients’ health records? I have all sorts of test results and blood pressure readings, medications, medical problems and visits.”
Pankaj the CTO: “Oh, yes! Key/value stores can store a huge amount of data quickly without all that pesky indexing and transaction management. You may lose some, but we can work around that.”
You: “Some of it is very important. I’d rather not lose it.”
Pankaj: “Well, the CAP Theorem dictates you need to lose some data to store it fast! A seeming paradox, I know! I can get a system to keep all the data by backing a new key/value store with a separate relational database. But that relational database means lots of data modeling, so I’ll need some budget to stand it up – I can get you a 3 year staffing plan by next week.”
You: “I don’t want to pay to integrate things together and I don’t want to wait 3 years. Why can’t I store my data fast, keep all my data with ZERO loss and query it too?”
Pankaj: “If you want something that’s cheaper to stand up, never loses data and lets you query it, you need Enterprise NoSQL. We’ll have to use MarkLogic because it has ACID transactions and is the only Enterprise NoSQL Database.”
You: “No acid needed, Pankaj. I just want it quick and I can’t lose it.”
Pankaj: “Actually, ACID means that you don’t lose the data. If you want ACID and you have documents, that pretty much narrows it down to MarkLogic.”
You: “But I was hoping to use something open source. They’re ‘Free like Beer’, right?”
Pankaj: “Ha! No. They are ‘Free like a puppy.’ But that’s irrelevant – you need ACID transactions and Enterprise NoSQL, so that’s MarkLogic.”
Interestingly, Pankaj the CTO came around to Document Databases (and MarkLogic in particular) for your use case, but only after exploring a complex integration and data modeling approach first. Let’s change tack and ask Uncle Mort. Almost everyone has an Uncle Mort, who is retired now but handled all these sorts of records before they were computerized and was always the smartest guy in the room.
You: “Uncle Mort – how are you! It’s good to see you again. I actually have a business question for you today. Do I need a Document Database to store and query documents? I have all sorts of test results and blood pressure readings, medications, problems and visits.”
Uncle Mort: “What the hell is a Document Database?”
You: “It stores documents and then you can query for them based on the information inside.”
Uncle Mort: “Ah. Let’s see. I used to keep test results in folders behind the receptionist’s desk up until I sold my practice in ’83. I would have loved to be able to find things in there. Test results are definitely documents – I got them faxed over from the lab back in the day. So they should be fine in your Document Database. So far it sounds just right – what was that other stuff?”
You: “Medications, problems, patient visits.”
Uncle Mort: “Sure. Medications are written on prescription slips. It’s a sort of small document so that should work. Each visit has a write-up that’s a document too. Notes from visits were included in those same file folders. Finding stuff in the notes was the worst – I had to skim through every note to find stuff in there and it was easy to miss critical bits that were written down once for a reason.”
You: “Well, there’s one Document Database with all the querying, but also a built-in search engine sort of like Google but for the documents.”
Uncle Mort: “I would have killed to Google my patient files.”
You: “Not actually killed anyone, I hope, but I get your drift. Thank you, Uncle Mort! One more question – I was thinking of putting all this information into an older database with rows and columns – you know, tabular like a spreadsheet.”
Uncle Mort: “You can’t put medical records into a spreadsheet – don’t be a mushdoob!”
You: “Thanks again! I’ll be back next week to ask if Insurance Applications and Billing Statements are documents too, or if they can go into rows and columns.”
Hmm. That was a lot easier to understand.
In reality, almost nobody asks or trusts their Uncle Mort about technology, but perhaps they should. When it comes to representing information, you can rely on hundreds or even thousands of years of human experience that indicates that documents are the most natural form for humans to structure, send and read information.
There’s still a place for computer scientists (full disclosure – I’m one) and tech gurus, but in this case documents are motivated by our existing, natural ways of handling information, so it’s better to understand the simple facts first. Trying to store these types of documents in a complex mass of linked relational tables is terribly unnatural.
It just doesn’t take a tech genius to figure that out … Or does it?
Damon Feldman
Damon is a passionate “Mark-Logician,” having been with the company for over 7 years as it has evolved into the company it is today. He has worked on or led some of the largest MarkLogic projects for customers ranging from the US Intelligence Community to HealthCare.gov to private insurance companies.
Prior to joining MarkLogic, Damon held positions spanning product development for multiple startups, founding of one startup, consulting for a semantic technology company, and leading the architecture for the IMSMA humanitarian landmine remediation and tracking system.
He holds a BA in Mathematics from the University of Chicago and a Ph.D. in Computer Science from Tulane University.