Lee Pollington and I recently gave a talk at MarkLogic World about schema-on-read approaches to data management in the financial services industry. As part of preparing for that session, it became clear to me that this term means different things to different people, and that some clarification might be in order.
Some of the confusion around the term Schema on Read comes from comparing other technologies against relational databases, which need a rigid schema defined before they can ingest a single piece of data. Alternative technologies such as Hadoop and NoSQL have done away with this requirement, and as a result are more agile in ingesting any type of data. But each technology takes a different approach to structuring and transforming data, which inevitably becomes necessary to make use of the data stored in the system.
One approach is: forget about schema. Throw all the data into the Hadoop filesystem and be done with it! Or use a simple key-value datastore that doesn’t care much about the data the keys point to. This is a wonderful approach for quickly storing any type of data, but what happens when you actually need to make some sense of it? Aye, there’s the rub. If you’re only using Hadoop, you’re likely going to run some MapReduce jobs (or Pig, Hive, etc.) to create the structures that would make sense of the data.
Many people think of this as the meaning of schema on read. I see it as a step backwards, to the days when you wrote an application that read some data from a bunch of files, did some crunching, and produced a result. There are plenty of use cases where this approach is a great fit, but querying your data isn’t one of them – that’s what databases were invented for. And using a simple key-value store is not much better: you’d still have to do a lot of work on the way out just to make sense of the data, rather than making use of it.
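To make that notion concrete, here is a minimal Python sketch of this flavor of schema on read. The stored data is just raw text lines, as it would be on a distributed filesystem; the structure lives entirely in the reading code. The field layout and sample values are hypothetical, chosen only for illustration:

```python
import csv

# Raw, schema-less storage: just lines of text, as on a distributed filesystem.
raw_lines = [
    "AAPL,2014-06-02,628.65",
    "MSFT,2014-06-02,40.79",
]

# The "schema" exists only in the application: it decides at read time
# that each line means (ticker, date, price).
def read_trades(lines):
    for row in csv.reader(lines):
        yield {"ticker": row[0], "date": row[1], "price": float(row[2])}

trades = list(read_trades(raw_lines))
```

Every consumer of the data has to re-implement (and keep in sync) this interpretation step, which is exactly the extra work on the way out described above.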
MarkLogic represents a completely different approach to data management, one that takes into account the fact that each data collection can carry its own structure. The choices in the past have been to force a common structure on all the data or to do away with structure altogether. MarkLogic’s document data model lets you have many structures – or, as we like to say, it is schema-agnostic.
This model lends itself to efficient transformations using a variety of tools that can be applied at any time during the lifecycle of a document – upon ingestion, during data access, or anywhere in between. The beauty of this approach is that it doesn’t confine you to any one choice. So schema-on-read carries a very distinct meaning in the context of MarkLogic, one that is much richer than the term would have you believe.
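The schema-agnostic idea can be sketched in a few lines of Python. This is not MarkLogic’s API – just a hypothetical mini collection in which differently shaped documents coexist, queries filter on whatever fields a document happens to have, and a transformation is applied at access time rather than forced at ingest:

```python
# Hypothetical document collection: each record keeps its own structure.
docs = [
    {"type": "trade", "ticker": "AAPL", "qty": 100},
    {"type": "customer", "name": "Acme Corp", "region": "EMEA",
     "contacts": [{"email": "ops@acme.example"}]},
]

# Schema-agnostic query: match on whatever fields are present.
def find(collection, **criteria):
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

# A transformation applied at read time; it could equally run at
# ingestion or anywhere in between.
def enrich(doc):
    out = dict(doc)
    out["enriched"] = True
    return out

trade_docs = [enrich(d) for d in find(docs, type="trade")]
```

The point is that neither the query nor the transformation required declaring a common schema up front; each document brought its own structure along.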
Another distinction is that MarkLogic actually supports more than just a document model. As of version 7, MarkLogic is also a semantic RDF triple store, which means it supports a hybrid model that allows structure to be expressed in more than one way.
Semantic RDF triples provide an expressive, flexible way to describe relationships between data elements and records – relationships that can easily change over time. So in addition to transforming the data at any point during its lifetime, we can decorate it with triples along the way. That’s a very powerful approach to data transformation and structuring, and a far cry from some of the common notions of schema-on-read discussed earlier. I like to think of it as “schema on demand.”
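A toy sketch shows why triples are such a flexible way to decorate records. Again, this is plain Python rather than a real triple store or SPARQL, and the subjects and predicates are invented for illustration: each fact is a (subject, predicate, object) tuple, new relationships can be appended at any time, and queries are simple pattern matches with wildcards:

```python
# Hypothetical triples decorating existing records: (subject, predicate, object).
triples = [
    ("trade:42", "executedBy", "trader:alice"),
    ("trade:42", "involves", "instrument:AAPL"),
    ("trader:alice", "worksFor", "desk:equities"),
]

def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in store
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# A relationship added later, without reshaping the records it describes.
triples.append(("instrument:AAPL", "listedOn", "exchange:NASDAQ"))
```

Because the relationships live alongside the records instead of inside them, the description of the data can grow and change over time without touching the documents themselves – which is the essence of schema on demand.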