For decades now, the database world has been oriented toward the schema-on-write approach: first you define your schema, then you write your data, and when you read your data back, it comes out in the schema you defined up front. This approach is so deeply ingrained in our thinking that many people would ask, “how else would you do it?” The answer is schema-on-read.
Schema-on-read follows a different sequence: just load the data as-is and apply your own lens to it when you read it back out. You might say, “OK, fine. But why would you want to do that?” There are several really compelling reasons. I’ll cover the main ones here.
- More and more these days, data is a shared asset among groups of people with differing roles and interests who want to get different insights from that data. With schema-on-write, you have to think about all of these constituencies in advance and define a schema that has something for everyone but isn’t a perfect fit for anyone. When you’re talking about huge volumes of data, that just isn’t practical. With schema-on-read, you aren’t stuck with a one-size-fits-all schema; you can present the data in whatever form best fits the queries being issued (see the sketch after this list). And by the way, if you do schema-on-write and develop a structure that you think fills the needs of all of your user categories, I guarantee a new category will emerge. With schema-on-read, you’re not tied to a predetermined structure, so you can present the data back in the schema that is most relevant to the task at hand.
- The next benefit is closely related. One of the places where projects often go off the rails is when multiple datasets are being consolidated. With schema-on-write, you have to do an extensive data-modeling job and develop an über-schema that covers all of the datasets you care about. Then you have to think about whether your schema will handle the new datasets that you’ll inevitably want to add later. If you’re lucky enough to get through that process, Murphy will strike again and you’ll be asked to add, change, or drop a column (or two or three). With schema-on-read, this upfront modeling exercise disappears.
- Here’s the biggest benefit in my mind. The problems I mentioned above are so burdensome that they can sink a data project or push the time-to-value past the point of relevance. With a schema-on-read approach, you can load your data as-is and start getting value from it right away. That matters when you’re dealing with structured data, and it matters even more when you’re dealing with semi-structured, poly-structured, and unstructured data, which makes up the vast majority of data by volume.
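To make the idea concrete, here is a minimal sketch of the schema-on-read pattern. It’s deliberately generic Python rather than MarkLogic code, and the record shapes, field names, and “lens” functions are all made up for illustration. The point is simply that the raw records are stored exactly as they arrive, and each audience gets its own view at read time.

```python
import json

# Hypothetical raw feed: records arrive in different shapes with extra fields.
# With schema-on-read, we store them exactly as they arrive.
raw_records = [
    '{"id": 1, "name": "Acme Corp", "billing": {"city": "Austin", "country": "US"}}',
    '{"id": 2, "customer_name": "Globex", "region": "EMEA", "notes": "legacy import"}',
    '{"id": 3, "name": "Initech", "country": "US", "contacts": ["pat@initech.example"]}',
]
datastore = [json.loads(r) for r in raw_records]  # loaded as-is, no up-front schema

# A "lens" for the sales team: all they want is a name and a region.
def sales_view(doc):
    return {
        "name": doc.get("name") or doc.get("customer_name"),
        "region": doc.get("region")
                  or doc.get("country")
                  or doc.get("billing", {}).get("country"),
    }

# A different lens for the support team: all they want are reachable contacts.
def support_view(doc):
    return {"id": doc["id"], "contacts": doc.get("contacts", [])}

print([sales_view(d) for d in datastore])
print([support_view(d) for d in datastore])
```

If a new audience shows up tomorrow, you add another lens; nothing about the stored data has to change.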
At this point people often say, “Well sure, but you need a predefined schema or it will be slow.” That’s absolutely true for traditional technologies, but not for an Enterprise NoSQL database like MarkLogic. We are built from the ground up to excel at this approach. [Ed. There’s not enough room to go into how we accomplish that here, but if you’re curious, we’ve got a great paper you can read on the topic.]
The other important thing to keep in mind is that just because we don’t force you to do an extensive data-modeling exercise up front doesn’t mean you can’t learn from your data over time. Get your data loaded, start using it, and get value from it. Over time you may well find that you want to normalize certain aspects of your data or otherwise optimize your representation. With MarkLogic, that evolution can happen gradually as you gain real-world experience with your use cases and datasets. Imposing too much structure too soon, and trying to optimize before you really understand the bottlenecks, is a common trap. Schema-on-read helps you avoid it.
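As a hypothetical continuation of the earlier sketch, suppose that after living with the data you discover that nearly every query needs a normalized region value. You can promote that one field into the stored documents incrementally, without remodeling or reloading anything else. The field name and normalization rule below are invented purely for illustration.

```python
import json

# Documents loaded as-is, as in the earlier sketch.
datastore = [
    json.loads('{"id": 1, "name": "Acme Corp", "billing": {"country": "US"}}'),
    json.loads('{"id": 2, "customer_name": "Globex", "region": "EMEA"}'),
]

# Promote a frequently queried value into the stored representation,
# leaving every original field exactly as it was loaded.
def enrich_with_region(doc):
    enriched = dict(doc)
    enriched["region_normalized"] = (
        doc.get("region")
        or doc.get("country")
        or doc.get("billing", {}).get("country", "UNKNOWN")
    )
    return enriched

datastore = [enrich_with_region(d) for d in datastore]
print(datastore)
```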
Schema-on-read is just one of the ways that MarkLogic can help you solve problems that are a major challenge with traditional technologies.
Joe Pasqua
Joe Pasqua brings over three decades of experience as both an engineer and a leader. He has personally contributed to several game-changing initiatives including the first personal computer at Xerox, the rise of RDBMS in the early days of Oracle, and the desktop publishing revolution at Adobe. In addition to his individual contributions, Joe has been a leader at companies ranging from small startups to the Fortune 500.
Most recently, Joe established Neustar Labs, which is responsible for creating strategies, technologies, and services that enable entirely new markets. Prior to that, Joe held a number of leadership roles at Symantec and Veritas Software including VP of Strategy, VP of Global Research, and CTO of the $2B Data Center Management business.
Joe’s technical interests include system software, knowledge representation, and rights management. He has more than 10 issued patents, with others pending. Joe earned simultaneous Bachelor of Science degrees in Computer Science and Mathematics from California Polytechnic State University, San Luis Obispo, where he is a member of the Computer Science Advisory Board.