MarkLogic’s flexible support for schemas is an important feature that can help maintain data quality while avoiding the costly burden of complex enterprise data models. If you read the MarkLogic marketing materials, you’ll encounter the phrase “schema agnostic.” Many other NoSQL vendors, however, describe themselves as “schema free,” and as a result MarkLogic is sometimes mistakenly referred to in this way as well. The two phrases are not synonyms, and I’ll explore the distinctions between them and why you should care.
A schema is an enforceable set of rules about the structure of a database. Schemas are used to help maintain data quality, an abstract attribute that indicates the degree to which data is consistent and semantically correct. For example, a schema might define a customer record as a customer id, a name, and the date the record was created. Relational databases describe this using DDL (Data Definition Language), colloquially referred to as “create table” statements. XML has several schema definition languages, but the most widely used is XML Schema. The database uses the schema to reject data that doesn’t meet the schema’s requirements. A database with high data quality will provide more reliable and actionable information than a database with poor data quality.
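To make the idea of rejecting non-conforming data concrete, here is a minimal Python sketch of the customer record described above. It is an illustration only, not DDL or XML Schema: the names `CUSTOMER_SCHEMA`, `validate`, and `insert`, and the use of a plain list as the "table," are assumptions made for this example.

```python
from datetime import date

# Hypothetical schema for the customer record: field name -> required type.
CUSTOMER_SCHEMA = {
    "customer_id": int,
    "name": str,
    "created": date,
}


def validate(record, schema):
    """Return a list of rule violations; an empty list means the record conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    # Fields the schema does not know about are also treated as violations.
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def insert(table, record, schema=CUSTOMER_SCHEMA):
    """Store the record only if it satisfies the schema; otherwise reject it."""
    errors = validate(record, schema)
    if errors:
        raise ValueError("; ".join(errors))
    table.append(record)
```

With this sketch, inserting `{"customer_id": 1, "name": "Ada", "created": date(2020, 1, 1)}` succeeds, while a record with a string id or a missing creation date raises `ValueError` and never reaches the table.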
Originally, databases were largely schema-free. Unfortunately, this did not always work out well in large enterprise databases. Different programmers would make different assumptions about the expected contents of the database. A maintenance program might break the nightly invoice batch by adding invalid or unexpected data. Consequently, schemas became mandatory in relational databases, which came to dominate the database landscape in the 1980s and 1990s. Mandatory schema conformance both met a data quality need and allowed relational databases to store data efficiently on the limited hardware available.
The need for strict schema enforcement became enterprise orthodoxy. Developers have recently challenged that orthodoxy because designing, maintaining, and evolving schemas over time is a complex and costly burden. Changes in large production databases often require significant preparation and planning. In many cases relational databases have become sclerotic, with even poor structure being enforced by administrators who are pressed into service as gatekeepers, stopping or slowing changes because any change is perceived as inherently risky. In an era when significant opportunities may arise and disappear in the span of a few months, a complex, difficult-to-change data model may do more damage than just increasing development costs.
Many believe that strict schema enforcement was actually a band-aid over the real issue of multiple applications reading and writing shared data. Their solution is to abandon shared data and uniformly encapsulate data behind services. While I believe microservices offer tremendous potential, a service-only approach isn’t always feasible. The most common obstacles are limited interoperability among service frameworks for advanced features such as distributed transactions, and the cadre of legacy applications that currently share data. Also, data quality problems can still arise when bugs or poorly considered features introduce unwanted artifacts into the database.
A schema free database is not bound by any schema rules beyond correct syntax. While this does diminish the need for database specialists to tune a schema, and for paid gatekeepers to protect that schema, data quality can quickly deteriorate. A schema free database has no mechanism with which to enforce the rules of good structure at the database level. Some development libraries for those databases provide a way to enforce a structure in the application code. However, this is voluntary, and two applications may not share a consistent set of rules. Not being able to enforce structure at the database level may be just as limiting as requiring that structure always be enforced.
Schema agnostic databases are not bound by schemas but are aware of them, and specific schemas can be enforced at the database level if desired or necessary. Moreover, XML Schema, which MarkLogic uses, can be used to auto-generate libraries that express that structure in common languages such as Java. The distinction is that MarkLogic does not require you to adopt a schema, but if you have one (or many) and you wish to enforce it, you can. You can avoid the exercise of developing and enforcing a schema when it just isn’t warranted. It is also possible to enforce schema constraints only in certain environments, such as test environments, to validate that applications adhere to the prescribed data model before they enter production. In a nutshell, a schema agnostic database allows you to enforce a structure when business needs indicate that it’s necessary, but does not require it when the schema provides little value.
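The idea of optional, per-environment enforcement can be sketched in a few lines of Python. This is a conceptual illustration and not MarkLogic's API: `DocumentStore`, `require_fields`, and the field names are hypothetical, chosen only to show a store that accepts any document by default and enforces rules only when a validator is configured.

```python
from typing import Callable, List, Optional

# A validator takes a document and returns a list of violations (empty if valid).
Validator = Callable[[dict], List[str]]


class DocumentStore:
    """Hypothetical store: schema enforcement is opt-in, not mandatory."""

    def __init__(self, validator: Optional[Validator] = None):
        self.validator = validator
        self.documents: List[dict] = []

    def insert(self, doc: dict) -> None:
        # Enforce rules only when a validator has been configured.
        if self.validator is not None:
            errors = self.validator(doc)
            if errors:
                raise ValueError("; ".join(errors))
        self.documents.append(doc)


def require_fields(*names: str) -> Validator:
    """Build a simple validator that checks for required field names."""
    def check(doc: dict) -> List[str]:
        return [f"missing field: {name}" for name in names if name not in doc]
    return check


# No schema configured: any well-formed document is accepted.
flexible_store = DocumentStore()
flexible_store.insert({"name": "Ada", "nickname": "Countess"})

# Test environment: the same kind of store, with enforcement switched on.
test_store = DocumentStore(validator=require_fields("customer_id", "name"))
test_store.insert({"customer_id": 1, "name": "Ada"})
```

In this sketch, inserting `{"name": "Ada"}` into `test_store` raises `ValueError` because `customer_id` is missing, while `flexible_store` would accept the same document. The design choice being illustrated is that enforcement is a configuration decision rather than a property of the store itself.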
All of this discussion, at a high “nosebleed” level, has an impact on cost and agility. A database often outlives the applications that depend on it. A poorly crafted user interface may stay for years, but a database can haunt an enterprise for decades. The relational approach of mandatory schemas can lead to longer, more costly, and more complex development efforts. The other extreme, of never enforcing structure, can quickly lead down a path of poor data quality and a database of low-value data. Neither extreme is really tenable or appropriate in the long run. Clearly, a middle-ground solution that allows schemas to be enforced only when necessary is better, and that is exactly what MarkLogic offers.