The Secret to Surviving Painful Schema Changes — Even Weekly!!

January 03, 2017 Data & AI, MarkLogic

Some groups I’ve worked with take eight to ten weeks to deal with a schema change. The math doesn’t work. You will never catch up with that pace of change.

Schema changes are, in a word, BRUTAL.

As an application developer on a large team, they were a fact of life, but it still meant taking time to align our group’s code with the latest schema even if my team wasn’t using that information. Database tables are rigid, unforgiving beasts which must be fed, often with late night coding sessions. I hated when I had to manage schema changes. It felt like a tax on my productivity. No one likes taxes, but most agree they are a necessary evil. It was the price paid for a business-wide information model.

As you likely know, a database’s schema defines the name, type, format, and meaning of the data stored in a database. The schema is part of what helps turn data into useful information. When a schema is changed, it creates a ripple through all the applications that depend on that schema. With relational databases, a schema change can take weeks for developers to deal with while they adapt their code to the new model. As a long-time developer, I, too, have been guilty of forcing schema changes back on the business, so the change problem arises from both sides.

Let’s look at a simple example, a single telephone number may have been collected years ago when most people had one number, but now we need to differentiate between cell, home, and office. Now, every application that already uses a phone number will need some rewrite.

Old Customer Table

Column NameNull?Type
LastNameNot NullVARCHAR(30)
PhoneNumberVARCHAR(20)

Consider the change from old system to the new phone requirements. This is important because the new data will help solve real world problems. A cell phone is a good proxy for identity. A home phone can help identify household members. An office number can establish other relationships which may be useful to determine employment or distance to and from work. It makes sense to collect more than one phone number, but how do you do that on legacy systems without a tremendous impact on applications?

New Customer Table

Column NameNull?Type
LastNameNot NullVARCHAR(30)
HomePhoneNumberVARCHAR(20)
CellPhoneNumberVARCHAR(20)
WorkPhoneNumberVARCHAR(20)

The simple fact is that you can’t get around the schema hit on applications when using a table-based schema. We’ll get back to my phone example in a minute. There is a solution, but I want to examine this in a much larger project.

Supporting a Corporate-Wide Canonical Model

I was talking with a colleague of mine who is a solution architect working on an insurance company’s global data initiative. They are retooling their digital business, and at the heart of that work is schema governance. The goal is to have all aspects of the business sharing the same information model. They currently have dozens of data initiatives and application development projects using MarkLogic all sharing a common schema. In fact, they had been working on their canonical model for a few years, so they did have a bit of a head start in thinking about a global business view. What I found fascinating, from a developer’s perspective, is the frequency at which they were able to change their model.

As an information model gets used, it gets refined. Changes get made to better support the applications and the business, so being able to manage schema change is important for keeping development efforts on track.

Looking at the simple phone number example, by refining the information model to include cell phones, I can now build applications that can take advantage of that fact with the customer. I can send her a text, I can suggest she download an app, or I can see she is within a geographical location. This is an example of how business processes can evolve – and the data model has to be refined to support them.

When you have a corporate-wide canonical model, you run right into the problem of developer team efficiency and keeping up with businesses canonical data models. Development teams are trying to solve a problem (or two!) which is separate from managing all these schema changes. Checkpoints are good — but also disruptive. Here’s an interesting piece on how DBA’s are falling way behind … due to all the schema changes.

Getting back to my solution architect friend, he was describing the frequency of changes being propagated around the business: weekly rollups of schema changes within sub-projects and monthly publishing of the master model business wide. Imagine, if your project’s relational schema changed weekly. Some groups I’ve worked with take eight to ten weeks to deal with a schema change, the math doesn’t work, and you’d never catch up with that pace of change.

Schema Changes Simplified Using Multi-Model Database

The key difference here is that this company isn’t using a relational database, they are using MarkLogic.

The reason their development teams are so agile in the face of frequently changing schema is the fact that XML and JSON are self-describing. Every field in XML and JSON has an element name — and a value. So those names and values tell you what it is. And it is not fixed, so a little bit of variance in one document will not throw the whole thing. If the change comes from a different part of the business that I don’t care about — I can disregard it. Relational is not forgiving; all the data needs to match the new rules.

These documents hold all the information on how to use them. When a new schema is published, it is used with the data you have. You don’t have to re-load or change your data to fit the new view. Teams can independently add or change elements in complex models without interfering with each other at all. Even if there are overlapping concerns, data unification can be done when needed and you don’t have to stop work to reformat data. This is the power that comes from having each record or document describe its contents.

Returning to the simple phone number example, in XML you might use a detailed model like the one below.

<phone> 
  <country-code>1</country-code>
  <area-code>212</area-code>
  <line-number>555-1212</line-number>
</phone>

It’s easy to understand, has the details and covers the need of the original app. I can search the XML for records and easily find them.

Now when addressing the requirement change, tracking multiple phones, and indicating if they are home, office, cellular….With XML this is trivial. It can even be done without breaking existing queries code. I just need to annotate or add an element to the phone document.

<phone type="cellular"> 
  <country-code>1</country-code>
  <area-code>212</area-code>
  <line-number>555-1212</line-number>
</phone>
<phone type="work" > 
  <country-code>1</country-code>
  <area-code>212</area-code>
  <line-number>555-8899</line-number>
  <extension>x21</extension>
</phone>

The new annotation or added elements won’t break my existing application. I can update those apps later when they need to take advantage of the new information.

So does MarkLogic make schema changes fun now? No, of course not. And in truth, not every XML modeling change can be done this simply, but you would be surprised how many fit this pattern, as evidenced by my colleague’s story. Schema changes are always going to be part of business (if they aren’t, your business is dying). But MarkLogic takes it from a pain of 10 to an annoyance of 1.

Want to learn more?

  • Get the ETL Out of Here In this blog, Matt Allen provides a step-by-step on how you can significantly reduce the cost and time associated with ETL (and schema changes!)
  • Progressive Transformation On-Demand Tutorial – MarkLogic University provides this 16-minute overview of how to do progressive transformation. You’ll get a deeper dive, including a look at the queries to run.
  • Flexible Data Model Data sheet on how MarkLogic provides native storage for JsoN, xMl, rDF, geospatial, and large binaries (e.g., PDFs, images, videos) to make it easy to get all of your data in, and easy to make changes later on.

Derek Laufenberg