Relational databases have resulted in accidental complexity that keeps most organizations spinning in circles. Organizations simply cannot keep up with the many shapes, sizes, and types data that are quickly growing in volume and changing.
In the previous post, I discussed why today’s dynamic, constantly changing data is a problem for relational databases. In this post, I am going to discuss a somewhat related, but unique problem that is also not easy for relational databases to handle. It’s the problem of “heterogeneous data.” This term is a fancy way of talking about data variety.
Organizations used to store only some key transactional data and a few basic things about their customers. Today, however, organizations can no longer cherry-pick a few key pieces of data. They need to store just about everything. As the cost of storage drops (even for SSDs), organizations are doing just that. There is also an expectation from customers, partners, and regulators that the organization should store everything in a usable format that will benefit them as well.
When considering the explosion of heterogeneous data, or data variety, it’s important to understand that the problems with structured data are just as significant as the problems with unstructured data. Today, organizations struggle to handle the many shapes, sizes, and types of structured data that are quickly growing in volume and changing. New applications, mergers and acquisitions, and re-purposing of data are common reasons that lead to the disparity of structured data.
Problems with Structured Data: Data Silos and ETL
In most organizations, it is the structured data that is causing the bulk of the overhead and it’s because relational databases have trouble with managing data that isn’t uniform. In other words, this data is also heterogeneous. Relational databases require pre-defined schemas before loading data and any changes that are required to handle a new data source are cumbersome and result in increasing schema complexity. To get around this, it is often faster and easier to just setup another database, which results in data silos and ETL.
Data silos are a huge problem today and they are even reported to be the number one impediment to Big Data success. This is obvious when looking at most complex enterprise architecture diagrams that show incompatible legacy systems woven together with other legacy systems to create a complex, brittle architecture in which data is un-shareable and un-usable.
In most organizations, only a few experts may still be around to understand how it maps to the intricate business rules of the organization. It is thus not surprising that for most business intelligence initiatives, the majority of time is just spent identifying and profiling data sources (Forrester, 2015).
Data silos were not an intentional part of the design, but the result of short-term solutions. Most databases are only designed to support a certain application or certain type of data. To get data out of those databases and use it for another purpose in another database, a process of ETL (Extract, Transform, Load) must occur to ensure it matches the schema of the new target database. ETL occurs frequently in most organizations, creating another data silo each time it is done.
As silos proliferate it becomes more and more difficult to maintain and connect them. Eventually, developers begin using “duct tape” maintenance code to connect various applications together, avoiding the true source of the problem. This only creates more complexity and eventually something either stops working, developers get too frustrated and leave, or new projects are slowed down to such an extent that progress becomes hopeless.
Relational databases have resulted in accidental complexity that keeps most organizations spinning in circles. The below diagram illustrates the problem.
Organizations face a growing inability to handle their disparate, varied, and changing data.
Let’s consider something else that further accentuates the problem: naming conventions with relational databases. And let’s consider something that relational database should be pretty good at handling, like a column for “revenue.” Unfortunately, in a large organization, each department may have completely different understandings of what “revenue” means and how it is calculated and reported. This means that in the physical models for the databases used in the organization, there may be hundreds of tables that all define “revenue” differently.
Any basic entity can be described any number of ways, and this is a huge problem when it comes to aggregating and reconciling data from various sources. This is not a new problem. It’s been causing DBAs headaches for years. The problem is just much more amplified today. One real example of this naming problem comes from a recent talk I heard at the NoSQL Now/ Smart Data conference in San Jose. One presenter remarked that a bank in Norway had 31 different definitions of the word “nominal amount.” I feel really bad for the folks on that project.
Problems with Unstructured Data: Lots of Labor and Compromise
While most organizations are hung up handling their structured data, the important unstructured data that accounts for 80 percent of all organizational data is completely ignored. As the graph below shows, there is a disconnect between database spending and how much data is unstructured. You can see from the graph that nearly 95 percent of total database spending is on relational databases (IDC, June 2014), while unstructured data is largely orphaned. Only 5 percent of database spending goes directly to non-relational databases designed to handle unstructured data!
Data on spending is from IDC’s June 2014 report on the database market. The statistic about 80% of data being unstructured is a widely accepted statistic though some estimates say it is even higher.
The growing amount of unstructured data presents a problem for relational databases. The rows and columns of a relational database are ideal for storing sets of values, but most information is composed of much more than that. Consider something like a person’s medical record. It is incredibly heterogeneous. It includes values (name, date of birth), relationships (to family members or care providers, to symptoms and medications), geospatial data (addresses), metadata (provenance, security attributes), images (CAT scan), and free text (doctors’ notes, transcripts).
Above is a screenshot of the Practice Fusion EHR user interface showing various types of data within an EHR.
Now, imagine putting all of that data into a Microsoft Excel spreadsheet…
That is exactly what it’s like trying to put this type of data into a relational database. It’s not easy and it requires a lot of ingenuity, and many difficult choices:
- Should large blocks of text be broken up or stuffed into a cell in the table?
- What about storing new data sources that come in later?
- How many columns should there be for metadata?
- What about the relationships between various entities?
- What about the structure within the document?
- What indexes should be created?
- What if I want to filter the data by an element that is not defined by a row or column?
Regardless of the amount of labor and compromise that has been put into trying to make the relational database work for everything, the fact remains that it was not designed for heterogeneous data.
Read the next blog in this series that discusses why relational databases are not designed for scale or download the white paper, Beyond Relational, which covers all of the reasons why relational databases aren’t working in more depth.
All posts in this series:
- RDBMS Are Not Designed To Handle Change
- RDBMS Are Not Designed for Heterogeneous Data
- RDBMS Are Not Designed For Scale
- RDBMS Are Not Designed for Mixed Workloads
- RDBMS Are A Mismatch for Modern App Development
Matt Allen
Matt Allen is a VP of Product Marketing Manager responsible for marketing all the features and benefits of MarkLogic across all verticals. In this role, Matt interfaces with the product and engineering team and with sales and marketing to create content and events that educate and inspire adoption of the technology. Matt is based at MarkLogic headquarters in San Carlos, CA and in his free time he is an artist who specializes in large oil paintings.