Of Data Warehouses, Data Marts, Data Lakes … and Data Hubs

October 07, 2021

Data is created once, and ideally used many times and in many ways. Aggregating and connecting data creates more value. Our thinking about the best way to do this evolves over time.

“The right information in the right hands at the right time” may sound simple, but these platforms are serious investments in time and money. The business rationale is clear: invest in improved data access, and we’ll see the payback. Usually there are one or two burning use cases of intense interest, but there are often more candidates behind those.

The Early Days of Data Aggregation

Our first model, the data warehouse, has been with us for quite a while.

A typical use case would be to load tabular data from finance, manufacturing, sales, etc. and get a cross-functional view of the business using reporting tools.

The early going was not easy. I clearly remember many struggles with now-familiar issues: data integrity, performance, security, and so on.

But things improved to the point where these systems went from simple reports to real-time analytics (OLAP) and automated decisions, hence the term operational data warehouse. For example, if you saw heavy demand for a particular item on your website, the system could automatically step up supply, or perhaps change the price.
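To make that idea concrete, here is a minimal sketch of the kind of rule an operational warehouse might run. The table name, columns, threshold, and actions are hypothetical illustrations, not any particular product's schema.

```python
# Minimal sketch of an "operational" decision rule on top of a warehouse.
# The web_orders table, threshold, and follow-up actions are hypothetical.
import sqlite3


def flag_hot_items(conn: sqlite3.Connection, threshold: int = 1000) -> list[str]:
    """Return SKUs whose order count in the last day exceeds a demand threshold."""
    rows = conn.execute(
        """
        SELECT sku, COUNT(*) AS orders_last_day
        FROM web_orders
        WHERE order_ts >= datetime('now', '-1 day')
        GROUP BY sku
        HAVING COUNT(*) > ?
        """,
        (threshold,),
    ).fetchall()
    return [sku for sku, _count in rows]


def act_on_demand(hot_skus: list[str]) -> None:
    for sku in hot_skus:
        # In a real deployment this might call a supply-chain or pricing API.
        print(f"Demand spike for {sku}: request restock and review price")
```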

Data warehouses continue to be useful, but it’s important to point out that most are designed to work with simpler forms of data, usually a dump of relational tables from some other primary system.

An interesting variant — data marts — arose when primary data warehouses weren’t meeting the needs of particular business functions.

These people wanted the freedom to work with their data in their own way to meet their own particular needs, as the shared and standardized solution wasn’t working for them.

In some industries, data marts became so popular that proliferation became a problem. Expense aside, getting to a shared truth across different functions became a serious issue, as each was working with “their” data.

This, in turn, introduced an interesting discussion around enterprise data standards vs business unit requirements, a hot topic that continues to this day.

And Then Data Became a Science

The game changed when data science became all the rage. A new crop of data scientists was coming up with amazingly impactful insights and getting a lot of executive attention.

These data scientists had found a way to work with messy, disparate — yet still simple — data sources (usually web-based), and connect them in a way that delivered unique insights.

They were working with data in a new way, and wanted a new platform.

Enter the data lake, designed to support teams of serious practitioners doing advanced analytics and machine learning. They do this by assembling toolchains of mostly open-source tools on top of a big file system. If you are a data geek like me, this is seriously cool stuff.
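For a feel of the file-oriented style this implies, here is a minimal sketch assuming event data has already landed as Parquet files; pandas stands in for whatever engine a given team actually prefers, and the paths and columns are made up.

```python
# Sketch of data-lake-style access: the "tables" are just files on disk or in
# object storage, and the analyst chooses the tools. Paths and columns are
# hypothetical.
from pathlib import Path

import pandas as pd


def load_events(lake_root: str) -> pd.DataFrame:
    """Read every Parquet file under a raw-events folder into one DataFrame."""
    files = sorted(Path(lake_root, "raw", "web_events").glob("*.parquet"))
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)


# An analyst might then clean, join, and model the result however they like:
# events = load_events("/data/lake")
# daily_users = events.groupby(events["ts"].dt.date)["user_id"].nunique()
```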

But, as before, most data lakes are built around simpler forms of data.

Enter the Data Hub

Everything so far has been about aggregating simpler forms of data. What if some of the data is more complex?

Complexity can arise either because the data itself is inherently complex (think documents or forms, for example), or because there are many different simpler data sources that don't follow any standard. Both situations are common in a wide variety of industry segments.

There’s another important form of complexity as well: balancing the needs of individual business functions against enterprise requirements without recreating the data proliferation problem.

Ideally, there would be a shared “gold standard” of enterprise data that business functions could “lens” in any way they need without making yet another copy of the data.
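A rough sketch of the idea, assuming a canonical customer record kept as JSON: the record itself stays in one place, and each business function applies its own "lens". The field names and the two lenses below are hypothetical.

```python
# Sketch of "lensing" a shared gold-standard record without copying the data.
# The canonical record and both views are hypothetical examples.
canonical_customer = {
    "id": "C-1001",
    "name": {"first": "Ada", "last": "Lovelace"},
    "addresses": [{"type": "billing", "country": "UK"}],
    "orders": [{"sku": "SKU-9", "amount": 120.0}],
    "consent": {"marketing": False},
}


def finance_lens(c: dict) -> dict:
    """Finance cares about spend, not marketing consent."""
    return {
        "customer_id": c["id"],
        "total_spend": sum(o["amount"] for o in c["orders"]),
    }


def marketing_lens(c: dict) -> dict:
    """Marketing sees contactability, not order-level detail."""
    return {
        "customer_id": c["id"],
        "full_name": f'{c["name"]["first"]} {c["name"]["last"]}',
        "can_contact": c["consent"]["marketing"],
    }


print(finance_lens(canonical_customer))
print(marketing_lens(canonical_customer))
```

Each lens is just a projection over the same record, so there is no second copy of the data to keep in sync.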

A decent data hub analogy in the physical world might be a massive distribution center. Truckloads of products come in from producers, and they’re repackaged and rebundled depending on who’s buying what.

Complexity is inherent: stuff arrives packaged in a variety of forms you don’t have much control over. You have to ingest, store, and process it all, though thankfully there are bar codes. If you’re Amazon, repackaging and redistribution is done on demand, with a high degree of automation.

Doing something similar in the enterprise data world with potentially sensitive data adds a few wrinkles: provenance, compliance, security, and so on.

Why This Is Important

Larger organizations are quite familiar with the pros and cons of data warehouses and data marts. Both have been around for a while. It’s also likely that they’ve set up environments specifically for the analytics team, hence a data lake in some form. All good. All work well with simpler data types.

Very often left underserved are business functions that need to work with complex data: either the data itself is complex (e.g., a document), or it’s complex because there are multiple sources that don’t follow much of a standard. And, quite typically, these functions have a legitimate need to view information in a way that wasn’t considered when enterprise standards were set.

That’s where a data hub platform makes sense: a real-time logistics center that accepts data sources in any form, helps make important data connections, and serves the data up on demand, depending on what business users need.

Like the data warehouses before them, data hub platforms can support operational use cases: applications that scale and can automate human decision-making. A single data API conveniently speeds development.
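As a rough illustration of why a single data API speeds development, here is what a consuming application might look like. The host, endpoint, parameters, and response shape are hypothetical assumptions for the sketch, not a specific product's API.

```python
# Hypothetical consumer of a data hub's single search API.
# The host, path, query parameters, and response fields are illustrative.
import requests

HUB = "https://datahub.example.com"


def find_customers(query: str, token: str) -> list[dict]:
    """One call against the hub, regardless of how many upstream systems fed it."""
    resp = requests.get(
        f"{HUB}/v1/search",
        params={"q": query, "format": "json"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])
```

The point of the sketch is that the application developer writes against one endpoint and one security model, rather than integrating with each source system separately.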

How People Get There

It’s interesting to see how people come to the conclusion that familiar data warehouses, data marts, or even data lakes won’t meet all of their needs.

All of these platforms were designed to handle simpler forms of data: tables, well-structured objects, etc. As you try to use them with complex data, there’s eventually a dawning realization that complex data really is different: you ingest and process it differently, you organize and access it differently, and so on.

Once you realize complex data is different, you’re better equipped to find solutions. And that’s where MarkLogic comes in.

Chuck Hollis

Chuck joined the MarkLogic team in 2021, coming from Oracle as SVP Portfolio Management. Prior to Oracle, he was at VMware working on virtual storage. Chuck came to VMware after almost 20 years at EMC, working in a variety of field, product, and alliance leadership roles.

Chuck lives in Vero Beach, Florida with his wife and three dogs. He enjoys discussing the big ideas that are shaping the IT industry.