As data storage options evolve and become more complex, questions arise as to which approach is the right one. Arguments for or against a particular option aren’t always easily defined. It’s important to make comparisons between the different systems, database types, and storage formats, especially within the context of your organization’s specific data requirements. Let’s start with a quick table comparing a MarkLogic Data Hub with a data warehouse, and then look more generally at the differences between data hubs and data warehouses.
| | MarkLogic Data Hub | Data Warehouse |
|---|---|---|
| Use Cases | | |
| Data Model | | |
| Search & Query | | |
| Data Ingestion | | |
| Data Quality | | |
| Data Curation | | |
| Security | | |
| Scalability | | |
| Deployment | | |
| Maturity | | |
Data warehouses are “observe the business” data stores designed for analyzing data that often comes from upstream “run the business” transactional systems. Their purpose is to provide analysts an aggregate, cross-cutting view of the data.
Data warehouses use a relational model in which data is managed in highly structured rows and columns. The data structure, or schema, is defined in advance (a.k.a. schema on write) and optimized for fast analytical queries using SQL. Analytical queries usually involve joining, aggregating, and filtering the data.
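As a rough illustration of schema on write, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a warehouse engine. The `customers` and `orders` tables and all of their values are hypothetical, chosen only to show a schema declared before loading and an analytical query that joins, filters, and aggregates.

```python
import sqlite3

# Hypothetical tables and values, for illustration only.
conn = sqlite3.connect(":memory:")

# Schema on write: the structure is declared before any data is loaded.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        region      TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      REAL NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 120.0), (11, 2, 75.5), (12, 3, 30.0), (13, 1, 5.0)],
)

# A typical analytical query: join, filter, and aggregate.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.amount >= 10
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# A row that does not fit the declared schema is rejected at write time.
try:
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (14, 1, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The point of the last step is that the schema is enforced when data is written, which is what makes the later queries predictable and fast to plan.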
While data warehouses have existed for decades, today’s data warehouses are purpose-built for the cloud. Examples such as Snowflake and Redshift are popular reincarnations of traditional data warehouses like Netezza and Teradata. Snowflake, in their own words, is “glorified SQL.” These cloud-native data warehouses are fully managed and provide cloud scale and cloud economics. They have also evolved to provide some support for JSON. Their core use case is still the same, however: they support enterprise BI and analytics on relational data.
Let’s consider a typical example of how a data warehouse is used. Imagine a large bank is running real-time trading systems to handle transactions. Those transactions happen in multiple OLTP (Online Transaction Processing) systems across the bank and are then aggregated into a central OLAP (Online Analytical Processing) data warehouse using ETL tools to extract, transform, and load the data.
The warehouse is used for further back-end processing (e.g., trade reconciliation), analysis (e.g., aggregate risk exposure), and reporting (e.g., regulatory agency inquiries).
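The bank example can be sketched as a tiny ETL job. Everything here is an assumption made for illustration: the two source extracts, their field names, the `trades` table, and the use of `sqlite3` in place of a real warehouse. It is not how any particular ETL tool works, only the extract, transform, load shape.

```python
import sqlite3

# Extract: hypothetical records pulled from two OLTP trading systems
# that use different field names and types.
equities_trades = [
    {"trade_id": "E-1", "cpty": "ACME", "notional_usd": "1000000.00"},
    {"trade_id": "E-2", "cpty": "GLOBEX", "notional_usd": "250000.50"},
]
fx_trades = [
    {"id": "F-1", "counterparty": "ACME", "amount": 500000},
]


# Transform: map each source into the warehouse's single, fixed schema.
def transform_equity(rec):
    return (rec["trade_id"], "equities", rec["cpty"], float(rec["notional_usd"]))


def transform_fx(rec):
    return (rec["id"], "fx", rec["counterparty"], float(rec["amount"]))


# Load: write the conformed rows into the central warehouse table.
def run_etl(conn):
    conn.execute("""
        CREATE TABLE trades (
            trade_id     TEXT PRIMARY KEY,
            source       TEXT NOT NULL,
            counterparty TEXT NOT NULL,
            notional_usd REAL NOT NULL
        )
    """)
    rows = [transform_equity(r) for r in equities_trades]
    rows += [transform_fx(r) for r in fx_trades]
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
run_etl(warehouse)

# Analysis: aggregate exposure per counterparty across both systems.
exposure = warehouse.execute("""
    SELECT counterparty, SUM(notional_usd)
    FROM trades
    GROUP BY counterparty
    ORDER BY counterparty
""").fetchall()
```

Note that the transform functions have to exist, and agree on one target schema, before a single row can be loaded.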
Data hubs are data stores that act as a stable integration hub in a hub-and-spoke architecture and provide a centralized view of your most important data assets. They use a multi-model database to store multi-structured data of various types, and also have the tools to curate that data (enriching, mastering, harmonizing). They are also operational and transactional, meaning they can power transactional applications, be used for advanced analytics, or simply feed other downstream systems.
While they can serve as systems of record, data hubs usually act as a shared integration point in an architecture, where they are used to create an organization’s 360-degree view. As a rule of thumb, a data hub is not a drop-in upgrade or replacement for a data warehouse. Data hubs and data warehouses can easily coexist, and MarkLogic customers often use both together.
Compared to data warehouses, data hubs provide greater agility, have built-in data curation tools, and are operational (not just analytical).
Data hubs provide agile DataOps. They make it possible to apply the principles of agile development to managing data in the data layer. This is possible because data hubs do not require a strict schema to be defined in advance, a requirement that forces a waterfall approach. Instead, raw data can be loaded into a data hub as is. The raw data can then be curated and made fit-for-purpose for downstream use. The process is often referred to as “ELT” because the data is loaded first, then transformed iteratively to meet the needs of the business. Schemas can be defined for the curated data or at query time (a.k.a. schema on read).
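The load-first, transform-later idea can be sketched in plain Python. This is a toy under stated assumptions, not MarkLogic's API: `raw_store`, `load_as_is`, `read_customer`, and the sample documents are all hypothetical names and data introduced for illustration.

```python
import json

# Stands in for the hub's collection of raw documents (hypothetical).
raw_store = []


def load_as_is(payload, source):
    """Load step: keep the document exactly as received, plus its origin."""
    raw_store.append({"source": source, "doc": json.loads(payload)})


# Two sources describe a customer with different shapes; both load unchanged.
load_as_is('{"custName": "Ada Lovelace", "zip": "02139"}', "crm")
load_as_is(
    '{"customer": {"name": "Grace Hopper"}, "postal_code": "10001", "tier": "gold"}',
    "billing",
)


def read_customer(entry):
    """Schema on read: project a raw document onto a common shape at query time."""
    doc = entry["doc"]
    name = doc.get("custName") or doc.get("customer", {}).get("name")
    postal = doc.get("zip") or doc.get("postal_code")
    return {"name": name, "postal_code": postal, "source": entry["source"]}


customers = [read_customer(e) for e in raw_store]
```

Because the raw documents are retained, `read_customer` can be revised later (for example, to start surfacing `tier`) without reloading anything, which is where the iterative part of ELT comes from.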
Data hubs also excel when there is ambiguity. They support scenarios when there are unknown, complex data sources that may need to be streamed in (or batch loaded), and unknown use cases for how the data will be used later.
The reason data hubs handle ambiguity well is that they index everything and provide search-style querying immediately after ingesting the data. Data hubs also have built-in tools to resolve the ambiguity over time, as downstream use cases become concrete and define how source data needs to be harmonized and curated.
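To make "index everything, then search right away" concrete, here is a minimal inverted-index sketch. It is a simplification written for this article, not a description of how MarkLogic's indexes are implemented; `ToyIndex`, `iter_values`, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def iter_values(node):
    """Yield every scalar value in a nested JSON-like structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_values(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_values(value)
    else:
        yield node


class ToyIndex:
    """Indexes every word of every value at ingest time, with no schema."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)  # token -> ids of docs containing it

    def ingest(self, doc_id, doc):
        self.docs[doc_id] = doc
        for value in iter_values(doc):
            for token in re.findall(r"\w+", str(value).lower()):
                self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every word in the query."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = ToyIndex()
index.ingest("claim-1", {"claimant": "Ada Lovelace", "notes": ["water damage", "kitchen"]})
index.ingest("policy-7", {"holder": {"name": "Ada Lovelace"}, "type": "home"})
index.ingest("claim-2", {"claimant": "Grace Hopper", "notes": ["fire damage"]})

ada_hits = index.search("ada lovelace")
damage_hits = index.search("damage")
```

The three documents have different shapes, yet all are searchable as soon as they are ingested, before anyone has decided how they should be harmonized.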
Here are some examples of the integration challenges that a data hub can resolve:
Data hubs are operational. They can provide a view of the business that is kept up to date in real time, and can even write back to the upstream system when necessary. By allowing real-time updates with transactional support, data hubs provide a reliable data store in which direct updates may be made to integrated data without hurting data governance and accuracy.
Here are some of the signs that indicate a data hub is a good choice for your architecture:
Our customers typically use MarkLogic Data Hub Service for use cases such as building a unified view, search and discovery, and operational analytics.
Data warehouses are proven in the enterprise and almost all organizations have one or more data warehouses, and often a number of data marts that have been spun off them. Data warehouses will always be useful when data is highly structured and well-defined, and when the warehouse’s purpose is also well-defined.
If all you need to do is run fast SQL queries over rows and columns, then a data warehouse is a great solution. Data warehouses are optimized for loading structured data and querying with SQL, and because of their dominance across the enterprise for the past 30+ years, there is an abundance of people with data warehouse and SQL skills.
So, if you are happy with your data warehouse and you don’t have challenges with data integration, there is no reason to change!
Data hubs and data warehouses can easily coexist, and our customers often use both together.
In most cases, organizations have existing data warehouses, but then a new use case pops up that requires integrating data from those warehouses. They don’t want to spend a lot of time and money on ETL and data modeling to build a common schema that integrates it all.
To solve this problem, organizations can employ a data hub to integrate data from those siloed warehouses (and any other data silos). From there, the data hub can power applications, feed curated data to another data warehouse downstream, or offload it into a file system optimized for low-cost storage.
So, the data warehouse continues to be an important part of the architecture, but the data hub serves to make the overall data-integration process more agile and trusted.
We have many customers who chose to supplement or replace their data warehouses with a MarkLogic Data Hub. Some examples include AIRBUS, Northern Trust, Hannover Re, and Chevron.