All large organizations have massive amounts of data, and it is usually spread across many disparate systems. This was rarely a conscious choice; it is the accumulation of pragmatic tradeoffs. Data silos are a form of technical debt, and they are multiplying with the adoption of Software as a Service (SaaS) applications and other cloud offerings, increasing friction between the business and IT. Integrating those data silos is notoriously difficult, and a traditional data warehouse approach faces clear challenges in doing so. For that reason, IT organizations have sought modern approaches to get the job done (at the urgent request of the business).
This comparison covers three modern approaches to data integration: data lakes, data virtualization (or federation), and data hubs. All three approaches simplify self-service consumption of data across heterogeneous sources without disrupting existing applications. However, each approach involves trade-offs, and the approaches are not mutually exclusive: many organizations continue to use their data lake alongside a data hub-centered architecture.
The comparison that follows evaluates MarkLogic Data Hub, data lakes, and data virtualization across nine dimensions: data ingestion, data model, search and query, operational capabilities, curation (harmonization, enrichment, mastering), security, scalability, performance, and deployment.
A data lake is a central repository that can store data at any scale and in any structure. Data lakes became popular with the rise of Hadoop, whose distributed file system (HDFS) made it easy to move raw data into one central repository where it could be stored at low cost. Data in a lake may not be curated (enriched, mastered, harmonized) or searchable, and analyzing or operationalizing it usually requires other tools from the Hadoop ecosystem in a multi-step process. In exchange, data lakes have the advantage of requiring little up-front work when loading data.
Data lake use cases include serving as an analytics sandbox, training machine learning models, feeding data prep pipelines, or just offering low-cost data storage.
A few years ago, three main vendors contended for the Hadoop market: Cloudera, Hortonworks, and MapR. Today, only Cloudera remains, following its merger with Hortonworks and MapR’s fire sale.
For many organizations, object stores like Amazon S3 have become de facto data lakes, supporting the move to the cloud from an on-premises Hadoop landscape.
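To illustrate how little front-end work an object-store lake demands, here is a minimal Python sketch that lands raw, schemaless records in S3 with boto3. The bucket name, key layout, and record shapes are illustrative assumptions, not a prescribed design.

```python
# Minimal sketch: loading raw, schemaless records into an S3-backed data lake.
# Bucket name, key prefix, and record shapes are illustrative assumptions.
import json

import boto3

s3 = boto3.client("s3")

raw_records = [
    {"source": "crm", "payload": {"customer": "ACME", "tier": "gold"}},
    {"source": "erp", "payload": {"cust_name": "ACME Corp", "region": "EMEA"}},
]

# No upfront schema work: each record is written as-is and interpreted
# later by whatever tool reads it (schema-on-read).
for i, record in enumerate(raw_records):
    s3.put_object(
        Bucket="example-data-lake",  # hypothetical bucket
        Key=f"landing/raw/{i}.json",
        Body=json.dumps(record).encode("utf-8"),
    )
```

Note how nothing in this step harmonizes the two record shapes; that work is deferred to whoever reads the data later.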
Besides the Hadoop core, there are many other related tools in the Apache ecosystem. For example, Spark and Kafka are two popular tools for processing streaming data and performing analytics in an event-streaming architecture (they are commercialized by Databricks and Confluent, respectively).
A detailed review of those tools is out of scope for this comparison, but in general they complement a data hub for most use cases. They manage streaming data, yet they still need a database: Kafka, for example, has no data model, no indexes, and no way to query data. As a rule of thumb, an event-based architecture or analytics platform is more trusted and more operational with a data hub underneath it than without one.
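As a rough sketch of that division of labor, the following Python snippet consumes events from Kafka (via the kafka-python client) and persists them in a database, where they gain a schema, an index, and a query interface. The topic name, broker address, and table layout are assumptions, and SQLite stands in for the hub's underlying database.

```python
# Sketch of the pattern described above: Kafka moves events, but a database
# is still needed to model, index, and query them. Topic name, broker
# address, and table schema are illustrative assumptions.
import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python client

db = sqlite3.connect("hub.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, type TEXT, body TEXT)"
)
db.execute("CREATE INDEX IF NOT EXISTS idx_events_type ON events (type)")

consumer = KafkaConsumer(
    "customer-events",                      # hypothetical topic
    bootstrap_servers="localhost:9092",     # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Persist each event so it becomes indexable and queryable,
    # capabilities Kafka itself does not provide.
    db.execute(
        "INSERT OR REPLACE INTO events VALUES (?, ?, ?)",
        (event["id"], event["type"], json.dumps(event)),
    )
    db.commit()
```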
Data virtualization creates virtual views of data stored in existing databases. The physical data does not move, but you still get an integrated view of it in the new virtual data layer. This approach is often called data federation (or a virtual database), and the underlying databases are the federates.
For example, you may have a few Oracle and SAP databases running and a department needs access to the data from those systems. Rather than physically moving the data via ETL and persisting it in another database, architects can virtually (and quickly) retrieve and integrate the data for that particular team or use case.
With data virtualization, queries run against the underlying databases. Newer virtualization technologies handle query execution planning and optimization with increasing sophistication: they may use cached data in memory or integrated massively parallel processing (MPP), and the partial results are then joined and mapped into a composite view. Many newer data virtualization technologies can also write data, not just read it. Newer solutions also show advances in data governance, such as masking data for different roles and use cases and using LDAP for authentication.
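The following toy Python sketch shows the federation pattern in miniature: each underlying database executes its part of the query, and the virtual layer joins and maps the partial results into a composite view. The two SQLite files stand in for, say, an Oracle and an SAP federate, and the schemas are assumptions.

```python
# Toy illustration of federation: the data stays in the underlying
# databases, each one executes its part of the query, and the results are
# joined in the virtual layer. Files and schemas are assumptions.
import sqlite3

orders_db = sqlite3.connect("oracle_extract.db")    # stand-in for an Oracle federate
customers_db = sqlite3.connect("sap_extract.db")    # stand-in for an SAP federate

# Push each sub-query down to the source that owns the data.
orders = orders_db.execute(
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
).fetchall()
customers = dict(
    customers_db.execute("SELECT id, name FROM customers").fetchall()
)

# Join and map the partial results into a composite view.
composite_view = [
    {"customer": customers.get(cust_id, "unknown"), "total": total}
    for cust_id, total in orders
]
print(composite_view)
```

Real virtualization products do this planning and pushdown automatically, but the shape of the work is the same: sub-queries at the sources, composition in the virtual layer.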
One of the major benefits of data virtualization is faster time to value. Because the data is not physically moved, these solutions require less work and expense before you can start querying, and they are less disruptive to your existing infrastructure.
Another major benefit is that data virtualization gives users the ability to run ad hoc SQL queries on both unstructured and structured data sources — a primary use case for data virtualization.
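To give a feel for that experience (using DuckDB as a stand-in query engine, not an actual virtualization product), the sketch below runs an ad hoc SQL query directly over raw JSON files. The file paths and field names are assumptions carried over from the earlier lake example.

```python
# Ad hoc SQL over non-relational data, in the spirit of the paragraph above.
# DuckDB is a query engine, not a virtualization product; it stands in here
# to show the experience. Paths and field names are assumptions.
import duckdb

result = duckdb.sql(
    """
    SELECT payload.region AS region, COUNT(*) AS n
    FROM read_json_auto('landing/raw/*.json')
    GROUP BY region
    ORDER BY n DESC
    """
).fetchall()
print(result)
```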
Examples of companies offering stand-alone data virtualization solutions are SAS, Tibco, Denodo, and Cambridge Semantics. Other vendors such as Oracle, Microsoft, SAP, and Informatica embed data virtualization as a feature of their flagship products.
Data hubs are data stores that act as an integration point in a hub-and-spoke architecture. They physically move and integrate multi-structured data and store it in an underlying database.
Because the data is physically consolidated, curated, and indexed, a data hub can act as a strong complement to data lakes and data virtualization by providing a governed, transactional data layer. We discuss this in more depth below.
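As a flavor of the curation work a data hub performs, here is a minimal Python sketch that harmonizes records from two hypothetical sources into one canonical shape while keeping the raw source alongside it for lineage. The field mappings and envelope shape are illustrative assumptions, not MarkLogic's actual envelope format.

```python
# Minimal sketch of data hub curation: harmonize records from different
# sources into one canonical shape, keeping the untouched original for
# lineage. Field mappings and the canonical model are assumptions.
def harmonize(record: dict, source: str) -> dict:
    if source == "crm":
        canonical = {"name": record["customer"], "segment": record.get("tier")}
    elif source == "erp":
        canonical = {"name": record["cust_name"], "region": record.get("region")}
    else:
        raise ValueError(f"unknown source: {source}")
    # Curated instance plus raw source, so consumers query one model while
    # auditors can still see exactly what arrived.
    return {"instance": canonical, "source": source, "raw": record}

print(harmonize({"customer": "ACME", "tier": "gold"}, "crm"))
print(harmonize({"cust_name": "ACME Corp", "region": "EMEA"}, "erp"))
```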
A data hub is a good choice for your architecture when you need data that is physically integrated, curated, secure, and ready for operational use.
Our customers typically use the MarkLogic Data Hub Platform for use cases such as building a unified view, operational analytics, content monetization, research and development, industrial IoT, regulatory compliance, ERP integration, and mainframe migrations.
Data lakes are best for streaming data, and they serve as good repositories when organizations need a low-cost option for storing massive amounts of data, structured or unstructured. Most data lakes are backed by HDFS and connect easily to the broader Hadoop ecosystem, which makes them a good choice for large development teams that want to use open-source tools and need a low-cost analytics sandbox. Many organizations rely on their data lake as their “data science workbench” to drive machine learning projects, where data scientists need to store training data and feed tools such as Jupyter and Spark.
Data virtualization is the best option for certain analytics use cases that do not require the robustness of a data hub for data integration. Virtualization layers can be deployed quickly, and because the physical data never moves, they require little infrastructure provisioning at the start of a project. Another common use for data virtualization is letting data teams run ad hoc SQL queries on top of non-relational data sources.
Data hubs and data virtualization are two different approaches to data integration and may compete for the same use cases. We find that customers who are using a data hub usually do not need to implement data virtualization as well; the data hub covers almost all of the same benefits. For instance, many MarkLogic customers have built metadata (or content) repositories that virtualize their critical data assets using MarkLogic Data Hub.
That said, it is possible to treat a MarkLogic Data Hub as a data source to be federated, just like any other data source. For example, MarkLogic Data Hub can be used to integrate data from multiple sources and can be accessed as a federated data source using tools like Spark for training and scoring machine learning models.
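As a sketch of that pattern, the snippet below pulls curated records from a hub over REST and flattens them into a frame suitable for model training. MarkLogic does expose search through a REST API, but the endpoint, port, credentials, and response shape used here are assumptions that vary by deployment.

```python
# Sketch: treating a curated data hub as one more source for ML training.
# Endpoint, port, credentials, and response shape are assumptions that
# vary by deployment.
import pandas as pd
import requests
from requests.auth import HTTPDigestAuth

resp = requests.get(
    "http://localhost:8011/v1/search",        # hypothetical Data Hub endpoint
    params={"q": "entity-type:Customer", "format": "json"},
    auth=HTTPDigestAuth("ml-reader", "password"),  # hypothetical credentials
)
resp.raise_for_status()

# Flatten the curated search results into a frame that Spark, scikit-learn,
# or a notebook can consume as training data.
rows = resp.json().get("results", [])
training_df = pd.DataFrame(rows)
print(training_df.head())
```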
Data lakes are very complementary to data hubs. Many of our customers have used the MarkLogic Connector for Hadoop to move data from Hadoop into MarkLogic Data Hub, or from MarkLogic Data Hub into Hadoop. The data hub sits on top of the data lake, exposing the high-quality, curated, secure, de-duplicated, indexed, and queryable data. Additionally, to manage extremely large data volumes, MarkLogic Data Hub provides automated data tiering to securely store and access data from a data lake.
Most commonly, customers either have an existing data lake and are in the process of migrating off of it, or they choose to off-load low-usage data into Hadoop to get the benefits of low-cost storage or to support machine learning projects.
When planning the next step for your architecture, weigh each of the approaches above against your use cases.
We have many customers who chose to supplement or replace their data lake or data virtualization with a MarkLogic Data Hub. Some examples you can explore include Northern Trust, AFRL, and Chevron.
See how MarkLogic simplifies complex data problems by delivering data agility.