Organizations have huge volumes of data available within their walls, and increasingly in cloud storage, but have trouble analyzing and leveraging all of this data. Two emerging technology patterns help make combined, enterprise-wide data useful and secure: data fabrics and Operational Data Hubs (ODHs).
This blog outlines the differences between data fabrics and data hubs and how they best work together.
The data pendulum swings periodically from local or departmental control to enterprise control. Early on, the shift was from mainframes to minicomputers, then client-server to personal computers, and recently from departmental data silos to shared cloud infrastructure.
Now data fabrics are emerging to once again swing the pendulum toward enterprise-wide value and shared data.
The fact that this pendulum swings back and forth between local, agile data and shared, enterprise data does not make any single shift less important or useful. Each shift, including the current one toward data fabrics, happens for good reasons and delivers real value.
Data fabrics and the current shift toward enterprise data are driven by a technology landscape that includes cloud computing, machine learning and analytics. Previously, agile methods, microservices and speed demanded departmental focus. This was a valuable shift, but it happened before the value of enterprise-wide data was understood and before enterprise technologies existed to analyze and benefit from “big data.”
Now that enterprises are focused on AI, machine learning and analytics, a new shift is required. Security has also become a paramount concern in recent years, with data breaches growing larger and more damaging every year—damaging to individuals, damaging to companies and sometimes damaging to national security. (In fact, China is believed to have acquired sensitive, personal data on every person in the United States with a security clearance.)
Let’s examine and enumerate these new, valuable opportunities that are driving the shift back toward enterprise data and control and away from siloed, departmental systems.
Cloud — The entire world is moving to the cloud. The cloud reduces costs and allows every enterprise to focus on its core mission, rather than on establishing a new process to hot-swap a failed RAID-10 disk drive. A new server can sometimes be stood up in a minute rather than weeks, and maintenance is outsourced.
Cloud also allows shared computing for efficiency. If a key operational system is busy during the day, and a batch-analytics system is busy overnight, these systems can share CPU and other resources to reduce overall cost and hardware needed. This sharing is now (largely) transparent in the cloud.
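The resource-sharing argument can be made concrete with a little arithmetic. The sketch below uses made-up hourly CPU demand figures (illustrative assumptions, not measurements from any real system) for a daytime operational system and an overnight batch system. It compares the capacity needed when each system has its own hardware with the capacity needed when both draw from one shared pool.

```python
# Hypothetical hourly CPU demand (in cores) over one day; the numbers are
# illustrative assumptions, not measurements.
HOURS = range(24)


def operational_load(hour: int) -> int:
    """Busy during business hours (08:00-17:59), nearly idle otherwise."""
    return 16 if 8 <= hour < 18 else 2


def batch_load(hour: int) -> int:
    """Busy overnight (22:00-05:59), nearly idle otherwise."""
    return 12 if hour >= 22 or hour < 6 else 1


# Separate hardware: each system must be sized for its own peak.
dedicated_cores = max(operational_load(h) for h in HOURS) + max(
    batch_load(h) for h in HOURS
)

# Shared pool: size for the peak of the combined load instead.
shared_cores = max(operational_load(h) + batch_load(h) for h in HOURS)

print(f"dedicated: {dedicated_cores} cores, shared: {shared_cores} cores")
# prints "dedicated: 28 cores, shared: 17 cores"
```

Because the two peaks fall at different times of day, the shared pool only has to cover the larger combined hour rather than the sum of both peaks.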
Also, cloud computing feeds into our next driver: analytics.
Analytics — Machine learning, AI and analytics are showing remarkable promise and value. Cross-domain analytics and insight are now possible due to the scale provided by the cloud as well as maturing technologies and a renewed industry focus.
Data lakes were a big part of this shift. For about three years, Hadoop was expected to usher in a new age of big data. While data lakes have arguably moved through the “hype curve” into the valley of disillusionment at this point, they did kick off a new era in analytics, insight and big data.
The promise of big data—even without data lakes—requires wide access to data.
Security — As data is consolidated, it will yield more benefit for the enterprise. Unfortunately, it is also a data bonanza for any attacker who penetrates the system, or who is an inside attacker with access to the consolidated (unsecure) data store.
Security and policy are two of the limiting factors for data lakes. Consolidated data is more valuable to data scientists and is also more valuable to attackers and insider threats. Data lakes have scattered technologies that are difficult to secure and manage.
Data fabrics suffer from many of the same issues, but fortunately the next layers up can secure the data. Below we will discuss how data hubs secure data so that consolidating data actually increases security.
Operational, Shared Data — Data fabrics allow data sharing across silos—at least sharing of a sort. The data can be accessed across silos once exposed in a data fabric. But to operationalize the data, it must be understood, curated and mapped to common semantics that allow cross-domain analytics.
Once mapped or “harmonized,” data can be more easily secured and monitored for quality and policy. That is, it can be governed better as it is curated and harmonized. By using the right technology, the data can also be exposed as real-time services. Without real-time access, the entire data-fabric initiative amounts to a set of brittle batch ETL jobs that copy data from place to place and transform it along the way.
An ODH is a data integration pattern that works well on top of a data fabric to operationalize and secure data that would otherwise be accessible but not useful.
As covered above, the shortcomings of a data fabric are that data is accessible across silos but not yet harmonized, governed, secured or exposed as real-time services.
ODHs sit on top of a data fabric to address the next set of concerns and enable operational data sharing and analytics. The main features of a data hub include harmonization, policy application, validation and security, each described below.
The following diagram illustrates MarkLogic’s Operational Data Hub pattern:
What is not shown are the benefits of the data fabric—the ability to seamlessly access all of the relevant data. This is a big benefit, but not the full capability needed to accomplish the goals on the right of the diagram. These goals are:
Analytics and business intelligence — Data cannot be summed, averaged, categorized or otherwise analyzed if it is low quality, full of duplicates or varied in naming and semantics.
Business processes, data services and microservices — To operationalize data, it must be exposed by known APIs with known formats suitable for real-time consumption.
Transactional applications — Often, shared data is useful beyond analytics. For instance, many applications benefit from knowing the full picture of a person, place or thing, that is, a complete “object” or “entity.”
Downstream systems — Downstream systems need high-quality, deduplicated data in a known format. It is only half the problem to gather all of the data.
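As a rough illustration of what "deduplicated data in a known format" can mean in practice, the sketch below collapses records that refer to the same person. The `Customer` type, the sample records and the matching rule (same normalized email means same person) are all assumptions made for this example; real matching logic is usually far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    source: str
    name: str
    email: str


def match_key(record: Customer) -> str:
    """Normalize the matching field (assumed rule: same email = same person)."""
    return record.email.strip().lower()


def deduplicate(records: list[Customer]) -> list[Customer]:
    """Keep the first record seen for each match key, preserving input order."""
    seen: dict[str, Customer] = {}
    for record in records:
        seen.setdefault(match_key(record), record)
    return list(seen.values())


records = [
    Customer("crm", "Ada Lovelace", "ada@example.com"),
    Customer("billing", "A. Lovelace", " Ada@Example.com "),
    Customer("crm", "Alan Turing", "alan@example.com"),
]
unique = deduplicate(records)
print(len(unique))  # prints 2
```

Keeping the first record seen is the simplest possible survivorship rule; merging fields from all matching records is a common alternative.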
To accomplish these goals, the processes in the middle of the diagram are important:
Harmonization — Different data sets from different sources vary in structure, naming and semantics. Harmonization is the process of bringing data into common formats with common semantics. Things as simple as naming (firstName vs givenName) and as complex as overall risk or status must be harmonized.
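The simple end of that spectrum, field naming, can be sketched in a few lines. The source system names (`hr_system`, `crm`), the canonical field names and the mapping table below are hypothetical, chosen only to show the firstName vs givenName case; this is not any particular product's harmonization API.

```python
# Hypothetical per-source mappings from source field names to canonical names.
FIELD_MAPS = {
    "hr_system": {"firstName": "given_name", "lastName": "family_name"},
    "crm": {"givenName": "given_name", "surname": "family_name"},
}


def harmonize(source: str, record: dict) -> dict:
    """Rename known fields to canonical names; keep other fields and note the source."""
    mapping = FIELD_MAPS[source]
    harmonized = {mapping.get(key, key): value for key, value in record.items()}
    harmonized["source"] = source  # retain provenance for later governance
    return harmonized


a = harmonize("hr_system", {"firstName": "Grace", "lastName": "Hopper"})
b = harmonize("crm", {"givenName": "Grace", "surname": "Hopper"})
print(a["given_name"] == b["given_name"])  # prints True
```

Harmonizing something like overall risk or status would need business rules rather than a rename table, which is where most of the real effort tends to go.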
Policy application — Policies from retention and tracing to security can be applied to harmonized data using a small, coherent set of rules. It is too difficult (typically) to create rules and processes for data from every corner of the enterprise in non-harmonized forms.
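To show why a small rule set is enough once data is harmonized, here is a minimal sketch of a retention policy. It assumes every harmonized record carries a `type` and a `created` date; the record types and retention periods are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical retention policy: maximum age in days per harmonized record type.
RETENTION_DAYS = {"claim": 365 * 7, "web_log": 90}


def is_expired(record: dict, today: date) -> bool:
    """True when the record is older than the retention period for its type."""
    limit = timedelta(days=RETENTION_DAYS[record["type"]])
    return today - record["created"] > limit


records = [
    {"type": "claim", "created": date(2020, 1, 15)},
    {"type": "web_log", "created": date(2020, 1, 15)},
]
today = date(2021, 1, 15)
expired = [r["type"] for r in records if is_expired(r, today)]
print(expired)  # prints ['web_log']
```

One table and one function cover every source, because the rule only looks at harmonized fields rather than each source's own layout.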
Validation — Data quality and stewardship become easier when data is brought together. Quality, like policy, is best applied to harmonized data.
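A validation step over harmonized records might look like the following sketch. The canonical field names (`given_name`, `family_name`, `email`) and the deliberately loose email pattern are assumptions for this example, not a recommended production rule set.

```python
import re

# Loose, illustrative pattern: something@something.something, no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    for field in ("given_name", "family_name"):
        if not str(record.get(field, "")).strip():
            problems.append(f"missing {field}")
    email = record.get("email", "")
    if email and not EMAIL_PATTERN.match(email):
        problems.append("malformed email")
    return problems


good = {"given_name": "Grace", "family_name": "Hopper", "email": "grace@example.com"}
bad = {"given_name": "", "email": "not-an-email"}
print(validate(good))  # prints []
print(validate(bad))
```

Returning a list of problems rather than raising on the first one lets a data steward see everything wrong with a record in one pass.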
Security — Security is the most important aspect of policy and governance. Typically, bringing more data together makes it vulnerable and less secure. Many groups will be reluctant or will block data sharing if there is no guarantee of security once shared.
Instead, by securing data in an ODH with coherent policies on top of harmonized data, more users can be directed to a single secure data hub. This actually reduces the access and risk that come with multiple unsecured, ungoverned data sets with broad access.
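One way to picture a coherent policy over harmonized data is a single table of which roles may read which fields. The roles, field names and sample record below are hypothetical, and this sketch is not a description of MarkLogic's actual security model; it only shows how one central rule set can serve every consumer.

```python
# Hypothetical policy: which harmonized fields each role may read.
READABLE_FIELDS = {
    "analyst": {"given_name", "family_name", "region"},
    "claims_officer": {"given_name", "family_name", "region", "ssn"},
}


def view_for(role: str, record: dict) -> dict:
    """Return only the fields the role may read; unknown roles see nothing."""
    allowed = READABLE_FIELDS.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "given_name": "Grace",
    "family_name": "Hopper",
    "region": "NE",
    "ssn": "000-00-0000",  # placeholder value, not real data
}
print("ssn" in view_for("analyst", record))  # prints False
print(view_for("guest", record))  # prints {}
```

Defaulting unknown roles to an empty view is a deny-by-default choice: a role must be granted access explicitly before it sees anything.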
Operational Data Hubs and data fabrics work well together. Data fabrics focus on the access of data—on bringing it together. Once the data is together, then it must be harmonized, cleaned, deduplicated, secured and shared.
The first step is data-fabric access to all data. Layering on security and governance and making the data actionable then requires a next phase: standing up one or more data hubs that provide secure, operational access to the data.
In short, data hubs are a critical component to deliver the data and promise of a data fabric.
Damon is a passionate “Mark-Logician,” having been with the company for over 7 years as it has evolved into the company it is today. He has worked on or led some of the largest MarkLogic projects for customers ranging from the US Intelligence Community to HealthCare.gov to private insurance companies.
Prior to joining MarkLogic, Damon held positions spanning product development for multiple startups, founding of one startup, consulting for a semantic technology company, and leading the architecture for the IMSMA humanitarian landmine remediation and tracking system.
He holds a BA in Mathematics from the University of Chicago and a Ph.D. in Computer Science from Tulane University.