When we talk about data governance, we’re typically looking at three key components: people, policies and technology. An often overlooked aspect of data governance is data architecture.
Data governance is tightly linked to data architecture and the technical solutions that can support it. Ideally, your data architecture can accommodate the data governance policy you want to implement. The reality, however, is that the data architecture your organization currently has in place will dictate a large part of the data governance choices you make.
This is especially true for businesses that rely on point-to-point integration, or application-to-application calls. The point-to-point approach creates a complex data ecosystem that requires continuous maintenance and scaling. While many of us are trying to be agile and take a pragmatic approach to improving data infrastructure, tying this to new company-wide business initiatives, it can be hard to shake the silos and siloed cultures that have been created over time.
While you may want to keep certain aspects of data management decentralized for agility, you also want to be able to apply data governance policies centrally and to all your data. For this, you first need your data to be uniform and managed in one place. A data hub can help your company integrate, store and analyze high-value information that can be used to make better business decisions.
According to the industry definition, a data hub is a data-centric architectural pattern that acts as the central hub for data exchange and agile data flows. The Progress MarkLogic team has always believed that adding a data storage component to the pattern strengthens its capacity to enable business cases more sustainably and long-term. So, let’s explore our definition of a data hub architecture:
A data hub is a centralized platform that consolidates and harmonizes data from multiple sources and makes it available to a wide variety of users and use cases. Because data hub endpoints can be anything from applications and workflows to databases and downstream systems, they enable streamlined access to data almost in real time and facilitate data sharing by connecting data producers with data consumers. This decreases the knowledge silo effect and accumulated technical debt.
Data hubs offer high performance, flexible schema and scalability and are best used in scenarios where an organization needs to integrate, distribute and govern large data sets across systems, for example after a merger. At their best, data hubs can work with any data format and consolidate data from any database, giving organizations a way to create golden records.
The data hub first emerged as a pattern due to a technological shift to NoSQL (multi-model, document and graph) databases. While relational databases still support about 70-80% of data workloads in most organizations, they lack flexibility and require extensive up-front data modeling to support common data integration use cases within a reasonable budget and timeframe. A multi-model database, by contrast, enables organizations to ingest raw data immediately, lower schema-design costs and deliver faster value for more use cases.
Unlike data warehouses or data lakes, a data hub is significantly more agile and can keep up with today’s fast-moving business cycles. A data hub provides transactional and analytical capabilities for both run-the-business and observe-the-business use cases. And it can accommodate the security and governance required for mission-critical data.
A data hub is a proven approach to simplifying an organization’s architecture, providing greater agility, lower costs and enhanced data governance.
Recent research by AWS showed that for over 60% of surveyed CIOs and CDOs, data governance was at the top of their priorities for 2023. Data and analytics (D&A) leaders spent more than 20% of their time on data governance initiatives.
This is because data represents the backbone of an organization’s decision-making process. The foundation of high-impact analytics is quality data, and without proper data governance practices in place, data is hard to access and will likely deliver poor results.
Another data quality-dependent initiative on data and analytics leaders' agendas is generative AI. 46% of them recognized that foundational data quality is a fundamental challenge that must be overcome before companies can take advantage of this new opportunity. While data quality, by definition, constitutes an essential goal of data governance, the rise of genAI adoption brings new demands to data governance and a company’s technology stack—specifically the ethical and unbiased use of data and an even greater focus on security and data access.
To add to that, generative AI hinges primarily on unstructured data—another aspect of data management many organizations have been postponing until now as it comes with its own set of challenges. Before 2023, only about 20% of organizations had the technology in place to make full use of unstructured data. Now, almost every CIO and CDO is looking to incorporate unstructured data into their data strategies and the infrastructure that can handle it to support their analytics and AI journey.
Compliance is also vying for the attention of D&A leaders as companies struggle to keep up with the increasing changes in regulated markets. Underneath all that, the main obstacle to an effective data governance implementation remains the decades-old challenge every organization is trying to fight off: silos.
Much of a company’s data was created for a specific purpose and was not formatted or linked to other data in a way that allows it to be used in other use cases.
When data changes in the primary source, the changes are often not propagated to other copies. As data is moved, it is often transformed and enriched in ways that can make it inconsistent throughout the organization. The diversity of data types makes the data hard to understand, combine and query. The security of the entire data infrastructure can be compromised by any silo. All of this can be disastrous and can lead to low-quality analytics and decisions.
With a data hub architecture, problems like inconsistently named elements and fields, varying formats, duplicate data and untracked changes to data records can be resolved.
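As a rough illustration of what resolving inconsistent names and formats can look like, here is a minimal Python sketch. The alias table, field names and date formats are hypothetical and chosen only for the example; they are not taken from any particular product.

```python
from datetime import datetime

# Hypothetical mapping from source-specific field names to canonical names.
FIELD_ALIASES = {
    "cust_name": "customer_name",
    "CustomerName": "customer_name",
    "dob": "birth_date",
    "DateOfBirth": "birth_date",
}

# Date formats we assume the source systems use (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(value: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each known format in order."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def harmonize(record: dict) -> dict:
    """Rename aliased fields, normalize dates and trim stray whitespace."""
    out = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key, key)
        if canonical == "birth_date":
            value = normalize_date(value)
        elif isinstance(value, str):
            value = value.strip()
        out[canonical] = value
    return out


# Two source systems describing the same customer in different shapes.
crm = harmonize({"cust_name": " Ada Lovelace ", "dob": "12/10/1985"})
billing = harmonize({"CustomerName": "Ada Lovelace", "DateOfBirth": "1985-12-10"})
```

Once both records share one shape, they can be compared directly, which is what makes later deduplication practical.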
The benefits of data hubs for effective implementation of your data and analytics governance policy can be summarized as follows:
Data governance has four main goals: data integration, quality, security and access. A data hub provides a connected approach to streamline and simplify how all policies and programs related to data and analytics governance operate, including federated access, master data management, data catalogs and more. Let’s look at how data hubs help support each of these pillars:
Data hubs offer the ability to easily load different data without endless preprocessing. This approach can satisfy demands for quick and reliable sharing of critical data across organizations as it can support the scalable plugging in of new data sources into an organization's data ecosystem, without incurring the complexity and extra costs of integration. It also enables operational efficiency by aligning and integrating previously siloed systems.
Data hubs can bring together data from multiple sources while deduplicating redundant data. Because data hubs are centralized stores of all data coming from various organizational sources, they enable organizations to harmonize, model, enrich and complete data records to make them fit for purpose for a variety of business use cases. This focused approach to governance enables consistency and trust of data, driving strategic business outcomes.
There are two aspects of security data hubs can support: controlling data access and auditing. Data hubs help centrally apply rules about who within an organization can access what data, assigning roles and privileges and redacting sensitive data. Data hubs can store ad-hoc metadata about data lineage, which means you can always trace back changes to data throughout its management cycle. Combined, those security features are critical for regulatory compliance.
Making decisions based on subsets of data can lead to bad choices. Being able to conveniently access all company data as an integrated whole is a critical goal of data governance and, by extension, a prerequisite for an organization to leverage its data for analytical and AI purposes. Data hubs offer the perfect architecture to facilitate democratized access to valuable data and speed up time to insight for business users and analysts.
While the tech stack is just a small part of the big data governance picture, a carefully chosen data management tool with built-in data governance capabilities is critical for the successful implementation of your data, AI and analytics strategy.
With the MarkLogic platform, organizations can build scalable data hubs for operational, analytical and generative AI purposes that will also support a robust data governance strategy. The powerful combination of a multi-model database, a data integration and modeling hub and high-performance built-in search helps organizations manage the full data supply chain with speed and agility.
Let’s explore the key capabilities of the MarkLogic Data Hub and how they can support your data governance implementation:
The MarkLogic Data Hub provides multiple tools to address data quality issues. Smart Mastering is a key capability built into the Data Hub that performs data deduplication upon data ingestion, rather than as a separate process.
With Smart Mastering, all versions of data used to create a golden record of an organization’s data can be stored with it so data stewards and others can easily verify results. The feature utilizes a user-defined model to determine the similarity between two objects and then can automatically merge them, mark them for review or conclude they are not the same entity.
For example, when determining if two customer IDs refer to the same person, the Data Hub Smart Mastering may compare addresses, age, social security records, medical conditions or other factors, depending on what data is available.
Smart Mastering maintains the original entity data so that merging decisions can easily be evaluated and reversed.
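To make the merge, review or no-match decision more concrete, here is a minimal Python sketch of threshold-based matching. It is not the Smart Mastering implementation or its API. The weights, thresholds and field names are hypothetical, and the string similarity comes from Python's standard `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical per-field weights and thresholds; a real model would be tuned by data stewards.
WEIGHTS = {"name": 0.4, "address": 0.4, "birth_year": 0.2}
MERGE_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.60


def field_similarity(a, b) -> float:
    """Similarity in [0, 1]; a missing value on either side counts as no evidence."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def match_score(left: dict, right: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * field_similarity(left.get(f), right.get(f)) for f, w in WEIGHTS.items())


def decide(left: dict, right: dict) -> str:
    """Map the score onto the three outcomes: merge, review or no-match."""
    score = match_score(left, right)
    if score >= MERGE_THRESHOLD:
        return "merge"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "no-match"


def merge(left: dict, right: dict) -> dict:
    """Build a golden record and keep the untouched sources so the merge can be reviewed or undone."""
    golden = {**right, **{k: v for k, v in left.items() if v is not None}}
    return {"golden": golden, "sources": [left, right]}


a = {"name": "Ada Lovelace", "address": "12 Oak St", "birth_year": 1985}
b = {"name": "Ada Lovelace", "address": "12 Oak St", "birth_year": None}
c = {"name": "Bob", "address": "9 Elm Ave", "birth_year": 2003}
```

With these inputs, `decide(a, dict(a))` lands in the merge band, `decide(a, b)` falls into the review band because the birth year is missing, and `decide(a, c)` is rejected. Keeping `sources` next to the golden record is what makes a merge reversible.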
The MarkLogic platform brings agility to data modeling and curation. In the MarkLogic Data Hub, data modeling is performed on an as-needed basis with initial data models reflecting immediate needs. The Data Hub provides an abstraction layer and easy-to-use interface where much of the modeling can be done by domain experts with limited need for IT involvement.
This data model includes the harmonization and data enrichment needed to bring incoming data sources into conformance with the data model. The resulting “golden records” are stored alongside the original as-is data.
As new needs are identified, the model is enhanced, and the data hub reflects the new model for both new and existing data. Because the original data is always easily available, existing documents and records can be reprocessed to always reflect the model’s current state.
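The idea of keeping the original data next to the curated record, and reprocessing when the model changes, can be sketched in a few lines of Python. The document layout, function names and model versions below are hypothetical and only illustrate the pattern; they do not reflect the Data Hub's actual document structure.

```python
from typing import Callable


def harmonize_v1(raw: dict) -> dict:
    """First model version: only a full name is needed."""
    return {"full_name": f"{raw['first']} {raw['last']}"}


def harmonize_v2(raw: dict) -> dict:
    """Enhanced model: also expose a normalized country code."""
    curated = harmonize_v1(raw)
    curated["country"] = raw["country"].strip().upper()
    return curated


def build_envelope(raw: dict, harmonize: Callable[[dict], dict], model_version: int) -> dict:
    """Store the curated record alongside an untouched copy of the source data."""
    return {"original": dict(raw), "curated": harmonize(raw), "model_version": model_version}


def reprocess(envelope: dict, harmonize: Callable[[dict], dict], model_version: int) -> dict:
    """Rebuild the curated record from the preserved original under a newer model."""
    return build_envelope(envelope["original"], harmonize, model_version)


raw = {"first": "Grace", "last": "Hopper", "country": " us "}
doc_v1 = build_envelope(raw, harmonize_v1, model_version=1)
doc_v2 = reprocess(doc_v1, harmonize_v2, model_version=2)
```

Because `reprocess` reads only from `original`, a model change never needs a trip back to the source system.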
This integration of modeling and harmonization is fundamentally different from the approaches used by most platforms and is a key differentiator, as it allows data hubs to be deployed quickly and then enhanced with little effort.
The MarkLogic Data Hub offers superb lineage capabilities. This is because the MarkLogic Server stores both the original “as-is” data and the derived curated data in the same underlying document.
Tracking of changes to data is incorporated into the data model as part of the modeling process. Lineage information is captured automatically, without the need for coding, and stored alongside the as-is and curated data so you can determine where a data value came from. Lineage information is stored in the PROV-O standard so it can be easily accessed and combined with lineage data from other systems.
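For readers unfamiliar with PROV-O, here is a small Python sketch of what a lineage record using its vocabulary might look like, expressed as a JSON-LD-style dictionary. The identifiers are made-up URNs and the exact shape is an assumption for illustration; the records the platform actually generates may be structured differently.

```python
import json
from datetime import datetime, timezone

# W3C PROV namespace; the "prov:" terms used below are defined by PROV-O.
PROV_CONTEXT = {"prov": "http://www.w3.org/ns/prov#"}


def lineage_record(curated_id: str, source_id: str, activity_id: str, when: datetime) -> dict:
    """Describe that a curated entity was derived from a source entity by an activity."""
    return {
        "@context": PROV_CONTEXT,
        "@graph": [
            {"@id": source_id, "@type": "prov:Entity"},
            {"@id": activity_id, "@type": "prov:Activity", "prov:used": {"@id": source_id}},
            {
                "@id": curated_id,
                "@type": "prov:Entity",
                "prov:wasDerivedFrom": {"@id": source_id},
                "prov:wasGeneratedBy": {"@id": activity_id},
                "prov:generatedAtTime": when.isoformat(),
            },
        ],
    }


# Hypothetical identifiers for a raw document, a harmonization step and its output.
record = lineage_record(
    curated_id="urn:example:curated/customer-42",
    source_id="urn:example:raw/customer-42",
    activity_id="urn:example:step/harmonize-customer",
    when=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
serialized = json.dumps(record, indent=2)
```

Because the terms come from a shared standard, a record like this can be merged with provenance emitted by other systems that also use PROV-O.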
MarkLogic Server also provides robust bitemporal functionality which can greatly enhance your organization’s ability to respond to regulators. Many compliance queries require organizations to show when information was first known and to be able to recreate that historical record in case of an audit or to perform analytics after the fact.
When there are schema changes involved, relational systems can require restoring an older version of the database and its software. While this sometimes works, it is cumbersome. By contrast, managing bitemporal data with timestamps is an extension of how the MarkLogic platform manages all the documents in a database.
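The bitemporal idea can be sketched independently of any database. In the Python example below, each version of a fact carries a valid-time range (when it was true in the real world) and a system-time range (when the database believed it). The class, field names and sample data are hypothetical and do not use MarkLogic's temporal API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

FOREVER = date.max


@dataclass(frozen=True)
class Version:
    value: str
    valid_start: date    # when the fact became true in the real world
    valid_end: date      # exclusive
    system_start: date   # when the database recorded this version
    system_end: date     # exclusive; FOREVER means "still the current belief"


def as_of(versions: list, valid_at: date, known_at: date) -> Optional[str]:
    """What did we believe on `known_at` about the state of the world on `valid_at`?"""
    for v in versions:
        if v.valid_start <= valid_at < v.valid_end and v.system_start <= known_at < v.system_end:
            return v.value
    return None


# On 2023-01-10 we recorded an address. On 2023-03-05 we learned the customer
# had actually moved on 2023-02-01, so the earlier belief was superseded.
history = [
    Version("12 Oak St", date(2023, 1, 1), FOREVER, date(2023, 1, 10), date(2023, 3, 5)),
    Version("12 Oak St", date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 5), FOREVER),
    Version("9 Elm Ave", date(2023, 2, 1), FOREVER, date(2023, 3, 5), FOREVER),
]

believed_in_february = as_of(history, valid_at=date(2023, 2, 15), known_at=date(2023, 2, 20))
believed_in_april = as_of(history, valid_at=date(2023, 2, 15), known_at=date(2023, 4, 1))
```

Here `believed_in_february` is "12 Oak St" and `believed_in_april` is "9 Elm Ave". The first answer is what an auditor needs when asking what the organization knew at the time, and no old database has to be restored to produce it.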
MarkLogic Server provides its primary security mechanisms at the database level, helping to simplify the creation of a secure data infrastructure. It provides a role-based access control model, which integrates with your existing security infrastructure, element-level security and compartment security for the simultaneous handling of complex data access rules.
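As a simplified picture of role-based access combined with element-level rules, consider the Python sketch below. The roles, field names and permission table are hypothetical, and the code filters in application memory purely for illustration. A database that enforces these rules itself, as described above, would apply them before data ever reaches the application. Compartment security is not modeled here.

```python
# Hypothetical mapping of roles to the document elements they may read.
ROLE_READABLE_FIELDS = {
    "analyst": {"customer_id", "region", "balance"},
    "support": {"customer_id", "name", "email"},
}


def visible_fields(roles: set) -> set:
    """Union of the elements readable by any of the user's roles."""
    allowed = set()
    for role in roles:
        allowed |= ROLE_READABLE_FIELDS.get(role, set())
    return allowed


def read_document(doc: dict, roles: set) -> dict:
    """Return only the elements the caller's roles are entitled to see."""
    allowed = visible_fields(roles)
    return {key: value for key, value in doc.items() if key in allowed}


customer = {
    "customer_id": 42,
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "region": "EMEA",
    "balance": 1250.0,
}

analyst_view = read_document(customer, {"analyst"})
combined_view = read_document(customer, {"analyst", "support"})
no_access_view = read_document(customer, {"intern"})
```

An unknown role such as `intern` sees nothing, which reflects a deny-by-default stance.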
Additionally, when you need to expose sensitive data while protecting privacy, for example when you need to do aggregates on sensitive PII data fields, MarkLogic offers advanced security features like redaction to dynamically change or mask the information.
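The following Python sketch shows the general idea of redaction: sensitive values are masked on the way out while non-sensitive fields remain usable for aggregates. The rules, field names and sample records are hypothetical; this is not MarkLogic's redaction feature or its configuration format.

```python
import re
from statistics import mean

# Matches US-style social security numbers and captures the last four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(value: str) -> str:
    """Keep only the last four digits of a social security number."""
    return SSN_PATTERN.sub(r"***-**-\1", value)


def mask_name(value: str) -> str:
    """Keep only the first character of a name."""
    return value[:1] + "***"


# Hypothetical redaction rules: field name -> masking function.
REDACTION_RULES = {"ssn": mask_ssn, "name": mask_name}


def redact(record: dict) -> dict:
    """Return a copy of the record with the rules applied; other fields pass through unchanged."""
    return {k: REDACTION_RULES[k](v) if k in REDACTION_RULES else v for k, v in record.items()}


patients = [
    {"name": "Ada Lovelace", "ssn": "123-45-6789", "age": 36},
    {"name": "Alan Turing", "ssn": "987-65-4321", "age": 41},
]

redacted = [redact(p) for p in patients]
average_age = mean(p["age"] for p in redacted)  # aggregates still work on redacted output
```

The originals in `patients` are left untouched, so redaction here is a view over the data rather than a destructive change.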
When security requirements are just too complex to handle with rules, applications often create a report with specific data in it and manually determine who should be allowed to have access to it. The MarkLogic query-based access control makes this easy to do.
MarkLogic also comes with full auditing abilities, including logging who looked at what documents. There are many regulatory and compliance data hub use cases where it is important to understand when users change or even look at data. Keeping track of failed logins is essential to spotting attempted security breaches. When users know that critical data is tracked at this level, it not only helps surface wrongdoing but, by providing a deterrent, helps prevent it.
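To show what such an audit trail lets you ask, here is a minimal in-memory Python sketch. The event names, class and threshold are hypothetical; a real deployment would rely on the database's own audit log rather than application code like this.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    user: str
    action: str      # e.g. "read", "update", "login-failed"
    target: str      # document URI, or "" for events without a document
    at: datetime


class AuditLog:
    """Append-only list of events with two example queries."""

    def __init__(self) -> None:
        self._events = []

    def record(self, user: str, action: str, target: str = "") -> None:
        self._events.append(AuditEvent(user, action, target, datetime.now(timezone.utc)))

    def who_read(self, target: str) -> list:
        """Users who looked at a given document, even if they changed nothing."""
        return sorted({e.user for e in self._events if e.action == "read" and e.target == target})

    def suspicious_logins(self, threshold: int = 3) -> list:
        """Users with at least `threshold` failed logins."""
        failures = Counter(e.user for e in self._events if e.action == "login-failed")
        return sorted(user for user, count in failures.items() if count >= threshold)


log = AuditLog()
log.record("alice", "read", "/customers/42.json")
log.record("bob", "read", "/customers/42.json")
log.record("bob", "update", "/customers/42.json")
for _ in range(3):
    log.record("mallory", "login-failed")
log.record("alice", "login-failed")
```

Here `log.who_read("/customers/42.json")` returns both readers, and `log.suspicious_logins()` flags only the account with repeated failures.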
For users to find answers to their complex questions, they need access to different types of rich data, including geospatial, relational, text and semantic data for greater context.
Most implementations with advanced analytics handle this with multiple data and technology stacks. Each data type is handled separately and the responsibility for integrating the data and the technologies used to access the data is left to the application development team.
Because of its ability to handle complex data and complex queries, MarkLogic Server allows queries to access many data types and return the results as SQL views or in other formats, like JSON, depending on user needs. With the MarkLogic Optic API, all your data can be queried together far more easily and with less expertise needed by your company. Any data modeled in the MarkLogic Data Hub is automatically available as SQL views.
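The general idea of exposing document data through a SQL view, and returning the same result as JSON, can be sketched with Python's built-in `sqlite3` module. SQLite is used here purely as a stand-in. This is not the Optic API, and the documents, table and view names are hypothetical.

```python
import json
import sqlite3

# Hypothetical nested documents, as they might sit in a document store.
documents = [
    {"id": 1, "customer": {"name": "Ada", "region": "EMEA"}, "order_total": 120.0},
    {"id": 2, "customer": {"name": "Grace", "region": "AMER"}, "order_total": 80.0},
    {"id": 3, "customer": {"name": "Linus", "region": "EMEA"}, "order_total": 30.5},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_rows (id INTEGER PRIMARY KEY, region TEXT, order_total REAL)")

# Project selected document properties into rows (done by hand in this sketch).
conn.executemany(
    "INSERT INTO order_rows VALUES (?, ?, ?)",
    [(d["id"], d["customer"]["region"], d["order_total"]) for d in documents],
)

# A SQL view that analysts can query without knowing the document structure.
conn.execute(
    """
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(order_total) AS revenue
    FROM order_rows
    GROUP BY region
    """
)

rows = conn.execute("SELECT region, revenue FROM revenue_by_region ORDER BY region").fetchall()

# The same result, returned as JSON for consumers that prefer it.
as_json = json.dumps([{"region": region, "revenue": revenue} for region, revenue in rows])
conn.close()
```

The point of the sketch is the separation of concerns: the projection from documents to rows is defined once, and SQL users then work against the view.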
In summary, data hubs help organizations increase return on all D&A investments through more effective and targeted efforts on implementing governance of D&A information assets. Data hubs also help reduce complexity and cost across overall information infrastructure and can be the first step to building a data fabric or a data mesh.
Watch the on-demand webinar “Data Hub Strategy for Effective AI and Analytics Governance” to learn more about the topic.
David Kaaret has worked with major investment banks, mutual funds, and online brokerages for over 15 years in technical and sales roles.
He has helped clients design and build high performance and cutting edge database systems and provided guidance on issues including performance, optimal schema design, security, failover, messaging, and master data management.