Four Biggest Challenges to Data Integration

March 14, 2018 Data & AI, MarkLogic

Data integration initiatives create untold burdens on IT teams. I reached out to four veterans of data integration to identify the tasks they thought caused the biggest challenges. Not surprisingly, answers centered on areas such as knowing your data and data mapping. But what also emerged were the cultural and organizational commitments to change – and to actually leveraging integrated data. The final burden was dealing with sometimes capricious and ever-changing deadlines – set by business, government regulators, and now even the public.

What do you think – do you agree?

Details – Buried Deep in the Data and Systems

The goal of data integration is to combine disparate sets of data into meaningful information. The tasks involved in accomplishing this goal generally follow common patterns, but they can quickly become as varied as the data sources themselves. Here are a few of the tasks that have proven to be most difficult:

  • Quick and accurate understanding of data sources. You may start by profiling the data and come to a general understanding, but you will invariably encounter a peculiar anomaly in the data source that forces you to rework your integration efforts. This might take the form of a data element that is intended for one purpose but contains data about something else.

(Example: In an RDBMS, you have a column defined with a data type of varchar(100) and a column name of product_description; perhaps the word “discontinued” (or “dis,” “disc,” etc.) is embedded to reflect the status of the product. Now the product_description column carries multiple meanings and kinds of content, and is inconsistent with what the column name suggests.)
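To see how such an anomaly might surface during profiling, here is a minimal sketch in Python. The table, column names, and keyword list are hypothetical, not drawn from any particular source system; the idea is simply to scan a free-text description column for embedded status markers:

```python
import re
import sqlite3

# Hypothetical status markers that may be embedded in a free-text description column.
STATUS_MARKERS = re.compile(r"\b(discontinued|disc|dis)\b", re.IGNORECASE)

def profile_overloaded_column(conn: sqlite3.Connection) -> list[tuple[int, str]]:
    """Return rows whose product_description appears to also carry product status."""
    suspect_rows = []
    cursor = conn.execute("SELECT product_id, product_description FROM products")
    for product_id, description in cursor:
        if description and STATUS_MARKERS.search(description):
            suspect_rows.append((product_id, description))
    return suspect_rows

# Example usage against an in-memory database (a stand-in for the real RDBMS):
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_description VARCHAR(100))")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Widget A"), (2, "Widget B - DISC"), (3, "Widget C (discontinued)")],
)
for product_id, description in profile_overloaded_column(conn):
    print(f"Possible overloaded description in row {product_id}: {description!r}")
```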

  • Handling changes in data over time. This is one of the more complex aspects of data integration, because different sources can be updated at different intervals. Sometimes source systems have very unusual processes for handling updates, which can invalidate assumptions typically made about relational databases, such as referential integrity. To combine data from multiple sources into a cohesive result, one must get down to detailed field-level mapping of the multiple data sources – all of which have varying consistency, definitions, and contexts.
  • Data mapping. Business analysis, domain expertise, and technical knowledge about source systems are required when mapping data from one source to another. While reconciling differences in naming conventions and data formats is a good first step, this task also requires understanding the relationships of one data set to another. The business rules embedded in the source system that produces the data must be considered when applying transformation logic to create the integrated data set (see the sketch after this list).
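To make field-level mapping and transformation logic concrete, here is a minimal sketch, again in Python and again with hypothetical field names and business rules. The point is that the mapping captures not just renames, but the source system's embedded rules – such as deriving a status from the overloaded description above:

```python
import re
from typing import Any, Callable

# Hypothetical mapping from a source system's fields to the integrated model.
# Each target field pairs a source field with a transformation that encodes
# the business rules embedded in that source system.
FieldMap = dict[str, tuple[str, Callable[[Any], Any]]]

STATUS_MARKERS = re.compile(r"\b(discontinued|disc|dis)\b", re.IGNORECASE)

def derive_status(description: str) -> str:
    """Business rule: a status marker embedded in the description means 'discontinued'."""
    return "discontinued" if STATUS_MARKERS.search(description or "") else "active"

PRODUCT_MAPPING: FieldMap = {
    "productId": ("product_id", str),                                          # normalize key to string
    "name":      ("product_description",
                  lambda d: STATUS_MARKERS.sub("", d or "").strip(" -()")),     # strip status noise
    "status":    ("product_description", derive_status),                       # derive status from overloaded field
}

def map_record(source_row: dict[str, Any], mapping: FieldMap) -> dict[str, Any]:
    """Apply the field-level mapping to one source record."""
    return {target: transform(source_row.get(source_field))
            for target, (source_field, transform) in mapping.items()}

# Example usage with a row as it might come out of the source RDBMS:
row = {"product_id": 42, "product_description": "Widget B - DISC"}
print(map_record(row, PRODUCT_MAPPING))
# -> {'productId': '42', 'name': 'Widget B', 'status': 'discontinued'}
```

In practice, a mapping like this is driven by a specification agreed with business analysts rather than hard-coded, but the shape of the problem is the same.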

These complexities demonstrate that a request to “bring data together” is no trivial exercise. When you attempt to combine sources of data that were designed and created in isolation from one another, it will be the details buried deep in the data and systems that challenge the progress toward creating one cohesive, reliable, resilient, and secure source of information.

Maureen Penzenik is a Solutions Architect at Northern Trust, focusing on Information Architecture, Data Integration, and the foundations of BI and Analytics. She previously held positions at RR Donnelley and McDonald’s Corporation, applying Data Warehousing and Business Intelligence methods and technologies.

Deadlines – Set by Governments, Departments or Consumers

Currently, the EU General Data Protection Regulation (GDPR) is the biggest challenge for companies handling data, since every company that handles EU residents’ personal data has to comply with the GDPR by May 25, 2018.

The GDPR was designed to harmonize data privacy laws across Europe, to protect and empower all EU citizens’ data privacy and to reshape the way organizations across the region approach data privacy.

To comply, companies need to:

  • Identify/classify shadow IT
  • Analyze compliance risks
  • Continuously monitor and report all existing information

As GDPR compliance is mandatory by law and penalties loom, many companies see implementation within the given timeframe as a burden. Further, the compliance requirements are not as clear as “do XYZ and you are bullet-proof compliant.” Thus, there is a great deal of uncertainty in the market about how to approach this and how to execute in order to be compliant.

Companies need to document how they process data, where they store it, how they obtained it, and that they have approval to process it. What’s more, they also need to establish processes to report the data they store about any EU citizen – in a given format and within a given period of time – at that citizen’s request. Likewise, deletion of data, and documentation of that deletion, must be executed on demand.

When you look at it from a technical perspective, this is a comprehensive data integration project. At a baseline, you’ll need to extract, link, search, and explore/monitor data from every silo and format the company has accumulated since it was founded.
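As a rough sketch of the baseline capability – answering a single data subject’s request across silos – consider the following Python snippet. The silo names, record shapes, and the email key used to link them are assumptions made purely for illustration:

```python
from typing import Any

# Hypothetical extracts from three silos, already pulled into simple dicts.
CRM_RECORDS = [{"email": "anna@example.eu", "name": "Anna", "consent": "newsletter"}]
BILLING_RECORDS = [{"email": "anna@example.eu", "invoice_id": "INV-17", "amount_eur": 99.0}]
SUPPORT_TICKETS = [{"email": "anna@example.eu", "ticket_id": 4711, "status": "closed"}]

SILOS: dict[str, list[dict[str, Any]]] = {
    "crm": CRM_RECORDS,
    "billing": BILLING_RECORDS,
    "support": SUPPORT_TICKETS,
}

def subject_access_report(email: str) -> dict[str, list[dict[str, Any]]]:
    """Collect every record linked to one data subject, grouped by source silo."""
    return {
        silo: [record for record in records if record.get("email") == email]
        for silo, records in SILOS.items()
    }

# Example: answer a data subject access request for one EU citizen.
print(subject_access_report("anna@example.eu"))
```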

The stakes are high – penalties for non-compliance fall into two tiers, assessed per infringement:

  • Up to €10 million, or 2 percent of annual global turnover – whichever is higher – for less severe infringements.
  • Up to €20 million, or 4 percent of annual global turnover – whichever is higher – for the most serious infringements.

Technologies that allow working with historical and operational data at the same time not only ease the burden of compliance, but also enable new revenue streams in data-driven businesses.

Alexander Deles, CEO – EBCONT, is a member of the management board of the EBCONT group of IT companies. He brings almost two decades of experience across different industries. EBCONT is a long-term MarkLogic partner and successfully delivers projects based on MarkLogic around the globe.

Data Maps – Determine the Relationships Between Entities & Identity Keys

As a consultant, I’ve worked with a number of Fortune 500 companies to help manage their data integration strategies better. There are multiple pain points that I’ve seen bring projects to a crawl. In general, the primary difficulty is not in the ETL. Importing a spreadsheet or extracting content from a relational database is old news, and there are multiple strategies for getting content into NoSQL or graph databases that, while not simple, are relatively mechanical.

The more complex issue involves bringing this information into a data store such as MarkLogic. The challenge comes in being able to handle resources, such as products, organizations, people, or similar entities, that are identified by different identity keys from different databases. Today, most master data management solutions use algorithms to try to map how closely two given entities match, but this is an area, in particular, where semantic technologies (a subfield of artificial intelligence) and graph databases (triple stores) shine.
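To illustrate the key-reconciliation problem, here is a minimal sketch in Python. The “same as” links stand in for what a matching algorithm, a curated mapping, or a triple store’s owl:sameAs statements would supply; a simple union-find then collapses them into a single canonical identity:

```python
# Identity keys for the same real-world customers, as three hypothetical systems see them.
SAME_AS_LINKS = [
    ("crm:cust-001", "billing:AC-9001"),   # produced by a matching algorithm or curated mapping
    ("crm:cust-002", "billing:AC-9002"),
    ("billing:AC-9001", "support:U-77"),   # chains of links are resolved transitively
]

parent: dict[str, str] = {}

def find(key: str) -> str:
    """Return the canonical representative for a key (union-find with path compression)."""
    parent.setdefault(key, key)
    while parent[key] != key:
        parent[key] = parent[parent[key]]   # path compression
        key = parent[key]
    return key

def union(a: str, b: str) -> None:
    """Record that two keys identify the same entity."""
    parent[find(a)] = find(b)

for left, right in SAME_AS_LINKS:
    union(left, right)

# All three keys now resolve to a single canonical identity:
print(find("crm:cust-001") == find("support:U-77"))   # True
```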

This becomes especially the case as 360° data initiatives become more prominent. Such initiatives, providing a global view of the data within an organization, should use semantics. Period. This means thinking about data holistically, and making an effort to create a simplified core model that can be a target for ingestion ETL from a wide variety of sources. It also involves baking in data governance, provenance, quality, and rights management as part of the overall design, as relational databases are notoriously bad at capturing any of this.
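Here is a minimal sketch of what such a simplified core model might look like as an ingestion target, with provenance and rights metadata baked in from the start; the entity, fields, and source names are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Provenance:
    """Where a canonical record came from and what may be done with it."""
    source_system: str
    source_key: str
    ingested_at: datetime
    rights: str = "internal-use"            # placeholder rights/consent marker

@dataclass
class Customer:
    """Simplified core model that every source is mapped onto."""
    canonical_id: str
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    provenance: list[Provenance] = field(default_factory=list)

def ingest_from_crm(row: dict[str, Any]) -> Customer:
    """Map one hypothetical CRM row onto the core model, recording provenance."""
    return Customer(
        canonical_id=f"customer/{row['cust_id']}",
        name=row["full_name"],
        attributes={"segment": row.get("segment")},
        provenance=[Provenance("crm", str(row["cust_id"]), datetime.now(timezone.utc))],
    )

print(ingest_from_crm({"cust_id": 1, "full_name": "Anna Muster", "segment": "retail"}))
```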

The payoffs, however, are worth it – it becomes possible to determine what is and is not quality data, to determine what data is applicable (or even available) by locality or time, and to build services that get at that data across multiple potential sources more easily. A similar process can be used for managing the data dictionaries and taxonomies your organization has accrued, which means you can also use semantics as a way to coordinate relational databases with far fewer headaches.

One final arena where semantics have simplified the burden of data integration is in providing better tools for managing controlled vocabularies and taxonomies. Machine learning relies heavily upon having good, complementary facet tables and facets, and as stochastic models become more complex, so, too, do the number and types of facet tables necessary to fully describe an information space. This is a natural (and early) use of semantic technologies, and the combination of machine learning and semantics will provide a powerful boost to computational business processing.
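As a small illustration, the sketch below normalizes free-text category labels from different sources onto a controlled vocabulary before they are used as facets or features; the vocabulary and labels are made up for the example:

```python
# Hypothetical controlled vocabulary: preferred facet value -> known variant labels.
CONTROLLED_VOCABULARY = {
    "beverages": {"beverages", "drinks", "bev"},
    "household": {"household", "home goods", "housewares"},
}

# Invert the vocabulary into a lookup table from variant to preferred term.
VARIANT_TO_FACET = {
    variant: preferred
    for preferred, variants in CONTROLLED_VOCABULARY.items()
    for variant in variants
}

def to_facet(raw_label: str) -> str:
    """Map a raw source label onto the controlled vocabulary, flagging unknown terms."""
    return VARIANT_TO_FACET.get(raw_label.strip().lower(), "unmapped")

# Labels as they arrive from different source systems:
raw_labels = ["Drinks", "home goods", "Beverages", "garden"]
print([to_facet(label) for label in raw_labels])
# -> ['beverages', 'household', 'beverages', 'unmapped']
```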

Thus, most of the pain points I’ve seen – master data management, 360° views, provenance/governance, and facet table problems – can be resolved with the proper use of semantic technologies.

An Invited Expert with the W3C, Kurt Cagle has helped develop XML, HTML, CSS, SVG, and RDF standards, and regularly writes on data modeling, architecture, and data quality for Data Science Central and LinkedIn. His company, Semantical LLC, provides expertise on building smart data applications and metadata repositories.

Readiness – Technological, Organizational and Cultural

Data integration should always be driven by concrete business needs: understanding customers better to sell more, improving cost efficiency, or meeting regulatory requirements. In other words, data integration tasks should be applied to grow, optimize, innovate, or protect.

And data integration is not only a technological issue but also a cultural and organizational one. All three need to be considered as a first stage before tackling data integration. Is my organization ready to handle it? Do we have the right people? Are the different areas prepared?

Next, you need to tie data integration to a data governance approach. Data quality and data lineage are crucial to success. In fact, a common trend in the market is to provide end-to-end data lineage, regardless of whether the data is transactional or informational/historical.
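As a rough illustration of record-level lineage, the sketch below wraps each (hypothetical) transformation step so that it appends a lineage entry to every record it touches; the step names and fields are assumptions for the example:

```python
from datetime import datetime, timezone
from typing import Any, Callable

Record = dict[str, Any]

def with_lineage(step_name: str, transform: Callable[[Record], Record]) -> Callable[[Record], Record]:
    """Wrap a transformation so it records a lineage entry on every record it produces."""
    def wrapped(record: Record) -> Record:
        result = transform(record)
        result["_lineage"] = list(record.get("_lineage", [])) + [{
            "step": step_name,
            "at": datetime.now(timezone.utc).isoformat(),
        }]
        return result
    return wrapped

# Two hypothetical pipeline steps, each recorded in the lineage trail.
normalize_name = with_lineage("normalize_name", lambda r: {**r, "name": r["name"].title()})
mask_email = with_lineage("mask_email", lambda r: {**r, "email": "***@" + r["email"].split("@")[1]})

record = {"name": "anna muster", "email": "anna@example.eu"}
print(mask_email(normalize_name(record)))
```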

The main barriers that can appear when facing data integration are:

  • Lack of readiness in departments to understand the value of integrated data
  • Missing or incomplete documentation for legacy applications
  • Wanting to migrate data – when reengineering is what is really needed
  • Not decommissioning legacy sources
  • Poor or non-existent management of customer data encryption and decryption in the cloud
  • Poor planning or management of time and people expectations
  • Lack of training and incentives for your employees to want to adopt a new integrated data scenario

Main tips to keep in mind:

  • Take an agile and iterative approach to get insights in the short term
  • Find people with the right skills AND technologies with the right capabilities (there are very few, in fact) that provide the necessary functionality for data integration
  • Use functional ontologies and an augmented analytics approach to speed up data integration
  • Close the loop. Training and support to help customers adopt new features – and to spare them the unnecessary effort of figuring out where the data is – are essential. Self-service tools combined with data dictionaries will help drive success, as do technologies that include data indexing.

As the Big Data Executive Director at everis, Juanjo Lopez leads the eDIN, a Center of Excellence around Data & Analytics, which offers services inside everis & NTT DATA Group worldwide and across all market sectors. Juanjo also belongs to the NTT DATA G1T, a worldwide steering committee defining the strategy and services offering of the Group around Data & Analytics.

Diane Burley

Responsible for overall content strategy and developing integrated content delivery systems for MarkLogic. She is a former online executive with Gannett with astute business sense, a metaphorical communication style and no fear of technology. Diane has delivered speeches to global audiences on using technologies to transform business. She believes that regardless of industry or audience, "unless the content is highly relevant -- and perceived to be valuable by the individual or organization -- it is worthless."