Sometimes, costs sneak up on us. What seems like an everyday annoyance can carry staggering cost implications for years.
Dirty data—data that is inaccurate, incomplete or inconsistent—is one of these surprises. Experian reports that companies across the globe believe, on average, that 26% of their data is dirty. This contributes to enormous losses: it costs the average business 15% to 25% of revenue, and the US economy over $3 trillion annually. Anybody who’s had to deal with dirty data knows how frustrating it can be, but when the numbers are added up, the scale of its impact can be difficult to wrap your head around.
Since dirty data costs so much—a sobering understatement—it is critical to understand where it comes from, how it affects business and how it can be dealt with.
According to Experian, human error contributes to over 60% of dirty data, and poor interdepartmental communication plays a role in about 35% of inaccurate data records. Intuitively, a solid data strategy should mitigate these issues, yet inadequate data strategy is itself a factor in 28% of inaccurate data.
When different departments enter related data into separate data silos, even a good data strategy isn’t going to prevent them from fouling downstream data warehouses, marts and lakes. Records can be duplicated with non-canonical data such as different misspellings of names and addresses. Data silos with poor constraints can store dates, account numbers or personal information in different formats, which makes them difficult or impossible to reconcile automatically.
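To make the reconciliation problem concrete, here is a minimal sketch of the kind of normalization that has to happen before records from two silos can be matched. The field names, formats and sample values are illustrative assumptions, not taken from any particular system.

```python
# Minimal sketch (illustrative field names and formats): normalize dates and
# names exported from two hypothetical silos so the records can be matched.
from datetime import datetime

def normalize_date(value: str) -> str:
    """Try a few common date formats and return ISO 8601, or raise."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def normalize_name(value: str) -> str:
    """Collapse case and whitespace so trivially different spellings match."""
    return " ".join(value.split()).title()

crm_record = {"name": "  jane   DOE", "opened": "3 Jul 2021"}    # from one silo
billing_record = {"name": "Jane Doe", "opened": "2021-07-03"}    # from another

# Without normalization an automated join would treat these as two different
# customers; after it, they share the same (name, opened) key.
key_a = (normalize_name(crm_record["name"]), normalize_date(crm_record["opened"]))
key_b = (normalize_name(billing_record["name"]), normalize_date(billing_record["opened"]))
print(key_a == key_b)  # True
```

Even this toy example shows the limits of after-the-fact cleanup: a value like 03/07/2021 is ambiguous unless you know which silo’s convention produced it, which is exactly why weak constraints at the point of entry are so costly.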
Dirty data can remain hidden for years, which makes it even more difficult to trace and deal with when it finally surfaces. Unfortunately, 57% of businesses find out about dirty data when it’s reported by customers or prospects—a particularly poor way to track down and solve essential data issues.
Many organizations search for inconsistent and inaccurate data using manual processes because their data is too decentralized and too non-standard. These plans tend to fall into the same trap as the data—instead of consolidated planning, each department is responsible for its own data inaccuracies. While this may catch some instances, it also contributes to internal inconsistencies between department silos. The fix happens in one place but not in another, which just leads to more data problems.
Dirty data results in wasted resources, lost productivity, failed communication—both internal and external—and wasted marketing spending. In the US, it is estimated that 27% of revenue is wasted on inaccurate or incomplete customer and prospect data.
Productivity is impacted in several important areas. Data scientists spend around 60% of their time cleaning, normalizing and organizing data. Meanwhile, knowledge workers spend up to 50% of their time dealing with hidden or inaccurate data.
Dirty data lacks credibility, and that means end-users who rely on that data spend extra time confirming its accuracy, further reducing speed and productivity. Introducing yet another manual process leads to more inaccuracies and mounting inconsistencies as the number of dirty records grows.
In addition to the revenue loss, dirty data affects businesses in more insidious ways. Only 16% of business executives are confident in the accuracy of the data underlying their business decisions. Garbage in, garbage out—when you can’t rely on your own data, something needs to be done to increase data accuracy and reliability.
Worldwide, inaccurate data costs a company between 15% and 25% of revenue. With global banking revenues of over $2.2 trillion, this means that dirty data costs the global banking industry over $400 billion. Dirty data also creates a number of risks that are unique to the banking industry.
Inconsistent information across an organization’s data silos leads to transactional risks such as inaccurate or even fraudulent transactions. Fake and fraudulent accounts should be caught early by processes that detect and clean dirty data. When those processes fail, the bank is put at risk and its reputation is damaged.
With so much dirty data and so few executives trusting the data they are using, poor strategic decisions are bound to follow. You can’t pick the right path if you don’t know where you are. Dirty data can lead to tremendous operational risks.
The constantly evolving regulatory landscape also creates a heavy burden for data management. Compliance teams are under significant pressure to provide more information about their data, but when they don’t have clean data to work with, they are out of luck. The 2018 rollout of the MiFID II regulations has been a painful example of this, with faltering compliance and increasingly strict regulators causing pain for many European financial firms.
The most challenging problems in cleaning up dirty data are correcting invalid entries and eliminating duplicates. Careful error correction is needed to ensure not only that no data is lost while the consistency of existing valid data improves, but also that all of the metadata describing each correction is maintained alongside the integrated data itself.
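As a rough illustration of what keeping correction metadata alongside the data can look like, here is a minimal sketch that merges duplicate records and logs every change it makes. The record shape and the corrections envelope are assumptions made for the example, not a prescribed format.

```python
# Minimal sketch (illustrative record shape): merge duplicate records keyed on
# an account number, filling gaps from the duplicates and keeping a correction
# log next to the cleansed record instead of discarding it.
from datetime import datetime, timezone

def merge_duplicates(records, key_fields):
    """Group records by key, fill missing values from duplicates, log each change."""
    merged = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in merged:
            merged[key] = {"data": dict(rec), "corrections": []}
            continue
        target = merged[key]
        for field, value in rec.items():
            if not target["data"].get(field) and value:  # fill a gap from the duplicate
                target["corrections"].append({
                    "field": field,
                    "old": target["data"].get(field),
                    "new": value,
                    "source_record": rec.get("source", "unknown"),
                    "corrected_at": datetime.now(timezone.utc).isoformat(),
                })
                target["data"][field] = value
    return list(merged.values())

rows = [
    {"account": "A-1001", "email": "", "phone": "555-0100", "source": "crm"},
    {"account": "A-1001", "email": "jane@example.com", "phone": "", "source": "billing"},
]
for entry in merge_duplicates(rows, ["account"]):
    print(entry["data"])         # the cleansed, integrated record
    print(entry["corrections"])  # the metadata describing how it was corrected
```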
Once data has been cleansed, it needs to be maintained. After the initial process of cleaning dirty data, only new or changed data should need to be checked for validity and consistency. In all cases, from old to newly entered data, the lineage of the data must be recorded. This ensures its validity and trustworthiness.
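One way to keep re-validation incremental while preserving lineage is to fingerprint each record, skip anything unchanged since the last clean pass, and append an entry to a lineage log whenever a record is re-checked. The schema and the single validation rule below are illustrative assumptions, not a specific product’s behavior.

```python
# Minimal sketch (illustrative schema and rule): fingerprint each record so a
# recurring job only re-validates records that changed since the last clean
# pass, and append every check to a lineage log.
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(record: dict) -> str:
    """Stable content hash used to detect whether a record changed."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def validate(record: dict) -> bool:
    """Placeholder rule set; a real pipeline would apply many such checks."""
    return bool(record.get("email"))

def incremental_check(records, previous_fingerprints):
    current, lineage_log = {}, []
    for rec in records:
        fp = fingerprint(rec)
        current[rec["account"]] = fp
        if previous_fingerprints.get(rec["account"]) == fp:
            continue  # unchanged since the last run; no need to re-validate
        lineage_log.append({
            "account": rec["account"],
            "checked_at": datetime.now(timezone.utc).isoformat(),
            "valid": validate(rec),
            "fingerprint": fp,
        })
    return current, lineage_log

records = [{"account": "A-1001", "email": "jane@example.com"}]
fingerprints, log = incremental_check(records, previous_fingerprints={})
# Persist `fingerprints` for the next run; `log` becomes part of the data's lineage.
```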
Best practices for cleaning dirty data and for data governance include the following:
It is a lot, but it’s worth it. An organization that uses strong data governance in addition to data-cleansing practices can generate up to 70% more revenue.
The business impact of dirty data is staggering, but an individual organization can avoid the morass. Modern techniques and technology can minimize the impact of dirty data. Clean, reliable data makes the business more agile and responsive while cutting down on wasted efforts by data scientists and knowledge workers.
Your business might already be planning to tackle its dirty-data problems. In fact, 84% of businesses are planning to implement data quality solutions soon, but many of these solutions are segmented across departments in the enterprise. Moreover, many data quality initiatives won’t make the core changes needed inside the database to effect positive change where it is needed most. This will only lead to future problems with inconsistent data, exacerbating the current state as data proliferates. The effort needs to span the whole business and address shortcomings at their source: inside the database. An operational data hub, such as one built on top of MarkLogic®, can help your business get the right start on cleaning its dirty data.
Learn how MarkLogic’s Operational Data Hub framework can help you improve data governance and increase the quality of your data assets.
Ed Downs is responsible for customer solutions marketing at MarkLogic. He draws on his considerable experience, having delivered large-scale big data projects and operational and analytical solutions for public and private sector organizations, to drive awareness and accelerate adoption of the MarkLogic platform.