Gone Data: Is Data That Was Supposed to Be Deleted — Really Gone?

February 21, 2017 Data & AI, MarkLogic

The EU GDPR is surfacing a host of issues, and one of them is this: if a client requests data erasure, how can you be sure the data is gone?

It sounds easy, but customer data is stored all over the place. There may be pictures, social media accounts and videos, all of which you know to be linked to Jane or John Smith. So why are your searches not returning them?

With more and more systems layered on top of each other over the years, no single system holds all the data (or even knows about all of it), so there is no easy way to ensure these requests are fulfilled to the letter of the (international) law. This presents a tremendous data integration problem, and it brings us to a board meeting, happening right now, I am sure, where the Head of IT, a consultant or even a technology-savvy privacy lawyer has been brought in to discuss the problem:

CXO: So the next item on the agenda, EU GDPR. How are we going to respond to this?

IT PRO: Ahem, it’s difficult. The data we have is, well, a mess. There are three problems. The first, and this doesn’t affect just data privacy but everything we do, is that we have data on top of data. It is spread across multiple sites and multiple systems, and there is even some data that we are struggling to get a handle on.

Data Scattered Across Silos

When you think of Big Data, metadata, small data or any type of data-related role, what do you think that entails? Do you think that data scientists spend most of their time analyzing the data to get the best insights they can? Insights that can change the fortunes of a company, or form part of some great new innovation? This, I am sad to say, couldn’t be further from the truth. A recent study by CrowdFlower found that 60 percent of data scientists’ time is spent “wrangling” or “cleansing” the data to make it fit into their database. Oh, and 80 percent of those who responded to the survey also said this is the least enjoyable part of their role. That is a tough pill to swallow: you are asking these people to spend most of their time on the part of the job they enjoy least.

Let’s return to the scene.

CXO: Ok, so what do you think we need?

IT PRO: We need a central database where we can bring all the data into one place, link it together, and ensure that we can find every document or other asset connected to the person whose data we have been asked to erase.

Schema-on-Read Approach

This is the second problem: bringing your data together and organizing it into a central hub. That is great if your database can ingest all the data as is and handle a schema-on-read approach. However, because many IT departments work with structured data, they think that they should store all their data in a relational database or warehouse. Those systems definitely don’t let you load data as is. They require months and months of mapping and modeling. And if you want to include unstructured or semi-structured data (all the information in contracts, images from digital cameras or text messages), well, the job just got harder.
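To make the schema-on-read idea concrete, here is a minimal sketch in Python. It is not how any particular database implements the approach; the record shapes, the field names and the helper functions (RAW_STORE, NAME_PATHS, read_person, records_for) are all hypothetical and chosen only for illustration. The point is that records are stored exactly as they arrive, and the structure is applied when the data is read.

```python
import json

# Hypothetical raw records, kept exactly as three different source systems sent them.
RAW_STORE = [
    '{"customer": {"name": "Jane Smith"}, "order_id": 17}',
    '{"full_name": "Jane Smith", "photo": "img_0042.jpg"}',
    '{"sender": "John Smith", "text": "Running late"}',
]

# The read-time "schema": the places where each source may keep a person's name.
NAME_PATHS = [("customer", "name"), ("full_name",), ("sender",)]


def get_path(doc, path):
    """Follow a tuple of keys into nested dicts; return None if a key is missing."""
    for key in path:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def read_person(raw):
    """Parse one raw record at read time and return the person it refers to."""
    doc = json.loads(raw)
    for path in NAME_PATHS:
        value = get_path(doc, path)
        if value is not None:
            return value
    return None


def records_for(person):
    """Return every stored record that resolves to the given person."""
    return [json.loads(raw) for raw in RAW_STORE if read_person(raw) == person]


if __name__ == "__main__":
    for record in records_for("Jane Smith"):
        print(record)
```

Adding a fourth source system in this sketch means adding one entry to NAME_PATHS, not remodeling and reloading the data already stored.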

Once again back to the executive suites.

CXO: OK, so I get that we need something new. Why can’t we use our existing tools and vendors?

IT PRO: We could, but we would need training, development and time. All three come with a price tag.

CXO: Hmm, so how much will this cost?

We’ll leave this scene there. You see, the other database providers that he is referring to need the extended development time because they are trying to shoehorn unstructured data into a structured environment. All of this costs time and, more importantly, money. Both are under enormous pressure, especially when you consider that this legislation comes into effect next year. What is needed for today’s and tomorrow’s data legislation (including EU GDPR) is a database that can handle all of your data, including structured, text, JSON, XML, RDF triples, geospatial and large binaries.

Those of you with a keen eye will have noticed that our IT pro (or consultant, or savvy lawyer) mentioned three problems that EU GDPR throws up. Solving them puts you on a stronger business footing too!

So you have all your data in one place: structured, unstructured, everything. This is where the real problem lies, and it is our third problem. You have created a database with all your data, but how are you going to comply with a request to be forgotten? How do you know you have caught everything in your system that needs to be deleted?

Centralized Operational Data Hub

Well, if you have a centralized data hub that creates semantic associations (metadata) between entities and assets, you can run a search and find all assets associated with a specific individual. All of the data, whether structured or unstructured, should be semantically linked on ingest. That ensures that all of Jane or John Smith’s character-centric files (alphanumeric or otherwise) and any other files (pictures, videos, even social media references) are captured and can be deleted easily and in a timely manner.

Now when that deletion request comes through, you can be confident that you are in compliance.
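As a rough illustration of linking on ingest, the sketch below keeps a toy in-memory hub in Python. It is an assumption-laden sketch, not MarkLogic’s API: the DataHub class, the "relatesTo" predicate and the asset identifiers are all hypothetical. Each asset is stored together with (subject, predicate, object) links to the people it concerns, so an erasure request becomes a lookup over those links followed by a delete.

```python
class DataHub:
    """Toy in-memory hub: assets plus (asset_id, predicate, person_id) links."""

    def __init__(self):
        self.assets = {}      # asset_id -> content of any type
        self.triples = set()  # semantic links recorded at ingest time

    def ingest(self, asset_id, content, person_ids):
        """Store the asset as is and link it to every person it relates to."""
        self.assets[asset_id] = content
        for person_id in person_ids:
            self.triples.add((asset_id, "relatesTo", person_id))

    def assets_for(self, person_id):
        """Return the sorted ids of all assets linked to the given person."""
        return sorted(
            subject
            for subject, predicate, obj in self.triples
            if predicate == "relatesTo" and obj == person_id
        )

    def erase(self, person_id):
        """Delete every asset linked to the person, and all links to those assets."""
        doomed = self.assets_for(person_id)
        for asset_id in doomed:
            del self.assets[asset_id]
        self.triples = {t for t in self.triples if t[0] not in doomed}
        return doomed


if __name__ == "__main__":
    hub = DataHub()
    hub.ingest("record-1", {"name": "Jane Smith"}, ["jane"])
    hub.ingest("photo-1", b"\x89PNG...", ["jane"])
    hub.ingest("tweet-9", "Hello from John", ["john"])
    print(hub.erase("jane"))       # ids of the assets that were removed
    print(hub.assets_for("jane"))  # nothing should remain linked to Jane
```

One simplification to note: an asset linked to several people (a group photo, say) is deleted outright here. A real system would need a policy for shared assets, such as redaction instead of deletion.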

Is there such a system that will semantically link on ingest? Actually, yes. Allan Donald, senior product manager at the BBC, described the BBC’s development of its program metadata API, and why they chose (again) to partner with MarkLogic, in this article: http://www.bbc.co.uk/academy/technology/article/art20141013145843465

“After some abortive attempts to solve this problem in SQL, and a lengthy period of prototyping and testing alternatives with major database vendors, we settled on a NoSQL database from …MarkLogic. This had already been successfully used for the Olympics as part of the BBC’s Dynamic Semantic Publishing platform. Using [MarkLogic] we saw significant speed benefits. Some sample availability queries that took up to 20 seconds for SQL could be performed on NoSQL documents in around 20ms – a thousand times faster.”

EU GDPR is coming into effect in May 2018. Failure to comply or to have sufficient measures in place by this date could result in organisations being fined up to 4 percent of annual global turnover or €20 million (whichever is greater).

Take that in for a minute: 4 percent of your turnover. To put that into context, at the time of writing this article, the company at the bottom of the FTSE 250, if found to be non-compliant, could be fined £15,800,000. Or, put another way, 16% of their cash reserves.
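The penalty ceiling quoted above is simple arithmetic, and a short Python sketch makes the "whichever is greater" rule explicit. The function name and the example turnovers are hypothetical; amounts are whole euros, and currency conversion is ignored.

```python
FLAT_CEILING_EUR = 20_000_000  # the fixed EUR 20 million figure quoted above


def max_fine_eur(annual_global_turnover_eur):
    """Return the upper bound of the fine: 4 percent of turnover or the flat
    ceiling, whichever is greater. Uses integer arithmetic on whole euros."""
    four_percent = annual_global_turnover_eur * 4 // 100
    return max(four_percent, FLAT_CEILING_EUR)


if __name__ == "__main__":
    # Hypothetical turnovers: one below and one above the EUR 500 million
    # point at which 4 percent overtakes the flat ceiling.
    print(max_fine_eur(100_000_000))
    print(max_fine_eur(1_000_000_000))
```

For any turnover under €500 million, 4 percent is less than €20 million, so the flat figure is the operative ceiling for smaller organisations.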

Will you be ready? Will you be fast enough to ensure that you remain compliant? More importantly, will you ensure that you have the right database for the job?

For more information on this topic

EU GDPR: Beyond Compliance, a blog post that outlines the key issues Data Protection Officers face and how a 360-degree view of clients can help

Schema-on-Read vs Schema-on-Write, a blog post that defines the true strength of a NoSQL database: the schema-on-read approach, which allows you to load data as is and transform it later as you need it!

The Path To Compliance, a 45-minute webinar in which Christy Haragan joins Anastasia Olshanskaya to discuss the new data privacy rights individuals have and how they dramatically impact business, leaving you with a 5-step guide to EU GDPR compliance.

Philip Miller

Philip Miller serves as the Senior Product Marketing Manager for AI at Progress. He oversees the messaging and strategy for data and AI-related initiatives. A passionate writer, Philip frequently contributes to blogs and lends a hand in presenting and moderating product and community webinars. He is dedicated to advocating for customers and aims to drive innovation and improvement within the Progress AI Platform. Outside of his professional life, Philip is a devoted father of two daughters, a dog enthusiast (with a mini dachshund) and a lifelong learner, always eager to discover something new.