Beware of “Graft” on GDPR and CCPA

January 27, 2020 Data & AI, MarkLogic

Don’t be offended by the play on words. Graph databases are very powerful and not generally involved in bribery. In fact, given its ability to discover fraudulent activity through the relationships it captures, a graph database is quite good at uncovering transgressions such as payoffs and other forms of corruption.

With that caveat aside, let’s explore why a graph database should not be the ONLY data-management technology for capturing various 360 views—especially for customers—in the context of GDPR and the recently enacted California Consumer Privacy Act (CCPA).

As my colleague, David Gorbet, wrote in a recent SC Media article, California Consumer Privacy Act: Challenge and Opportunity, CCPA is

“considered the most comprehensive of any state privacy law, provides consumers with new rights, including a right to transparency about data collection, a right to be forgotten and a right to opt out of having their data sold.”

David goes on to discuss the importance of viewing data as an asset, inventorying it properly, centralizing governance policies and moving past point solutions.

Attempting to do all of this strictly with a graph database is not the right approach. As with highly normalized relational databases, collecting all there is to know about a customer and shredding it into a graph model is like taking apart one’s car and putting its thousands of pieces on shelves each time one enters their garage. Needless to say, the task of assembling the car for day-to-day use becomes expensive, tedious and unreliable (oops! forgot the brake liners).
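To make the analogy concrete, here is a minimal sketch in plain Python (no graph database involved) of what shredding and reassembly look like. The customer record, its field names and the shred/reassemble helpers are all hypothetical and exist only for illustration.

```python
# Hypothetical customer record; the field names are illustrative only.
customer = {
    "id": "c-100",
    "name": "Jane Doe",
    "email": "jane@example.com",
    "address": {"city": "San Carlos", "state": "CA"},
}


def shred(record, subject=None):
    """Flatten a nested record into (subject, predicate, object) triples."""
    subject = subject or record["id"]
    triples = []
    for key, value in record.items():
        if key == "id":
            continue
        if isinstance(value, dict):
            # Nested structures become their own subject, linked from the parent.
            child = f"{subject}/{key}"
            triples.append((subject, key, child))
            triples.extend(shred(value, child))
        else:
            triples.append((subject, key, value))
    return triples


def reassemble(triples, subject):
    """Rebuild the nested record by scanning the triples for a subject."""
    subjects = {s for s, _, _ in triples}
    record = {}
    for s, p, o in triples:
        if s != subject:
            continue
        # An object that is itself a subject must be rebuilt recursively.
        record[p] = reassemble(triples, o) if o in subjects else o
    return record


triples = shred(customer)
rebuilt = {"id": "c-100", **reassemble(triples, "c-100")}
print(len(triples), rebuilt == customer)
```

Even this four-field record turns into five separate facts, and getting the original back means walking every one of them. Miss a link during that walk and a piece of the customer is silently left on the shelf.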

A better approach for meeting regulatory requirements and reducing the risk of non-compliance is to implement a multi-model strategy. Such an approach incorporates document, relational and graph structures along with their respective query mechanisms, i.e., NoSQL document search, SQL relational access and SPARQL semantic/graph access. In fact, having the ability to leverage all of these access mechanisms in a single, complex query across all three data models simultaneously is a powerful feature for GDPR/CCPA solutions.

As described in David’s article and Companies: Lean into Consumer Privacy to Win (by another colleague, Ken Krupa),

“It’s difficult to ensure trust and accountability in data when data is sourced from different silos and applied to many different use cases.”

Think of all the touchpoints an enterprise has with its consumers and the form in which those interactions are captured. For example:

  • Orders for purchases are likely captured in several relational databases of transactions spanning the enterprise.
  • Profile information is likely kept in several document databases.
  • Householding information, i.e., relationships to a spouse, children or friend, could be kept in a graph database.

Information is naturally kept in table form for transactions, document form for profile information and graph form for relationships that spider out from consumers to spouses, friends and other associations.

In a multi-model approach, pulling this information together in response to a customer request to “forget me” would be fulfilled first by performing a powerful document search. The documents (e.g., XML, JSON or free text) would contain much of the sought-after information and link to other information via graph structures.

Returning to the “car shredding/assembly” analogy, this would be like keeping the engine, transmission, wheels and body intact so as to retain their integrity as composite entities, but retaining the ability to reassemble them with “Transformer”-like agility (and coolness I might add) into a complete view of a car … or customer in our case.

A query that simultaneously performs a NoSQL search across documents, an SQL query against relational rows and a SPARQL query against semantic graphs gets all the data more reliably, which greatly reduces the risk of non-compliance. Also, because it filters first with search, it avoids the massive compute infrastructure that would be needed to rejoin customer data, at scale, if everything were stored in a graph model.
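The search-first pattern can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not MarkLogic's API: the documents are plain dictionaries, the relational side is an in-memory SQLite table, the graph is a list of triples filtered in Python rather than queried with SPARQL, and every record and name (including find_customer_data) is hypothetical.

```python
import sqlite3

# Hypothetical sample data; none of these records come from a real system.
DOCUMENTS = [
    {"uri": "/profiles/c-100.json", "customer_id": "c-100",
     "text": "Jane Doe, jane@example.com, prefers email contact"},
    {"uri": "/profiles/c-200.json", "customer_id": "c-200",
     "text": "John Roe, john@example.com, prefers phone contact"},
]

# (subject, predicate, object) triples standing in for a semantic graph.
TRIPLES = [
    ("c-100", "spouseOf", "c-200"),
    ("c-100", "hasProfile", "/profiles/c-100.json"),
    ("c-200", "hasProfile", "/profiles/c-200.json"),
]


def build_orders_db():
    """Create an in-memory relational table of orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, item TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("o-1", "c-100", "laptop"), ("o-2", "c-100", "mouse"), ("o-3", "c-200", "desk")],
    )
    return conn


def find_customer_data(term, conn):
    """Search documents first, then follow the matched IDs into rows and triples."""
    term = term.lower()
    # Step 1: document search narrows the problem to a handful of customer IDs.
    docs = [d for d in DOCUMENTS if term in d["text"].lower()]
    ids = sorted({d["customer_id"] for d in docs})
    if not ids:
        return {"documents": [], "orders": [], "relationships": []}
    # Step 2: relational lookup restricted to those IDs.
    placeholders = ",".join("?" for _ in ids)
    orders = conn.execute(
        f"SELECT order_id FROM orders WHERE customer_id IN ({placeholders}) "
        "ORDER BY order_id",
        ids,
    ).fetchall()
    # Step 3: graph lookup restricted to the same IDs.
    relationships = [t for t in TRIPLES if t[0] in ids or t[2] in ids]
    return {
        "documents": [d["uri"] for d in docs],
        "orders": [row[0] for row in orders],
        "relationships": relationships,
    }


conn = build_orders_db()
result = find_customer_data("jane@example.com", conn)
print(result)
```

The point of the ordering is that the relational and graph steps only ever touch the IDs the search step returned. In this sketch the three stores are stitched together by hand in application code, which is exactly the integration work a multi-model platform is meant to absorb.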

One final point. It’s possible to pull together the recommended solution with readily available technology components such as an open source NoSQL document database, relational database, search engine and graph database. But, integrating all of these fast-moving pieces into a reliable, enterprise-ready platform that accounts for security, data consistency, ACID transactions and overall governance is a formidable challenge.

MarkLogic’s Data Hub Platform addresses this challenge. As a multi-model database with NoSQL search, SQL access and SPARQL query features, it relieves enterprises of the burden of expending valuable technical resources on integration tasks and allows them to focus on higher-value business activities. MarkLogic’s Data Hub is a platform that can help an enterprise optimize resources, reduce risk and remain compliant with GDPR and CCPA regulations.

Michael Malgeri

Michael Malgeri is a Principal Technologist with MarkLogic. He works with companies to match their business requirements with MarkLogic’s enterprise NoSQL database and semantic features. He helps organizations reduce costs, automate processes, find new opportunities and create applications that bring high value to businesses and their customers. Michael focuses on the media and entertainment industry, where content providers, distributors and related companies are seeking to leverage the power of data in order to capture new opportunities driven by expanding global information consumption.

Michael holds Master’s Degrees in Computer Science, Business and Mechanical Engineering. He’s been a Certified Project Management Professional since 2011.