Protecting Against Linkage Attacks that Use ‘Anonymous Data’

November 02, 2017 Data & AI, MarkLogic

Join us at the upcoming IAPP Europe Data Protection Congress 2017 on November 8-9 in Brussels where we will be showcasing MarkLogic’s anonymization and redaction capabilities.

It is well known that ‘anonymous data’ often isn’t that anonymous. There are a few well-publicized examples of ‘anonymous’ datasets being released that were quickly de-anonymized:

  • Netflix published movie ratings for roughly 500,000 customers in 2006, and researchers showed they could de-anonymize records in the data using a few additional inputs from IMDb
  • Using 1990 U.S. census data, Latanya Sweeney at Carnegie Mellon University showed that 87 percent of the U.S. population could be uniquely identified using only ZIP code, gender, and date of birth
  • AOL published search data for 650,000 users in 2006, thinking it was enough to replace each user’s name with a unique ID. Unfortunately, many users search for their own names. After the public outcry, AOL’s CTO resigned and employees responsible for the release were dismissed

Incidents like these made organizations far more cautious about releasing ‘anonymous’ data to the public. But the problem with anonymous data lives on within organizations.

In what is known among cybersecurity pros as a linkage attack, adversaries collect auxiliary information about an individual from multiple data sources and then combine that data to form a complete picture of their target, often revealing the individual’s personally identifiable information.

The common approach to mitigate linkage attacks is to anonymize data before exporting it by removing personally identifiable information (PII) such as ID numbers and phone numbers. Unfortunately, this is not enough.

A better approach to protect against linkage attacks is to centralize sharing, share less raw data, and, when you do need to share data, create layers of abstraction or generalization by redacting parts of it.

Linkage Attack Example

How does a linkage attack work? Let me provide an example from the healthcare industry. Imagine that a care provider shares anonymized data with external researchers about medical conditions. The export contains “Gender,” “Postal code,” “Date of birth,” and “Description.” An attacker could easily use a public voter list that contains “Name,” “Gender,” “Postal code,” and “Date of birth” to cross-reference the patients.
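To make the cross-referencing concrete, here is a minimal sketch in Python. All records, names, and field names below are invented for illustration. The sketch treats a patient as re-identified only when exactly one voter shares the same gender, postal code, and date of birth.

```python
# Hypothetical data, invented purely for illustration.
export = [
    {"gender": "F", "postal_code": "94105", "date_of_birth": "1975-03-14",
     "description": "Type 2 diabetes"},
    {"gender": "M", "postal_code": "94105", "date_of_birth": "1982-11-02",
     "description": "Asthma"},
]

voter_list = [
    {"name": "Alice Example", "gender": "F", "postal_code": "94105",
     "date_of_birth": "1975-03-14"},
    {"name": "Bob Example", "gender": "M", "postal_code": "94110",
     "date_of_birth": "1982-11-02"},
]

# Fields that are not PII on their own but identify people in combination.
QUASI_IDENTIFIERS = ("gender", "postal_code", "date_of_birth")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(export_rows, voters):
    """Return (name, description) pairs where exactly one voter matches."""
    names_by_key = {}
    for voter in voters:
        names_by_key.setdefault(quasi_key(voter), []).append(voter["name"])

    reidentified = []
    for row in export_rows:
        names = names_by_key.get(quasi_key(row), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            reidentified.append((names[0], row["description"]))
    return reidentified


matches = link(export, voter_list)
```

Here the first patient is matched to a name because the three shared fields are identical in both datasets. The second is not, because the postal codes differ.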

In practice, the more you preserve the analytical utility of the dataset, such as keeping “Gender” and “Postal code” information in the export, the more you are susceptible to linkage attacks.
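One way to see this trade-off is to count how many records are the only ones with their particular combination of fields. The sketch below uses made-up records and a hypothetical helper, unique_fraction, to show the share of unique records growing as more fields are kept in the export.

```python
from collections import Counter

# Hypothetical records, invented for illustration.
records = [
    {"gender": "F", "postal_code": "94105", "birth_year": 1975},
    {"gender": "F", "postal_code": "94105", "birth_year": 1980},
    {"gender": "F", "postal_code": "94110", "birth_year": 1975},
    {"gender": "M", "postal_code": "94105", "birth_year": 1975},
    {"gender": "M", "postal_code": "94105", "birth_year": 1990},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    unique = sum(1 for row in rows
                 if counts[tuple(row[f] for f in fields)] == 1)
    return unique / len(rows)


only_gender = unique_fraction(records, ("gender",))
gender_and_postal = unique_fraction(records, ("gender", "postal_code"))
all_three = unique_fraction(records, ("gender", "postal_code", "birth_year"))
```

A record that is unique on the exported fields is the easiest one to link to an outside dataset, so a rising unique fraction is a rough signal of rising linkage risk.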

Why Partial Anonymization Is Not Enough

Many people think that if they just remove the PII from their data, it is okay to export. But it’s not.

For example, let’s say you export credit card transactions after removing all PII. What is left is anonymized data that includes the user’s primary key, transaction date, and value. You give this export to a data analyst to calculate the average customer spend, find common behavior, etc.

However, the data analyst has another idea in mind. He also has access to the call center database, which does have PII. The call center database has information about which products the customer purchased, a history of complaints, questions, disputes, etc.

Given a sufficiently large dataset, the analyst can find a customer in the credit card transactions dataset. While transactions may not uniquely identify a customer, the analyst can easily combine the transaction data with complaints, questions, and disputes to form the complete picture. For example, if a customer calls to complain about a duplicate charge on a particular day, the analyst can use this information to search the transactions and find potential matches. With time, he could uniquely identify large numbers of customers.
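The duplicate-charge scenario can be sketched in a few lines of Python. The data and field names are hypothetical. The idea is simply to look for a customer key that has the same amount charged more than once on the day of the complaint.

```python
from collections import Counter
from datetime import date

# Hypothetical "anonymized" export: no names, only a primary key.
transactions = [
    {"customer_key": "k-001", "date": date(2017, 5, 3), "amount_cents": 4250},
    {"customer_key": "k-001", "date": date(2017, 5, 3), "amount_cents": 4250},
    {"customer_key": "k-002", "date": date(2017, 5, 3), "amount_cents": 1999},
    {"customer_key": "k-003", "date": date(2017, 5, 4), "amount_cents": 4250},
]

# Hypothetical call center record, which does contain PII.
complaint = {"name": "Alice Example", "date": date(2017, 5, 3),
             "issue": "duplicate charge"}


def duplicate_charge_candidates(rows, complaint_date):
    """Customer keys with a repeated identical charge on the given day."""
    counts = Counter(
        (row["customer_key"], row["amount_cents"])
        for row in rows
        if row["date"] == complaint_date
    )
    return sorted({key for (key, _amount), n in counts.items() if n >= 2})


candidates = duplicate_charge_candidates(transactions, complaint["date"])
```

When the candidate list contains a single key, the analyst has tied a named caller to an "anonymous" transaction history. When it contains several, further complaints and disputes from the same caller narrow it down over time.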

Here, we described a complex attack by an internal adversary for three reasons:

  • First, insiders account for a significant share of breaches, as seen in Verizon’s Data Breach Investigations Report
  • Second, mandates and regulation compel companies to establish data governance internally and externally
  • Third, even heavily anonymized data can be used for linkage attacks. Any public information, such as a voter list, forum comments, or reviews in websites can be used

Protecting Against Linkage Attacks

To better protect the data exported against linkage attacks, we recommend that you centralize sharing, share less, create abstractions, and use the right protection.

Centralize Sharing

It’s really hard to secure data across multiple data silos. As we have seen in the aforementioned examples, insiders conducting linkage attacks have access to an assortment of databases. These database silos all have different access controls and auditing, not to mention various data formats. These silos prevent implementation of a consistent policy to protect user information and privacy.

The best approach to address this problem is to use a centralized database to govern and secure the data. This approach makes securing applications easier and faster. Why rely on heuristic, probabilistic approaches to protection against re-identification attacks when you can have comprehensive auditing and policy execution, consistently implemented across your entire organization, and exposed via a rich set of APIs to access aggregate information?

The best database for centralizing all of your data is a multi-model database like MarkLogic. MarkLogic is built to flexibly store and manage all of an organization’s data, and enables consistent data governance across disparate data stores. MarkLogic has a lot of advanced features for securing data such as Document and Element Level Security, and all security can be controlled from a central location that serves different purposes and applies different access controls.

Share Less

Our second recommendation is to bring the data analysis to the data. In other words, “give me your code.”

Most business users are looking for summaries (or aggregates) of information–not the data itself. It’s better not to share raw data.

In MarkLogic, you can use amped functions, which run internally at a higher privilege and can do things that the user cannot do directly. This lets you calculate aggregates without giving access to the raw document data. For example, you can use an amped function to calculate what customers spent per Postal Code, while the user has no access to individual records.

This is a terrific approach to protect against linkage attacks. The challenge is that you need to know your questions a priori in order to create the functions. Therefore, this is a great approach for a report or portal that displays aggregates and calculations.
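As a rough illustration of the "share the aggregate, not the records" idea, here is a plain Python sketch. It is not MarkLogic's amp mechanism, and Python itself cannot stop a caller from reading the list. In a real deployment the database's privileges would enforce that restriction. The function name and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records. In practice these would live in the database
# and be readable only by the privileged function, not by the caller.
_RAW_TRANSACTIONS = [
    {"customer_key": "k-001", "postal_code": "94105", "amount_cents": 4250},
    {"customer_key": "k-002", "postal_code": "94105", "amount_cents": 1999},
    {"customer_key": "k-003", "postal_code": "94110", "amount_cents": 1000},
]


def spend_per_postal_code():
    """Return total spend per postal code, never the individual records."""
    totals = defaultdict(int)
    for row in _RAW_TRANSACTIONS:
        totals[row["postal_code"]] += row["amount_cents"]
    return dict(totals)


report = spend_per_postal_code()
```

The report carries no customer keys, dates, or per-transaction values, so there is much less left to link against another dataset.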

Create an Abstraction

If you need to share data in its raw format with data scientists, consider adding a layer of abstraction or generalization.

Do you really need to share the full “Date of birth”? Or just “Year of birth”?

Do you really need the full “Postal Code”? Or, would “County” do?

For example, in MarkLogic you can use Redaction to mask the “day” and “month” out of “Date of birth.” You can use a dictionary to replace “Postal code” with “County.” Or, replace “Age” with “Age range.” Just keep in mind that although bigger abstractions provide more security, they also result in slightly less precise analytics.
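The three generalizations just mentioned can be sketched in plain Python. This is not MarkLogic's Redaction API. The lookup dictionary, field names, and ten-year bucket width are assumptions made for the example.

```python
# Hypothetical dictionary mapping postal codes to counties.
POSTAL_CODE_TO_COUNTY = {
    "94105": "San Francisco",
    "94301": "Santa Clara",
}


def age_range(age, width=10):
    """Bucket an exact age into a range such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize(record):
    """Return a copy of the record with coarser quasi-identifiers."""
    return {
        "gender": record["gender"],
        # Keep only the year from an ISO date such as '1975-03-14'.
        "year_of_birth": record["date_of_birth"][:4],
        "county": POSTAL_CODE_TO_COUNTY.get(record["postal_code"], "Unknown"),
        "age_range": age_range(record["age"]),
        "description": record["description"],
    }


patient = {"gender": "F", "postal_code": "94105",
           "date_of_birth": "1975-03-14", "age": 42,
           "description": "Type 2 diabetes"}
shared = generalize(patient)
```

Each generalized field now describes a larger group of people, which is what makes a unique match against an outside dataset less likely.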

Don’t Rely on the Wrong Protection

Oftentimes, when I talk to customers about this, they suggest protecting against linkage attacks using format-preserving encryption, homomorphic encryption, perturbation, and salting.

Not so fast!

These technologies protect against attacks such as dictionary, rainbow, and brute force attacks but not against linkage attacks. Linkage attacks don’t use encrypted data, so those approaches don’t work. Linkage attacks are done with data that is left preserved for analysis. If you are also concerned about dictionary and rainbow attacks against your data, MarkLogic can also protect you.
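A short sketch shows why. Below, a keyed hash (HMAC-SHA256, standing in here for any salted masking scheme) replaces the name with an opaque token. The secret and the record are hypothetical. The fields kept for analysis are untouched, so they can still be joined against another dataset.

```python
import hashlib
import hmac

# Hypothetical secret used to key the hash.
SECRET_SALT = b"hypothetical-secret"

QUASI_IDENTIFIERS = ("gender", "postal_code", "date_of_birth")


def pseudonymize(record):
    """Replace the name with a keyed-hash token; leave other fields as-is."""
    token = hmac.new(SECRET_SALT, record["name"].encode("utf-8"),
                     hashlib.sha256).hexdigest()
    masked = {k: v for k, v in record.items() if k != "name"}
    masked["token"] = token
    return masked


def quasi_key(record):
    """Return the tuple of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


original = {"name": "Alice Example", "gender": "F",
            "postal_code": "94105", "date_of_birth": "1975-03-14"}
masked = pseudonymize(original)

# The name is gone, but the linkable fields are exactly the same.
still_linkable = quasi_key(masked) == quasi_key(original)
```

The token resists dictionary and brute-force guessing of the name, but an attacker performing a linkage attack never needs to reverse it.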

MarkLogic has advanced Encryption at Rest, which has multiple low-level encryption keys to minimize the impact of any breach, and multiple salting methods on redaction, to maximize exported data entropy.

Conclusion

Linkage attacks can be simple or very sophisticated. Protecting against them may involve simple forms of redaction, more sophisticated abstraction, or full computation at the data layer.

MarkLogic provides a set of capabilities to help you ensure that your data is safe, in a central location, and that you still can use it for analytics, operations, and business reporting.

MarkLogic helps you secure and govern your data:

  • Auditing across all data stores
  • Redaction of data export to create anonymous data
  • Granular Access Controls of information and of management capabilities
  • APIs that enable access to aggregate data but not access to raw data
  • Data governance with policy directly tied to the data

To learn more, download the white paper, Developing Secure Applications on MarkLogic. For a quick summary, check out our Element Level Security and Redaction Datasheet.

Caio Milani

Caio Milani is Director of Product Management at MarkLogic responsible for various aspects of the product including infrastructure, operations, security, cloud and performance. Prior to joining MarkLogic, he held product management roles at EMC and Symantec where he was responsible for storage, high availability and management products.
Caio holds a BSEE from the University of Sao Paulo and a full-time MBA Degree from the University of California, Berkeley.