How to Find and Control Access to PII

August 22, 2017 Data & AI, MarkLogic

How do you find and control access to Personally Identifiable Information (PII) and sensitive information in a MarkLogic database? You may be complying with regulation, internal policies, or simply trying to reduce risk.

In many situations, it may be apparent which elements of the document or JSON properties contain PII. For instance, an element called SSN or credit card number clearly indicates that this is sensitive information. However, what if you have too many data sources and unknown element names to describe an SSN? For example, you are implementing an Employee 360 application that ingests data from multiple systems. One system describes SSN as id, another uses SSN as primary key, another has SSN within text notes, etc. Therefore, our first step is finding the path to information that we want to protect.

Finding PII

Leverage the Indexes

MarkLogic has a powerful Universal Index that can help us find information in documents. If you can grab a list of known SSNs in the database, for instance a list from HR of some SSNs, you can easily use the number as input to word queries. Here is one example of a query searching for some SSNs.

let $pii := ("567-90-6071", "390-32-7638", "728-48-1630")
let $docs:= cts:search(fn:doc(), cts:word-query($pii))
return fn:distinct-values(xdmp:path($docs//text()[fn:matches(fn:string(.), $pii )]/..))

This search works across XML and JSON, but the output is slightly different. Let’s say for instance that the function returns the following /employee/user/text("ID"), /id/number/text(), /employee_table/key/text("primary"). This tells me that I have Social Security numbers across my documents under the following paths:/employee/user/ID, /id/number, and /employee_table/key/primary.

This is a fast and reliable way of finding the PII. However, what if I don’t know which values are classified as PII up-front?

Pattern Matching: Brute Force Method

For those cases, we will use two different techniques: pattern matching and sampling. Most PII and sensitive information follow a pattern such as a well-structured Social Security number. Therefore, you can use MarkLogic’s built-in capabilities to search for patterns. In the function below we are searching for an SSN pattern using fn:matches and returning all the paths using xdmp:path that match the pattern.

let $pattern :="^(d{3})([-.s]?)(d{2})2(d{4})$" 
xdmp:path(fn:collection("Employees")//*[fn:matches(fn:string(.), $pattern])

CAUTION: STOP, don’t try it! Notice that we are scanning all documents in the collection and the entire document structure using *//. This pattern matching will not leverage the indexes and is a brute force search. Consequently, this would take a long time if you do in a large database. This certainly would be okay if you are running against a small staging or QA database or if you need to guarantee that you find all occurrences of Social Security numbers in the production database in order to protect them.

Scoping and Sampling: Elegant and Better Than Brute Force

Ok, we don’t want to scan the entire database! Let’s apply scoping and sampling techniques to reduce the number of documents that we search. First, begin by scoping the documents using collections. If you have multiple types of documents that represent logical entities in the database, focus on the ones that you know may contain PII. Second, use sampling.

The code below randomly samples 15% of the documents in the “Employee” collection and then searches for the SSN pattern, returning distinct paths

let $x := .15  (: 15% :)
let $results := cts:search(fn:collection("Employee"), cts:true-query(), "score-random")
let $estimate := xdmp:estimate(cts:search(fn:collection("Employee"), cts:true-query()))
let $docs := $results[1 to fn:ceiling($estimate * $x)]
let $pattern :="^(d{3})([-.s]?)(d{2})2(d{4})$"
return fn:distinct-values(xdmp:path($docs//text()[fn:matches(fn:string(.), $pattern )]))

Mechanisms to Secure PII

Now that we have found the paths to the PII, you can use MarkLogic’s document level security, Element Level security and Redaction to control access to it.

Document-Level Security

Document level security controls access to entire documents based on user roles. If you haven’t set document level permissions during ingestion, which is the preferred method, you can use the following function to do it on existing documents. In this example, only Employees can read the documents that match the collection search.

xdmp.documentAddPermissions(fn.collection(“Employee”), [xdmp.permission("Employee_role", "read")])

Element-Level Security

Element Level Security allows you to control access to paths in the document on real-time search, queries and updates based on user roles. The following Element Level security functions will ensure that only users with the HR_read role will be able to search or read Social Security numbers:

declareUpdate();
var sec = require("/MarkLogic/security.xqy");
const perm = xdmp.permission("HR_read", "read", "element");
let paths = ["//employee/user/ID", "//id/number","//employee_table/key/primary"];

for (const path of paths) {
  sec.protectPath(path, null, perm);
}

Note: it is important to highlight that the XPath expression (//employee/user/ID) allows some flexibility in the document structure because of the // at the beginning. If your document structure changes to /data/employeedata/employee/user/ID the XPath expression will still match and protect the SSN.

Redaction

Redaction allows you to hide or mask information while you export it. The following rule, once applied using Redaction, will generate a fully random SSN when you export data, for instance to a QA or development instance. (one rule and path shown for brevity)

<rdt:rule xml:lang="zxx"
    xmlns:rdt="http://marklogic.com/xdmp/redaction">
  <rdt:description>Mask SSNs</rdt:description>
  <rdt:path>//employee/user/ID</rdt:path>
  <rdt:method>
    <rdt:function>redact-us-ssn</rdt:function>
  </rdt:method>
  <rdt:options>
    <rdt:level>full-random</rdt:level>
  </rdt:options>
</rdt:rule>

Redaction rules are simply documents that are added to a collection in the Schemas database. You call mlcp (MarkLogic Content Pump) to export data and use mlcp –redact CollectionName to apply all rules in a collection to the data as it is exported. Redaction also leverages XPath and has the same flexibility in the document structure.

Verify It’s All Done

Once you set document and element level security, you can test to ensure that only the correct users have access to the PII. The following code will search for a particular SSN using different users. In our scenario, only HR_reader (who has HR-read role) will find a match.

'use strict';
/** Invokes any function as a specific user */
const asUser = (fn, user) =>
  xdmp.invokeFunction(fn, { userId: xdmp.user(user) });

/** The actual function that does the work. */
function _readDocs(pii) {

  let docs= cts.search(cts.wordQuery(pii));
  var paths=[];

  for (var doc of docs) {
    var matched = doc.xpath("//text()");
    for (var node of matched) {
      if (fn.matches(fn.string(node), pii)) {
        paths.push(xdmp.path(node));
      }
    }
  }
  fn.distinctValues(xdmp.arrayValues(paths));
}

/** A new synthetic function that combines the actual function and the user switching */
const readDocs = (pii, user) =>
  asUser(
    (/* zero params*/) => 
    _readDocs(pii), user
  );

const users = ['Employee_reader', 'HR_reader', 'Contractor_reader'];
const pii ="898-31-4408";
const out = [];
let i = 0;
for (const result of users.map(user => readDocs(pii, user))) {
  out.push(`------------ ${users[i++]} ------------`);
  out.push(result);
}
out;

A Better Database to Find and Control PII

We learned how MarkLogic not only helps you find PII in your documents but also allows real-time access control access to PII, both at the document level and sub-document level, using element level security. Besides real-time control, MarkLogic also helps you to create data exports that are safe for developers or data scientists to use, as you can easily hide, mask or create sample data to replace PII.

Additional Resources

Caio Milani

Caio Milani is Director of Product Management at MarkLogic responsible for various aspects of the product including infrastructure, operations, security, cloud and performance. Prior to joining MarkLogic, he held product management roles at EMC and Symantec where he was responsible for storage, high availability and management products.
Caio holds a BSEE from the University of Sao Paulo and a full-time MBA Degree from the University of California, Berkeley.