Unifying Data, Metadata, and Meaning

September 15, 2022 Data & AI, MarkLogic, Semaphore

If you’ve ever taken a historical look at familiar inventions and how they came to be, you’ll notice that they’re mostly built from ideas and concepts that were around at the time, but put to work in new ways.

There was a problem at hand, and someone came up with a novel way of solving it.

Our collective problem at hand is a natural outcome of our speedy transition to a digital society.

Simply put, we are all drowning in data.

Anyone who thinks they’re ahead of the curve may not yet appreciate the enormity of the challenge at hand. If not today, then soon.

Our ability to interpret and act on data isn’t keeping up with what’s collectively coming at us.

Not only is the data moving fast; our understanding of it moves fast as well.

Ideally, we’d be agile with data: able to go from data to knowledge to insight to action as fast as possible.

While we’ve figured out great ways to share massive amounts of data, we haven’t figured out great ways of sharing what we know about it.

That requires formalized definitions, meanings, and interpretations – a specialized language – about the data we care about.

It turns out that the pieces to solve this problem are already at hand – and already being put to work in a wide variety of compelling, real-world environments.

While many of the components might be familiar to some, they are now being used in new ways to solve these problems, and more.

Let’s Start With Data

A quick history of databases might read something like: indexed, relational, specialized, and then multi-model. Multi-model as a category is appealing here because it uses metadata to represent (materialize) data just about any way you’d like to see it: SQL tables, flat files, graphs, key-value, etc.

Same data, many perspectives.
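To make that idea concrete, here is a minimal sketch in plain Python, not any particular database’s API, of one stored record being materialized as a document, a row, and graph triples, with a little metadata (such as column order) deciding the shape. The record and field names are hypothetical.

    # One stored record; the views below are projections of it.
    record = {
        "id": "cust-001",
        "name": "Ada Lovelace",
        "email": "ada@example.com",
    }

    def as_document(rec):
        # Document view: the record as-is.
        return dict(rec)

    def as_row(rec, columns=("id", "name", "email")):
        # Relational view: a tuple ordered by column metadata.
        return tuple(rec[c] for c in columns)

    def as_triples(rec):
        # Graph view: subject / predicate / object facts.
        return [(rec["id"], field, value)
                for field, value in rec.items() if field != "id"]

    print(as_document(record))   # {'id': 'cust-001', 'name': 'Ada Lovelace', ...}
    print(as_row(record))        # ('cust-001', 'Ada Lovelace', 'ada@example.com')
    print(as_triples(record))    # [('cust-001', 'name', 'Ada Lovelace'), ...]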

That flexibility makes multi-model appealing for three patterns today: applications, platforms, and fabrics.

Applications meet specific needs for specific users (short timeframe); platforms meet shared needs for aligned users (moderate timeframe); and enterprise fabrics are intended to meet all potential needs for all potential users, internal or external (much longer timeframe).

Back to our desire to share our encoded knowledge about data: while all three multi-model database patterns are useful, the “enterprise fabric” pattern is clearly both the most difficult to achieve and the most compelling from an outcome perspective.

Ideally, we’d use a multi-model database that could support and integrate all three patterns as needed during an adoption phase in a larger organization. And there are many examples of larger organizations doing just this by standardizing on a single multi-model database technology for all three patterns.

But How Do You Create Metadata in the First Place?

Not surprisingly, the hardest part turns out to be creating and improving the metadata that is used to describe the data. Simple labeling isn’t too hard: where did this data come from, when did we get it, agreed fields and formats, and so on.

But, for example, how do you determine something is PII – personally identifiable information?

When data is identified as such, it should trigger a set of enforced rules for its handling. Also, failing to identify PII creates avoidable risk.

Making matters more interesting: the rules, definitions, and interpretations surrounding PII themselves change, often rapidly. Put differently, what we know about PII and what it means is always a moving target.

If handling PII consistently and uniformly is very important to you, how would you ensure your current knowledge about handling data with potential PII is used consistently throughout the entire organization and its ecosystem of partners?

  • You would first have to encode a set of rules for how PII should be identified in any form of data you are responsible for, keeping those rules updated as data sources and interpretations change.
  • Next, you would define a set of rules for the handling of data once it has been identified as PII. Some uses are OK, and others are not. Those rules change as well.
  • Most importantly, you would have to enforce those rules against any and all uses of the data you are responsible for, and be able to prove that in an audit context. How people want to use data may change, as may audit requirements. A minimal sketch of all three steps follows below.
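Here is that sketch in plain Python, not any particular governance product. The patterns, purposes, and field names are hypothetical; real rules are far richer and change over time.

    import re

    # Step 1: encode rules for identifying PII (hypothetical patterns).
    PII_RULES = {
        "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def classify(value):
        # Return the PII categories a value matches, if any.
        return {name for name, pattern in PII_RULES.items()
                if pattern.search(str(value))}

    # Step 2: encode rules for handling data once it is identified as PII.
    ALLOWED_USES = {"billing", "support"}

    # Step 3: enforce the rules on every use, and keep an audit trail.
    audit_log = []

    def use_value(field, value, purpose):
        categories = classify(value)
        allowed = not categories or purpose in ALLOWED_USES
        audit_log.append({"field": field, "pii": sorted(categories),
                          "purpose": purpose, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{field} looks like PII {sorted(categories)}; "
                                  f"use for '{purpose}' is not permitted")
        return value

    use_value("contact", "ada@example.com", "billing")  # permitted, and logged
    # use_value("contact", "ada@example.com", "marketing-export")  # would raise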

This three-part problem shows up in a surprising number of situations, with PII just being one illustrative example:

  1. How do you encode your knowledge about the data?
  2. How do you use that knowledge to distinguish important data from the rest and handle it accordingly?
  3. Most importantly, how do you ensure that the data and the encoded knowledge about the data is used uniformly everywhere?

And how do you do this in an agile, trusted way?

How We Encode Knowledge About Data Today

There is a wide variety of ways we encode our knowledge about data, ranging from urban folklore to precise knowledge graphs.

In between we’ll find familiar artifacts such as researcher notebooks, data dictionaries, glossaries, ontologies, metadata managers, and the like.

A better way is to use semantic knowledge graphs (SKGs) to encode our knowledge of data. SKGs are a handy way of representing deep, specialized meanings and interpretations of facts, digital or otherwise.

SKGs have become de rigueur in knowledge and metadata management disciplines because they are a rich, flexible representation that readily encapsulates and extends the artifacts above.
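As a rough illustration of how an SKG encodes knowledge about data (here using the open-source rdflib library; the vocabulary and URIs are invented for the example), a few triples can state that a customer email field is a contact detail and that contact details are PII, and a query can then discover that relationship:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/meta/")   # hypothetical vocabulary
    g = Graph()

    # Knowledge about the data, not the data itself.
    g.add((EX.customer_email, RDFS.label, Literal("Customer email address")))
    g.add((EX.customer_email, EX.isCategorizedAs, EX.ContactDetail))
    g.add((EX.ContactDetail, RDFS.subClassOf, EX.PersonallyIdentifiableInformation))

    # Which data elements fall, directly or transitively, under PII?
    results = g.query("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX ex:   <http://example.org/meta/>
        SELECT ?element WHERE {
            ?element ex:isCategorizedAs ?category .
            ?category rdfs:subClassOf* ex:PersonallyIdentifiableInformation .
        }
    """)
    for row in results:
        print(row.element)   # http://example.org/meta/customer_email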

However, none of these things manage source data; they manage various encoded descriptions of source data. They are almost always divorced from the source data itself. Nor are they usually intended to have software evaluate data and make decisions about it.

To do that, metadata must be created from the data at hand.

How We Create Metadata Today

To interpret any form of data, metadata must be created about the data, and the richer and more automated the metadata creation, the better.

We have a stunningly wide variety of tools available to look at data and create rich metadata automatically. We use sentiment analysis on social feeds, image recognition on video, and pattern recognition on IoT streams; even simple textual search can be powerful when it is well informed.

Sadly, in most enterprise environments, automated metadata creation from potentially useful data is in dismal shape. It usually means pairing expensive coding experts with expensive domain experts to define and create static interpretations of data.

As a result, it isn’t done often, and when it’s done, it requires constant attention. Better technology helps greatly.

Semantic AI uses natural language processing (NLP) to have domain experts converse directly with software, using the specialized language that they are most comfortable with.

Semantic AI eliminates the need to translate complex concepts through a coding expert, so it is inherently more agile and accurate as a result. It is widely in use today in a variety of pursuits where specialized interpretations of data are important.

How We Keep Data and Encoded Knowledge Together Today

The last part of our three-part problem is making sure that any time data is being used, it is consumed alongside everything that is known about it. That might be a useful definition, important concepts, how those relate to others, rules regarding security or privacy, and so on.

Just to be clear: data without usable knowledge about the data is of limited use, and can create avoidable forms of risk. Also, usable knowledge about data is of limited use if it isn’t readily available when and where the data is being consumed: informed search, contextual applications, grounded analytics, etc.

Many smart technology teams have encountered this “connect data with everything we know about it” challenge in one form or another, as it shows up in many places and in many ways.

Maybe you are personally familiar with such an effort?

Most set out to integrate the three functional components through clever software, and fail from sheer entropy. Because the integration cannot be agile, despite their best intentions, it can’t keep up with the real world, and the project is abandoned for the time being.

However, if one stores data and knowledge about data (metadata) together as a single entity, the problem is neatly solved, creating data agility in the process.

When you change the metadata, you immediately change how the data is interpreted everywhere it is consumed.

Ideally, you’d create your “data knowledge” in the form of a semantic knowledge graph represented as metadata, using semantic AI to encode and decode your unique knowledge about data more quickly and effectively, in whatever specialized language people are already using today.
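As a final sketch, assuming a simple JSON-style document rather than any particular product’s storage format, keeping data, metadata, and meaning together might look like this: the record, the metadata generated about it, and the relevant fragment of the knowledge graph travel as one entity, so the knowledge is available wherever the data is consumed. The field names and URIs are hypothetical.

    # One self-describing entity: data + metadata + meaning.
    entity = {
        "data": {
            "customer_email": "ada@example.com",
            "order_total": 42.50,
        },
        "metadata": {
            "source": "crm-export",
            "ingested_at": "2022-09-15T09:00:00Z",
            "classifications": {
                "customer_email": ["http://example.org/meta/ContactDetail"],
            },
        },
        "triples": [
            ("http://example.org/meta/ContactDetail",
             "rdfs:subClassOf",
             "http://example.org/meta/PersonallyIdentifiableInformation"),
        ],
    }

    PII = "http://example.org/meta/PersonallyIdentifiableInformation"

    def is_pii(entity, field):
        # Check the field's classifications against the knowledge that travels with it.
        categories = set(entity["metadata"]["classifications"].get(field, []))
        return any(s in categories and o == PII for s, _p, o in entity["triples"])

    print(is_pii(entity, "customer_email"))  # True
    print(is_pii(entity, "order_total"))     # False

Update the classifications or triples, and every consumer’s interpretation of the data changes with it, without touching the data itself.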

That leads us to the idea of a semantic data platform, one where active data, active metadata, and active meanings can be kept together at all times.

More about that later.

Jeremy Bentley

Jeremy Bentley is the founder of Semaphore, creators of the Semaphore semantic AI platform, and joined MarkLogic as part of the Semaphore acquisition. He is an engineer by training and has spent much of his career solving enterprise information management problems. His clients are innovators who build new products using Semaphore’s modeling, auto-classification, text analytics, and visualization capabilities. They are in many industries, including banking and finance, publishing, oil and gas, government, and life sciences, and have in common a dependence on their information assets and a need to monetize, manage, and unify them. Prior to Semaphore, Jeremy was Managing Director of Microbank Software, a New York-based fintech firm acquired by Sungard Data Systems. Jeremy has a BSc with honors in mechanical engineering from Edinburgh University.