The Next Critical Step for AI: Eliminate Data Bias

by Philip Miller Posted on April 13, 2023

Artificial Intelligence (AI) has a great capacity for good. I believe human-driven AI will probably be one of the greatest tools humanity has ever developed. But fulfilling that potential requires us to do the hard work—now. This begins with ensuring the data our systems ingest are comprehensive and free of bias. The good news is that technology can and should help.

Data Bias—A Real-World Example

The typical enterprise won’t gain much benefit from AI trained on data scraped randomly off the internet. Business value comes with AI trained on an organization’s own data, which is also where bias can creep in. Flawed data sets produce flawed AI decisions, and these can have drastic consequences:

A woman in the United States took sleeping tablets on her doctor's advice, which was based on the manufacturer's own guidelines. The next morning she drove to work, was pulled over and was later arrested. The issue? The medication from the night before was still in her system, so she was driving under the influence. She fought the charges in court, where it was revealed that the manufacturer's guidelines her physician had relied on were developed using data solely from male test subjects. Because men tend to have faster metabolisms, certain medicines leave their systems far faster than they leave women's. In this case, biased medical data led to bad medicine and a scary legal entanglement.

How to Avoid Biased Datasets

To avoid biased data, or at the very least mitigate it, companies should take two important steps. First, ingest the widest possible array of data. This includes vast amounts of their own proprietary raw data, structured and unstructured, drawn from every possible company source, such as documents, Excel files, research, financials, regulatory data, historical data and benchmarks. Second, apply controls, enabled by meta-tagging the data with contextual information.

To accelerate this process, companies need a tool that enables the data to be ingested with the necessary context applied. This has historically been the role of subject matter experts. However, processing data at scale requires a rules-based engine to classify data with the proper taxonomies and ontologies, thus providing the context behind the data, which can so often expose the bias.
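As a rough illustration of what rules-based tagging at ingestion could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `TAXONOMY` mapping, the `TaggedDocument` type and the `classify` function are hypothetical names, and simple keyword matching stands in for the far richer taxonomies and ontologies a real classification engine would apply.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical taxonomy: concept label -> keywords that signal it.
# A production engine would use managed taxonomies and ontologies instead.
TAXONOMY = {
    "clinical-trial": ["trial", "dosage", "placebo"],
    "financial": ["revenue", "quarterly", "balance sheet"],
    "regulatory": ["compliance", "regulation", "audit"],
}


@dataclass
class TaggedDocument:
    text: str
    source: str  # where the record came from, kept as context
    tags: List[str] = field(default_factory=list)


def classify(text: str, source: str) -> TaggedDocument:
    """Attach every taxonomy concept whose keywords appear in the text."""
    lowered = text.lower()
    tags = sorted(
        concept
        for concept, keywords in TAXONOMY.items()
        if any(keyword in lowered for keyword in keywords)
    )
    return TaggedDocument(text=text, source=source, tags=tags)


doc = classify(
    "Dosage guidance from the 2019 trial, pending regulation review.",
    "research/notes.txt",
)
print(doc.tags)  # ['clinical-trial', 'regulatory']
```

Because each record carries its tags and source, a reviewer can later ask which concepts and which sources a training set actually draws on, and that is often where a skew becomes visible.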

This process enables businesses to consider not only the validity of the algorithm but also the source data used to train it. Oversight is where humans can help keep AI decision-making on track. For example, we wouldn’t teach an algorithm that 2+2=5. But that’s exactly what we’re doing if we don’t ensure the data we use for AI is clean, sensible and has the proper metadata context.

Infusing AI with internal data already shows great promise. The training data for BloombergGPT™ is reported to be 52% proprietary or cleaned financial data. Bloomberg's study found, “the BloombergGPT model outperforms existing open models of a similar size on financial tasks by large margins, while still performing on par or better on general natural language processing benchmarks.” This is just one example, but it shows how powerful integrating internally sourced data sets can be.

AI Still Needs Humans

Regardless of where the data comes from, AI lacks the moral compass and ethical context that human decisions organically include.

To compensate for this gap, we must ask the right questions and include those rationales in our data sets. AI algorithms also need to be trained across cultures, ages and genders, as well as a host of other parameters to account for bias. The cleaner the data points used, the more sound the decision.
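One simple way to check whether a data set covers the groups it is meant to serve is to measure each group's share of the records before training. The sketch below is only an illustration under stated assumptions: the record fields (`sex`, `age_band`), the helper names and the 30% threshold are all hypothetical, and the sample data simply echoes the male-only trial described earlier.

```python
from collections import Counter


def representation(records, attribute):
    """Return each group's share of the records for one attribute."""
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, expected_groups, min_share):
    """List expected groups that are missing or fall below min_share."""
    return sorted(g for g in expected_groups if shares.get(g, 0.0) < min_share)


# Hypothetical trial records: every subject is male.
records = [{"sex": "male", "age_band": "30-49"} for _ in range(9)]
records.append({"sex": "male", "age_band": "50-69"})

shares = representation(records, "sex")
print(shares)  # {'male': 1.0}
print(underrepresented(shares, ["female", "male"], min_share=0.3))  # ['female']
```

A check like this does not prove a data set is fair, but it flags the most obvious gaps early, when a human can still decide whether more data is needed.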

The “wisdom of crowds” theory puts forth, in brief, that the more data points you combine about a particular question, the more “right” your resulting answer will be. This holds even when crowd-sourced decisions are compared to those of experts. Stripped to its core, AI takes a reasonable guess based on the data it has. Accuracy, therefore, comes from aggregating the data points and balancing the wrong and the right to discern the most probable answer. But AI can’t govern itself. It takes diverse and critical human thinking, weighing many factors, to ensure the decisions we get via AI’s advanced decision-making are for the good of the whole, rather than biased to the few.
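The aggregation idea can be sketched with a small simulation. This is an illustration under assumptions, not a model of any real system: the true value, the spread of the guesses and the shared skew of 15 are invented numbers, and the guesses are drawn from a seeded random generator so the run is repeatable.

```python
import random
from statistics import mean


def crowd_error(guesses, truth):
    """Absolute error of the crowd's averaged guess."""
    return abs(mean(guesses) - truth)


def typical_individual_error(guesses, truth):
    """Average absolute error of a single guess."""
    return mean(abs(g - truth) for g in guesses)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
truth = 100.0

# Independent guesses scattered around the truth.
unbiased = [rng.gauss(truth, 20) for _ in range(1000)]
# A crowd of the same size whose guesses all share a skew of +15.
biased = [rng.gauss(truth + 15, 20) for _ in range(1000)]

print("unbiased crowd error:", crowd_error(unbiased, truth))
print("typical individual error:", typical_individual_error(unbiased, truth))
print("biased crowd error:", crowd_error(biased, truth))
```

Averaging cancels errors that point in different directions, so the crowd's answer lands much closer to the truth than a typical individual guess. It cannot cancel an error that every data point shares, which is why the biased crowd stays roughly as far off as its skew no matter how many guesses are combined.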

A Transparent Way Forward

As the world of data grows, businesses need scalable solutions to process and manage it all. There is a limit to how much information a human brain can process, and repeatedly retaining subject matter experts is impractical. Achieving unbiased data requires an agile, transparent, rules-based data platform where data can be ingested, harmonised and curated for the AI tool. If businesses and their AI teams are to move forward responsibly, they need a replicable, scalable way to ensure AI algorithms are trained with clean, quality data. Preferably, their own proprietary data.

In my next blog, I am going to look at another feature that any data platform should have to help remove data bias and add further transparency to the data: bi-temporality. That piece will look at how it can be leveraged to provide data provenance and lineage throughout the life cycle of the data.

Data Bias Survey Results

For more information on the state of data bias in business today, and to gain insight into how to avoid and address data bias in your own organization, read the highlights from our data bias survey.

Philip Miller

Philip Miller serves as the Senior Product Marketing Manager for AI at Progress. He oversees the messaging and strategy for data and AI-related initiatives. A passionate writer, Philip frequently contributes to blogs and lends a hand in presenting and moderating product and community webinars. He is dedicated to advocating for customers and aims to drive innovation and improvement within the Progress AI Platform. Outside of his professional life, Philip is a devoted father of two daughters, a dog enthusiast (with a mini dachshund) and a lifelong learner, always eager to discover something new.
