The Next Critical Step for AI: Eliminate Data Bias

April 13, 2023 Data & AI, MarkLogic

Artificial Intelligence (AI) has a great capacity for good. I believe human-driven AI will probably be one of the greatest tools humanity has ever developed. But fulfilling that potential requires us to do the hard work—now. This begins with ensuring the data our systems ingest are comprehensive and free of bias. The good news is that technology can and should help.

Data Bias—A Real-World Example

The typical enterprise won’t gain much benefit from AI trained on data scraped randomly off the internet. Business value comes with AI trained on an organization’s own data, which is also where bias can creep in. Flawed data sets produce flawed AI decisions, and these can have drastic consequences:

A woman in the United States took sleeping tablets, following her doctor’s advice based on the manufacturer’s own guidelines. The next morning, she rose and drove to work, but got pulled over—and later arrested. The issue? The prior night’s medication still in her system left her driving under the influence. She fought the charges in court where it was later revealed the medicine guidelines her physician gave her, based on the advice from the manufacturer, were developed using data solely from male test subjects. With faster metabolisms, certain medicines exit the systems of men far faster than women. In this case, biased medical data led to bad medicine and a scary legal entanglement.

How to Avoid Biased Datasets

To avoid biased data, or at the very least mitigate its prevalence, companies should follow two important steps. First, the widest array of data needs to be ingested. This includes vast amounts of their own, proprietary raw data, structured and unstructured, drawing upon every possible company source, such as documents, excel files, research, financials, regulatory data, historical data and benchmarks. Second, controls are required, enabled by meta-tagging data with contextual information.

To accelerate this process, companies need a tool that enables the data to be ingested with the necessary context applied. This has historically been the role of subject matter experts. However, processing data at scale requires a rules-based engine to classify data with the proper taxonomies and ontologies, thus providing the context behind the data, which can so often expose the bias.

This process enables businesses to not only consider the validity of the algorithm, but really, the source data used to train the algorithm as well. Oversight is where humans can help keep the AI decisioning on track. For example, we wouldn’t teach an algorithm that 2+2=5. But that’s exactly what we’re doing if we don’t ensure the data we use for AI is clean, sensible and has the proper metadata context.

Infusing AI with internal data already shows great promise. BloombergGPT™ is reported to be 52% proprietary or cleaned financial data. Its study found, “the BloombergGPT model outperforms existing open models of a similar size on financial tasks by large margins, while still performing on par or better on general natural language processing benchmarks.” This is just one example but shows how powerful integrating internally sourced data sets can be.

AI Still Needs Humans

Regardless of where the data comes from, AI lacks a moral compass and ethical context that human decisions organically include.

To compensate for this gap, we must ask the right questions and include those rationales in our data sets. AI algorithms also need to be trained across cultures, ages and genders, as well as a host of other parameters to account for bias. The cleaner the data points used, the more sound the decision.

The “wisdom of crowd” theory puts forth, in brief, that the more data points you combine about a particular question, the more “right” your resulting answer. This even holds when crowd-sourced decisions are compared to experts. Stripped to its core, AI takes a reasonable guess based on the data it has. Accuracy, therefore, comes from aggregating the data points and balancing the wrong and the right to discern the most probable. But AI can’t govern itself. It takes diverse and critical thinking, weighing many factors to ensure the decisions we get via AI’s advanced decision-making are for the good of the whole, rather than biased to the few.

A Transparent Way Forward

As the world of data grows, businesses need scalable solutions to process and manage it all. There is a limit to how much information a human brain can process. And repeatedly retaining subject matter experts is impractical. Achieving unbiased data requires an agile, transparent, rules-based data platform where data can be ingested, harmonised and curated for the AI tool. If businesses and their AI teams are to responsibly move forward, they need a replicable, scalable way to ensure AI algorithms are trained with clean, quality data. Preferably, their proprietary own.

In my next blog, I am going to look at another feature that any data platform should have to help remove data bias and add further transparency to the data: bi-temporality. That piece will look at how it can be leveraged to provide data provenance and lineage throughout the life cycle of the data.

Data Bias Survey Results

For more information on the state of data bias in business today, and to gain insight into how to avoid and address data bias in your own organization, read the highlights from our data bias survey.

Read the blog

Philip Miller

Philip Miller serves as the Senior Product Marketing Manager for AI at Progress. He oversees the messaging and strategy for data and AI-related initiatives. A passionate writer, Philip frequently contributes to blogs and lends a hand in presenting and moderating product and community webinars. He is dedicated to advocating for customers and aims to drive innovation and improvement within the Progress AI Platform. Outside of his professional life, Philip is a devoted father of two daughters, a dog enthusiast (with a mini dachshund) and a lifelong learner, always eager to discover something new.

Read next How to Leverage Your Own Data to Improve AI Trust & Confidence