LLMs, Large Numbers and Noisy Data: Why Bigger Isn’t Always Better in AI

October 03, 2023 Data & AI, MarkLogic, Semaphore

Enterprises should understand their use of AI, including Large Language Models and Medium Language Models, and how data of any volume or shape influences its output, so they can increase data accuracy, trustworthiness and transparency.

Remember the old adage "garbage in, garbage out"? It still holds true today, especially with the rise of AI and Large Language Models in business and the copious amounts of data these technologies use as their foundation.

According to a recent Accenture report (2023), 73% of businesses say AI is their top digital investment priority. While enterprises want to capitalize on the potential of AI, many proceed with caution in how they implement it across their organizations, recognizing that bigger isn't always better and that ever-larger models and datasets can introduce data biases and inaccuracies.

In this blog, I share my take on why we must understand ourselves, our data and our businesses to mitigate the potential pitfalls that arise when dealing with this potentially world-changing technology.

The Impact of Noisy Data in Generative AI

Data bias often arises from noisy data, which can impact a company's performance: its forecasting, decision making, resources and customer experiences. But what is noisy data? Let's look at TechTarget's definition:

The term “noisy data” is often used synonymously with “corrupt data.” However, it has broadened its meaning to include any type of data that machines cannot read or interpret correctly, such as “unstructured data”. In other words, any data that’s been received, saved, or modified in such a way that it’s impossible for the program that created it to read or use it can be classified as noisy.

Keeping that in mind, let's explore where noisy data comes from, how it is linked to the sheer volume of data AI needs for training, and why our human sense of scale breaks down at these volumes, leaving us to rely on technology to make sense of them.

Large Numbers, Infinity and AI

Humans have rarely needed to count beyond tens, hundreds or maybe thousands. When hunting, we pursued game in relatively small numbers. The threats we had to avoid were usually singular in nature. When we manipulated the world, we were moving, planting or constructing a relatively small number of objects. So our brains have never, until recently in evolutionary terms, had to deal with sets of 1,000 or even 10,000+ things, and that past experience has left us unable to imagine the upper bounds of the numbers coming from our modern data-led world.

Whilst the human experience is limited in terms of numbers and dimensions, AI technologies can work with numbers far beyond our comprehension and have no trouble looking beyond our four-dimensional world. The model behind the current iteration of ChatGPT represents each word as a vector with 12,288 dimensions, with each dimension loosely capturing some aspect of the word (softness, frequency, register, etc.) and holding the value the model ascribes to that property. We can't visualize those dimensions, and ChatGPT can't "show" us what they look like, but that is how it and other LLMs (Large Language Models) "see" their universe.
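To make the idea concrete, here is a minimal sketch, not the real model: toy 4-dimensional vectors with invented axis labels stand in for the 12,288 learned dimensions mentioned above, and cosine similarity shows how words the model treats as related end up pointing in similar directions.

```python
import numpy as np

# A toy illustration: real models use thousands of dimensions; here we
# use made-up 4-dimensional vectors so they fit on screen. The axis
# "meanings" below are invented for illustration only -- in a real model
# the dimensions are learned and not directly interpretable.
embeddings = {
    #                  softness, frequency, register, animacy (hypothetical)
    "kitten": np.array([0.90, 0.30, 0.20, 0.95]),
    "cat":    np.array([0.70, 0.80, 0.30, 0.95]),
    "stone":  np.array([0.05, 0.50, 0.40, 0.00]),
}

def cosine_similarity(a, b):
    """How closely two word vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["kitten"], embeddings["cat"]))    # high
print(cosine_similarity(embeddings["kitten"], embeddings["stone"]))  # lower
```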

Where our experience of the universe does converge with AI's is infinity. Neither we nor AI can accurately visualize, draw or even imagine what infinity looks like; it's too big, too vast, beyond the bounds of our existence. Both of us can use it in math equations and calculations, but when it comes to manifesting it in our respective existences, we're just not there yet.

So, why talk about infinity? Well, weird things happen when you go to larger and larger numbers, and so it is when you use larger and larger data sets for training AI.

Large numbers play tricks on our minds and go beyond comprehension. If I said I could give you, in cash, either 10,000 or …9,999,999 with the nines going on forever to the left, which one would you choose? You'd go with all the nines, right? Well, what if I told you that the infinite string of nines equals -1? Something similar happens with 0.999999…, the nines going off to the right of the decimal point, only now you'd be better off, as this number equals one. The first result comes from the 10-adic number system; don't worry if you don't understand it, there's a great explanation of this here.
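A quick worked sketch of both results; the first assumes 10-adic arithmetic, where carries propagate endlessly to the left, while the second is ordinary algebra over the real numbers:

```latex
\begin{aligned}
\text{Let } x &= \ldots 9999 \text{ (nines repeating to the left).}\\
x + 1 &= \ldots 0000 = 0 \;\Rightarrow\; x = -1.\\[4pt]
\text{Let } y &= 0.9999\ldots \text{ (nines repeating to the right).}\\
10y - y &= 9.999\ldots - 0.999\ldots = 9 \;\Rightarrow\; 9y = 9 \;\Rightarrow\; y = 1.
\end{aligned}
```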

But again, all these numbers pale in comparison to infinity. It is said that, due to the random nature of quantum mechanics and probability, if our universe were infinite, with an endless number of atoms, then if you started traveling through space in a straight line you would eventually come across another Earth, identical in every aspect to our own, including your real-life doppelganger, who has lived their life in the same manner and is reading this article right now.

Now in that universe, and even in our own, this sheer quantity of numbers or data is too much for our brains to handle. It creates confusion and almost overwhelms us; so much of it isn't meaningful or accessible to us. And that is where we start to converge with AI. As we feed more data to our AI, whether it be ChatGPT or any other model, the noise in the data (aberrations, errors, things the AI doesn't need or want) can interfere with what we're looking for in the output (the signal). And when we add data that is inaccessible, such as unstructured data in some cases, this noise only increases.
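As an illustration of how noise drowns out signal, here is a minimal sketch with synthetic data; the numbers and the corruption rule are invented, but the pattern is general: the more corrupted records the model is fed, the further its answer drifts from the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "signal": a simple underlying relationship y = 2x,
# plus a little honest measurement noise.
x = rng.uniform(0, 10, 1_000)
y_clean = 2 * x + rng.normal(0, 0.5, x.size)

def fitted_slope(corrupt_fraction):
    """Fit a line after corrupting a share of records with garbage values,
    as might happen when ingesting unvetted or unreadable data."""
    y = y_clean.copy()
    n_bad = int(corrupt_fraction * y.size)
    y[:n_bad] = rng.uniform(-100, 100, n_bad)  # corrupted entries
    slope, _ = np.polyfit(x, y, 1)             # least-squares fit
    return slope

for frac in (0.0, 0.1, 0.3, 0.5):
    print(f"{frac:.0%} corrupt -> estimated slope {fitted_slope(frac):.2f}")
# The true slope is 2.0; as corruption grows, the estimate drifts away.
```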

Generative AI for the Enterprise

Enterprises should look deeply at the data they use to train their AIs. They should be cleaning, curating, harmonizing and modelling their proprietary data before the AI ever sees it, so that noise is reduced and the volume of data required is significantly smaller. This will not only remove most of the noise from the output, but also reduce the cost of training the AI, moving closer to an MLM (Medium Language Model).
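Here is what that pre-training cleanup might look like in miniature, with hypothetical field names and invented validation rules standing in for your own canonical model and data-quality checks:

```python
# A minimal, hypothetical cleanup pass: harmonize raw records onto one
# canonical shape, drop records that fail validation, and deduplicate.
RAW_RECORDS = [
    {"CustName": "  Acme Corp ", "rev": "1,200,000", "country": "US"},
    {"CustName": "Acme Corp", "rev": "1,200,000", "country": "US"},  # duplicate
    {"CustName": "Globex", "rev": "N/A", "country": ""},             # unusable
]

def harmonize(record):
    """Map raw fields onto a canonical shape and normalize values."""
    name = record["CustName"].strip()
    rev = record["rev"].replace(",", "")
    if not name or not record["country"] or not rev.isdigit():
        return None  # drop records that fail validation
    return {"customer": name, "revenue_usd": int(rev),
            "country": record["country"]}

seen, curated = set(), []
for raw in RAW_RECORDS:
    rec = harmonize(raw)
    if rec and (key := (rec["customer"], rec["country"])) not in seen:
        seen.add(key)  # deduplicate on a natural key
        curated.append(rec)

print(curated)  # one clean Acme Corp record; the rest is noise, removed
```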

Businesses need to ensure that the data platform they use for AI, whether it be an LLM, MLM or some other AI model, is scalable, multi-model and secure. Look for extra capabilities too: can the platform handle metadata, meaning extract facts from entities in the data, combine that data with its metadata, its location in a taxonomy, the ontology around it, and its links and relationships to other data, and harmonize everything into the correct canonical model for the AI? Because the platform may use third-party data or handle sensitive data, it's important to pay attention to security as well. This includes providing an auditable trail, so that changes made to the data can be traced back to their source should any issues arise when presenting those changes to the AI.
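As a sketch of the kind of structure this implies, here is one hypothetical "envelope" shape, with invented field names rather than any specific product schema, that keeps the harmonized data, its provenance and its semantic relationships together:

```python
# A hypothetical "envelope" record: field names are illustrative only.
envelope = {
    "instance": {                    # data harmonized to the canonical model
        "customer": "Acme Corp",
        "revenue_usd": 1200000,
    },
    "metadata": {
        "source": "crm-export-2023-09.csv",  # provenance for the audit trail
        "ingested_at": "2023-10-01T12:00:00Z",
        "modified_by": "etl-pipeline-v2",
    },
    "semantics": {                   # facts and relationships as triples
        "taxonomy": "/industries/manufacturing",
        "triples": [
            ("Acme Corp", "isA", "Customer"),
            ("Acme Corp", "locatedIn", "US"),
            ("Acme Corp", "subsidiaryOf", "Acme Holdings"),
        ],
    },
}
```

Keeping provenance and semantics alongside the harmonized instance is what makes the audit trail possible: any value the AI sees can be traced back to its source.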

Covering all these bases often requires multiple technologies. However, if you simply stitch together several different systems, you will end up with a fragile architecture that is difficult to maintain and manage. That is why you should look for a data platform that can evolve and change as new data and systems are added.

We have a long way to go before we perfect our understanding of AI, LLMs and MLMs and how data in any volume or shape can influence the output. But having the right data technology in place is a must if we want to reduce the noise and make our AI as performant as possible. We must do everything we can to ensure the AIs we create along the way give us the clearest signals and, importantly, the most correct answers possible.

Conclusion

Companies are already investing in the accuracy, transparency, trustworthiness and security of these AI systems and embedding them in their businesses to improve operations and efficiency. Find out more about how you can benefit from the combined power of Progress MarkLogic and Progress Semaphore to achieve increased security, improved trustworthiness, cost savings and intuitive prompt creation and response understanding.

Philip Miller

Philip Miller serves as the Senior Product Marketing Manager for AI at Progress. He oversees the messaging and strategy for data and AI-related initiatives. A passionate writer, Philip frequently contributes to blogs and lends a hand in presenting and moderating product and community webinars. He is dedicated to advocating for customers and aims to drive innovation and improvement within the Progress AI Platform. Outside of his professional life, Philip is a devoted father of two daughters, a dog enthusiast (with a mini dachshund) and a lifelong learner, always eager to discover something new.
