If your Artificial Intelligence (AI) project is bogged down with a wasteful, time-consuming and expensive data-preparation strategy, it could put your career at risk. Here is how you can anticipate and optimize your data-access needs to more effectively determine AI project outcomes, while avoiding potential technology pitfalls.
Artificial Intelligence is a broad topic that encompasses natural language processing, speech and facial recognition, self-driving cars, fraud detection, medical diagnoses and much more.
The data needs for creating and operating these applications are not as broad as the use cases themselves, but they are still varied and can be complicated.
At one end of the spectrum, machine learning applications may simply require access to a very large set of streamed data from a small number of data sources that they then analyze to find patterns. While the data sets are large, little processing is needed to isolate and make the data usable. However, at the other end of the spectrum, there are many AI applications with far more demanding data needs.
For example, suppose a bank builds an insight engine designed to understand its customers’ needs, background and goals in order to provide useful suggestions to them. To be fully useful, the insight engine may need access to a customer’s portfolio data, real-time access to trading activity, age and information about their family situation, balances in accounts and outstanding loans and associated payments. With this information, the insight engine can predict likely investments that customers may be interested in making, what kind of advertisements are likely to provide useful information and possible warnings about missing payments or hitting account limits.
The data and processing needs of this kind of application can be extremely demanding. When testing approaches, data must be pulled from many different and often overlapping sources and processed to make it understandable to the application.
What makes these data needs especially difficult is that data-access requirements change constantly during initial development, and it is often impossible to anticipate them all in advance.
Since many AI technologies are new—both in general and also new to specific organizations—hypothesis testing is essential. Each hypothesis needs its own data.
In the early stages of development, new hypotheses will be regularly created, and in many AI projects like this, data preparation becomes a major effort and cost.
AI researchers do not want to become database administrators or ETL experts. They prefer to apply their data-modeling expertise to the needs of the application rather than to general-purpose data integration, and to spend their time on research: creating hypotheses, testing them and putting successful systems into production.
Integrating data from complex, overlapping data silos can take years, and most big data projects fail or come in over budget. This is because, in a relational environment, data integration requires creating a common data model for all the data and writing ETL code to pull the primary data into this format before development can begin. For big data projects, that modeling and ETL work alone can take a couple of years.
Doing all this work before you can determine if you are following an optimal approach is not efficient. The final implementation may try dozens of approaches before determining which works the best. Researchers need a “succeed or fail-fast” approach where ideas can be tested with minimal effort—moved forward if they work, and discarded if they don’t. NoSQL technology, especially MarkLogic®, is very friendly to a succeed or fail-fast approach. Instead of being required to understand, model and process data before development, you can load your data “as is” and immediately begin accessing and using it. With a sophisticated search, appropriate records can be identified and accessed with no processing.
Because the database treats schemas as data and loads the schema information of each record as the record is loaded, power users can perform structured queries against data without formally explaining the underlying data models to the database and without any ETL processing.
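To make the "schema as data" idea concrete, here is a minimal sketch in plain Python. It is not MarkLogic's API or its indexing implementation; the `AsIsStore` class, the path syntax and the sample records are all assumptions made up for illustration. The point it shows is that if the store records the structure of each document at load time, a structured query can run without anyone declaring a schema first.

```python
from collections import defaultdict


def iter_paths(value, prefix=""):
    """Yield (path, leaf_value) pairs for every leaf in a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_paths(child, f"{prefix}/{key}")
    elif isinstance(value, list):
        for child in value:
            yield from iter_paths(child, prefix)
    else:
        yield prefix, value


class AsIsStore:
    """Hypothetical toy store that indexes each record's structure on load."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)  # (path, value) -> set of document ids

    def load(self, doc_id, record):
        # No schema is declared: the paths found in the record are the schema.
        self.docs[doc_id] = record
        for path, leaf in iter_paths(record):
            self.index[(path, leaf)].add(doc_id)

    def find(self, path, value):
        """Structured lookup: documents whose element at `path` equals `value`."""
        return sorted(self.index.get((path, value), set()))


store = AsIsStore()
# Records from two different sources, loaded "as is" with different shapes.
store.load("crm-1", {"customer": {"id": 42, "name": "Ada", "age": 37}})
store.load("crm-2", {"customer": {"id": 43, "name": "Grace"}})
store.load("trade-9", {"trade": {"customerId": 42, "symbol": "XYZ", "qty": 100}})

print(store.find("/customer/id", 42))      # ['crm-1']
print(store.find("/trade/symbol", "XYZ"))  # ['trade-9']
```

A real database does far more than this (range queries, full-text search, transactions), but the sketch captures why no up-front modeling step is needed before the first query.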
If there is modeling or ETL necessary to make the data usable, it can be limited to only what is needed for the specific requirements, and those enhancements remain available for future requirements. For example, if it is necessary to create triples that link related data to make the data easier to access, then once those triples are created, they can be available to all future projects.
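As a rough illustration of the triples example, the sketch below uses a tiny in-memory triple store written in plain Python. The `TripleStore` class, the predicate names (`ownsAccount`, `securesLoan`) and the identifiers are hypothetical and stand in for whatever a real semantic store would provide. It shows how links created once for one hypothesis can be queried again by a later project.

```python
class TripleStore:
    """Hypothetical toy store of (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the given parts; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


def loans_for(store, customer):
    """Follow customer -> account -> loan links in two hops."""
    accounts = [o for _, _, o in store.match(subject=customer, predicate="ownsAccount")]
    return sorted(
        loan
        for account in accounts
        for _, _, loan in store.match(subject=account, predicate="securesLoan")
    )


links = TripleStore()
# Links created once, for the first project that needed them.
links.add("customer:42", "ownsAccount", "account:A1")
links.add("customer:42", "ownsAccount", "account:A2")
links.add("account:A2", "securesLoan", "loan:L7")

# A later project reuses the same links for a different question.
print(loans_for(links, "customer:42"))  # ['loan:L7']
```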
Another common example is that if different data sources use different units or currencies (e.g., dollars vs. pounds), it might be helpful to create a canonical version of the element and make it part of the database.
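A canonical element of this kind might look like the following sketch. The field names, the `RATES_TO_USD` table and the exchange rate in it are invented for illustration only; a production system would look rates up by date from an authoritative source. Note that the original value and currency are kept alongside the canonical one, so nothing is lost.

```python
from decimal import Decimal

# Hypothetical fixed rates to USD, for illustration only.
RATES_TO_USD = {"USD": Decimal("1"), "GBP": Decimal("1.25")}


def add_canonical_balance(record):
    """Return a copy of the record with a canonical USD balance added."""
    rate = RATES_TO_USD[record["currency"]]
    enriched = dict(record)  # leave the source record untouched
    enriched["balance_usd"] = (Decimal(record["balance"]) * rate).quantize(Decimal("0.01"))
    return enriched


us_account = {"account": "A1", "balance": "1500.00", "currency": "USD"}
uk_account = {"account": "A2", "balance": "800.00", "currency": "GBP"}

for account in (us_account, uk_account):
    enriched = add_canonical_balance(account)
    print(enriched["account"], enriched["balance"], enriched["currency"], enriched["balance_usd"])
```

Once the canonical element is stored with the record, every later query can compare balances directly instead of repeating the conversion.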
Once testing is complete and the final hypotheses have been selected, a full-fledged data integration can be performed with data harmonization conducted to provide a full canonical view of the data needed for the application.
Even here, the effort will be far less than with traditional applications. This is because the final application will not require modeling or processing all of the data elements in the data sources. A given database may have 5,000 columns in its various record definitions, but a given application may only need to access a few hundred of these. With NoSQL and MarkLogic, it is only necessary to model these few hundred, and the other columns do not have to be thrown away, so they remain available if they are needed later.
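One way to picture "model only what you need, keep everything else" is to wrap each source record in an envelope that holds a small canonical section next to the untouched original. The sketch below is plain Python under that assumption; the column names and the `FIELD_MAP` mapping are hypothetical, and this is not presented as the way any particular product stores data.

```python
# Hypothetical mapping from source column names to canonical names.
# Only the columns the application needs are listed here.
FIELD_MAP = {"CUST_NM": "name", "CUST_DOB": "birth_date"}


def wrap(source_row, field_map=FIELD_MAP):
    """Build an envelope: harmonized fields plus the full original row."""
    canonical = {
        target: source_row[column]
        for column, target in field_map.items()
        if column in source_row
    }
    return {"canonical": canonical, "source": dict(source_row)}


row = {"CUST_NM": "Ada", "CUST_DOB": "1988-03-01", "LEGACY_FLAG_17": "Y"}

doc = wrap(row)
print(doc["canonical"])  # {'name': 'Ada', 'birth_date': '1988-03-01'}

# Later, a new requirement needs one more column. Nothing was thrown away,
# so the mapping is simply extended and the envelope rebuilt.
doc_v2 = wrap(row, {**FIELD_MAP, "LEGACY_FLAG_17": "legacy_flag"})
print(doc_v2["canonical"]["legacy_flag"])  # Y
```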
Relational technology allows for a similar limitation of effort by building a separate database for each use case. The database for the use case can be very limited with only those data columns needed by the application. This approach, while making faster development possible, has major drawbacks. First, if an application decides it needs data elements that were not originally specified, it is difficult to add them to the data set. With MarkLogic, all of the data is there and it is always available. Developers simply decide which data elements they need to harmonize and enrich. Incremental additions are easy.
Another, more important drawback is that separate databases for separate applications lead to endless new silos of data. Enhancements made to one silo do not appear in others. If there is a change in the primary datastore, it is necessary to incorporate fixes to handle it in all of the silos. Over time, data-quality issues will appear because different processing or update policies cause the same data to diverge from silo to silo. With MarkLogic, all of the data is stored once and all of the users of the system can access that single datastore. With no silos, enhancements to the data can be made available to all users without duplication of effort.
Many AI projects seek to avoid data-integration issues by using Hadoop as the staging ground for the data. The idea is that users can just load the data and write routines to pull out what is needed for the project. For some AI projects, especially machine learning projects that learn by processing large amounts of simple streaming data, this approach can work, although setting up a Hadoop infrastructure will still be a major effort.
In other AI projects, much of the data consumed needs to be highly structured and pulled from complex, relational sources while other data may be unstructured. Here, a simple Hadoop implementation will be suboptimal for several reasons. First, most Hadoop infrastructures do not give users the ability to query data “as is” with structured queries against the implied schemas in imported data. Instead, out-of-the-box search/query capabilities are often limited to search from Solr/Elastic, which is much less powerful and precise.
Second, building a reusable infrastructure in which any modeling and ETL performed on previous projects remains available to later ones is a major effort in most Hadoop environments.
Finally, many AI projects of this type depend on access to personal data. With Hadoop-based infrastructures, it can be difficult to secure access to Personally Identifiable Information (PII). With MarkLogic, document- and element-level security can be implemented with little effort.
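To illustrate what element-level security means in practice, here is a small Python sketch of role-based filtering of individual fields. It is only a conceptual model: the `FIELD_ROLES` policy, the role names and the `visible_view` function are assumptions for this example and do not represent MarkLogic's security API or configuration.

```python
# Hypothetical policy: field name -> role required to read that field.
# Fields not listed are readable by anyone who can read the document.
FIELD_ROLES = {"ssn": "pii-reader", "birth_date": "pii-reader"}


def visible_view(document, user_roles):
    """Return only the fields the caller's roles allow them to read."""
    return {
        field: value
        for field, value in document.items()
        if FIELD_ROLES.get(field) is None or FIELD_ROLES[field] in user_roles
    }


customer = {
    "name": "Ada",
    "ssn": "000-00-0000",  # placeholder value, not real data
    "birth_date": "1988-03-01",
    "segment": "retail",
}

analyst_view = visible_view(customer, {"analyst"})
officer_view = visible_view(customer, {"analyst", "pii-reader"})

print(analyst_view)  # {'name': 'Ada', 'segment': 'retail'}
print(sorted(officer_view))  # ['birth_date', 'name', 'segment', 'ssn']
```

The design point is that the same stored record serves both users; the protected elements are filtered at read time instead of being copied into a separate, scrubbed data set.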
AI allows firms to offer customized experiences for users as well as the ability to increase the efficiency of their operations and better understand their businesses. To gain these benefits and not have AI become a money pit that eats time, effort and resources, it is essential that your data-access technology allow the project to quickly and easily find and use the data it needs. In comparison to relational database and Hadoop technology, MarkLogic’s Enterprise NoSQL Database solution provides a better way for businesses to test, learn and succeed or fail fast when developing AI applications.
For additional insights on optimal data infrastructure requirements for AI, check out Gary Bloom’s article on What to Do Before You Start an AI Project.
David Kaaret has worked with major investment banks, mutual funds, and online brokerages for over 15 years in technical and sales roles.
He has helped clients design and build high-performance, cutting-edge database systems and provided guidance on issues including performance, optimal schema design, security, failover, messaging, and master data management.