How NoSQL Can Help Analytics in Life Sciences

I often get asked — how does MarkLogic help with analytics? Where does MarkLogic fit in the analytics landscape?

Before I answer that question – let’s look at what happens today.

The world of Artificial Intelligence (AI) and cognitive computing is all about extracting information and relationships from piles of data. The insights (or signals) are typically buried in very particular parts of the data that can be narrowed by the “features” of interest. A good example of this is a life sciences related Real World Evidence (RWE) study that Stanford performed for detect adverse events from EMR clinical data, where the Vioxx-MI association could have been detected three years prior to the drug’s recall.

Speed Up Filtering ‘Features’ of Interest

Adverse events are typically closely tied to drugs and the patients pre-existing conditions or co-morbidities. In this case, if I had 1 billion records, the brute force way of narrowing down the datasets in a Hadoop infrastructure would be to load everything and filter away the narrow “features” of interest such as Vioxx, cardiac disease, and death. Typically, this brute force method takes hours to sort through and filter the data before it gets to the machine learning aspect of detecting signals.

BUT what if you had all this data in a real-time, search-oriented database so you could narrow the set down by 10x or 100x to just the features of interest? You could shave that machine learning cycle to only minutes — even seconds.

What you need is a NoSQL database that has co-occurrence capabilities. MarkLogic allows you to find value pairs and runs such queries against any number of indexes and any type.

The picture below provides a nice summary of how MarkLogic can help narrow down the data sets to the features of interest using its real-time co-occurrence capabilities. In this example, we can see healthcare-related co-occurrences for diseases treatments symptoms from more than 2.6 million articles.

There is a second way MarkLogic can help too. And that’s to operationalize the analytics.

Now that we have generated the insights, what do we do with them? Per the diagram below, MarkLogic provides a search focused multi-model transactional operational data hub. This means we can store flexible schema agnostic content as documents and graphs and can provide access to the data via various indexing models such as full-token-search, key-values, row-column, documents, semantic graphs, and geospatial views. Typically, cognitive computing insights can be mapped nicely into semantic graphs. MarkLogic provides a very nice way to tie these insights to content in the database via embedded triples or via RDF inferencing. As new insights are generated, they become directly accessible by applications running on MarkLogic.

Back to the question: How does MarkLogic help with analytics? MarkLogic’s real-time content indexes can speed up signal detection 10x or 100x by providing the algorithms and just the data they need to generate the insights. Once the insights are created, MarkLogic can store the insights as RDF graphs that can then be used to build semantically smart real-time applications. And you can take action with full confidence that you are using all your data.

MarkLogic

Imran Chaudhri

At Progress MarkLogic, Imran focuses on enterprise quality genAI and NoSQL solutions for managing large diverse data integrations and analytics to the healthcare and life sciences enterprise. Imran co-founded Apixio with the vision of solving the clinical data overload problem and has been developing a HIPAA compliant clinical AI big data analytics platform. The AI platform used machine learning to identify what is truly wrong with the patient and whether best practices for treatment were being deployed. Apixio’s platform makes extensive use of cloud computing based NOSQL technologies such as Hadoop, Cassandra, and Solr. Previously, Imran co-founded Anka Systems and focused on the execution of EyeRoute's business development, product definition, engineering, and operations. EyeRoute was the world's first distributed big data ophthalmology image management system. Imran was also the IHE EyeCare Technical Committee Co-chair fostering interoperability standards. Before Anka Systems, Imran was a founder and CTO of FastTide, the worlds first operational performance based meta-content delivery network (CDN). Imran has an undergraduate degree in electrical engineering from McGill University, a Masters degree in the same field from Cornell University and over 30 years of experience in the industry.