Zero Friction Analytics

March 18, 2013 Data & AI, MarkLogic

For many Star Trek fans, the image of a crewmember interacting with the ship’s computer stands in stark contrast to how we interact with our computers today. While we have progressed by leaps and bounds in the field of human-to-computer interaction, including natural language processing, we are still a long way from the seamless Q&A that takes place between a crewmember and even an early-model Star Fleet ship computer – Watson notwithstanding. What is apparent in this (for now) fictional model of interaction is the complete lack of what we might call technological friction between the user (i.e. the crewmember) and the system (the ship’s computer).

Since joining MarkLogic, I have heard and been part of a number of discussions around the evolution of data processing; not in the euphemistic sense of referring to the IT field in general, but specifically around how data has been processed through the years. During these discussions I can’t help but think of the friction points that have materialized along the way. We often talk about three eras of data processing (and two specific inflection points), starting with the file-centric era dominated by IBM, during which flat and hierarchical models of data stored in VSAM, ISAM and similarly named file types were how bits were stored and analyzed. During this time, there was a close affinity between the written program (often in COBOL) and the data being processed. As we know, writing a program is a human-centric process, which takes time. And while the correlation between a program and its data set was not one-to-one, more often than not, bridging the gap between business data and actionable information meant a program had to be written by a human being.
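
To make that friction concrete, here is a rough sketch (in a modern language, purely for illustration) of the file-centric pattern: one purpose-built program, hard-wired to one flat-file layout, answering one business question. The record layout and file name are invented for this sketch.

```python
# A rough, modern-language sketch of the file-centric pattern: a purpose-built
# program, written against one flat-file layout, answering one business question.
# The fixed-width layout and file name below are invented for illustration.
RECORD_LAYOUT = [("account", 0, 8), ("region", 8, 12), ("balance", 12, 22)]

def parse(line):
    """Slice one fixed-width record into named fields."""
    return {name: line[start:end].strip() for name, start, end in RECORD_LAYOUT}

def total_balance_by_region(path):
    """One question, one program. A different question means another program."""
    totals = {}
    with open(path) as f:
        for line in f:
            rec = parse(line)
            totals[rec["region"]] = totals.get(rec["region"], 0.0) + float(rec["balance"])
    return totals

# Example usage (hypothetical file):
# print(total_balance_by_region("accounts.dat"))
```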

This paradigm worked well enough for a time. As we wrote programs to process more data (and answer more business questions), we also created more data, with more questions to ask. After all, we weren’t only writing query-centric programs (which on their own begat more questions); data entry was also being automated, and eventually the pace at which new data was created exceeded our ability to answer all of the questions being asked of it. This is when the time became ripe for the relational database era. We had hit our first data inflection point, where the friction created by the technology became a hindrance.

One of the key concepts of the relational era (and one still in practice today) was the centralized database. The goal was simple: store all of your data in “one place,” then allow ad-hoc access to this data via general-purpose analysis tools. The idea was to remove (or at least minimize) the program-to-query dependencies that were often in place, thus reducing the human-centric and time-consuming friction point of writing complex code. To accomplish this, however, a simpler method of interacting with the data was needed, and that is where SQL (Structured Query Language) came in.

SQL is a language with a foundation in a field of mathematics known as relational algebra. As long as data could be represented in a tabular format (i.e. rows and columns) and exposed certain properties of identity (i.e. primary keys) and relationships (i.e. foreign keys), it could be analyzed in a standard way, regardless of the domain to which the data belonged. And since SQL is a declarative language (i.e. it focuses on what is being asked as opposed to how to answer the question), the effort required to ask a question of a data set is almost always significantly less than the effort required to write a special-purpose program. Because of this, the domain of analytics became democratized and spawned a whole new field of business intelligence, which still flourishes today. We now have a myriad of data analytics tools and techniques, allowing us to slice and dice data in a multitude of ways, and, more often than not, a programmer (in the more traditional sense of the word) is not involved in the day-to-day activities.
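
As a small illustration of that difference in effort, the sketch below uses Python’s built-in sqlite3 module. The tables, keys and the business question are all invented, but the point stands: the question itself is a few lines of declarative SQL rather than a hand-written traversal program.

```python
import sqlite3

# Illustrative only: table names, columns and data are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key (relationship)
        total REAL
    );
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (100, 1, 250.0), (101, 1, 75.0), (102, 2, 40.0);
""")

# The business question ("what did each region spend?") is expressed as what we
# want, not how to compute it -- no purpose-built program required.
for region, spend in conn.execute("""
    SELECT c.region, SUM(o.total)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
"""):
    print(region, spend)
```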

All of this ad-hoc reporting capability, however, came at something of a price. The data, after all, did not magically form itself neatly into rows and columns, nor did it scream out “here is my primary key” without another form of human intervention known as modeling. As many of us understand today, during the shift from the file-based era to the relational DB era, a good portion of the intellectual capital expenditure of IT personnel was redirected away from one-off report-writing and toward the tasks of data modeling and, as needed, data transformation. And while these activities consumed time in their own right and were prerequisites to technology implementation in the relational era, the trade-off between writing a program for each set of queries and modeling (and transforming) “every now and then” was well worth it.
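
A minimal sketch of that up-front work might look like the following. Every name and field here is invented, but it shows the pattern: decide the schema first, then transform each incoming record to fit it, before a single query can be run.

```python
# Illustrative sketch of the up-front work the relational era requires:
# a schema must be decided first, then every incoming record is transformed
# to fit it before any query can be run. All names below are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trades (
        trade_id   TEXT PRIMARY KEY,   -- modeling decision: what identifies a row
        symbol     TEXT NOT NULL,
        quantity   INTEGER,
        trade_date TEXT
    )
""")

raw_feed = [  # what a hypothetical upstream system actually emits
    {"id": "T-1", "instrument": {"ticker": "ACME"},   "qty": "500", "ts": "2013-03-18T09:30:00"},
    {"id": "T-2", "instrument": {"ticker": "GLOBEX"}, "qty": "120", "ts": "2013-03-18T09:31:00"},
]

def transform(rec):
    """Flatten and coerce a raw record into the shape the schema demands."""
    return (rec["id"], rec["instrument"]["ticker"], int(rec["qty"]), rec["ts"][:10])

conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", map(transform, raw_feed))
print(conn.execute("SELECT symbol, SUM(quantity) FROM trades GROUP BY symbol").fetchall())
```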

That is until today.

We have again hit an inflection point, where the pace of data creation has far surpassed our ability to process it using only relational tools. As in the file-based era, we were not idle during the relational era when it came to creating new data. Accelerated greatly by the World Wide Web, we are now deluged by a myriad of applications, not only on our desktops but also on our phones and anywhere the web and/or technology has reach (cars, homes, etc.). And while many of these web-enabled applications are powered by relational technologies, the data they emit (in the form of tweets, blog posts, Facebook “likes”, etc.) is inherently unstructured.

Today, the prerequisite exercises of creating and implementing a relational model are the new friction points of data processing, just as the need to write purpose-built programs was in years past. We no longer have the time to engage in a holistic modeling exercise for all of our data before we analyze it. Moreover, a huge chunk of our data doesn’t benefit much from modeling at all. Textual data is the obvious example; it is why text search is such a valuable technology and why Google is so ubiquitous. And of course all of this is compounded by the sheer volume of data that we have. We have never in our history been so deluged. We hear so much about Big Data and NoSQL technologies today that discussion of the need to address the 3 V’s of data – volume, variety and velocity – has become commonplace.

We are thus turning to technology once again, this time to assist us with another set of unsustainable human-centric tasks standing between our data and the actionable business intelligence we wish to derive from it. We are essentially asking technology to remove the barrier of modeling as a prerequisite to analytics. It’s not that the model isn’t important; it’s that it isn’t always apparent. We are now in a place where we can do ad-hoc modeling (aka schema-on-read) when the models (plural) only become apparent after iteratively churning through the data (sometimes referred to as extracting signal from the noise). Better yet, we can now engage in model discovery (e.g. machine learning), where the technology itself uncovers relationships that are otherwise not apparent. And things are moving beyond that as well, with advances in semantics (i.e. inferring meaning and context from data) progressing into the mainstream. In short, we are leveraging technology to remove friction points yet again.
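
A minimal schema-on-read sketch, with entirely invented documents and fields, might look like this: the documents are stored exactly as they arrive, and the “model” is applied only at the moment a question is asked.

```python
# A minimal schema-on-read sketch: documents are kept exactly as they arrive,
# and the "model" is applied only at query time. All field names are invented.
import json

documents = [
    '{"type": "tweet", "user": "alice", "text": "Loving the new release", "retweets": 12}',
    '{"type": "blog",  "author": "bob", "title": "Zero friction", "body": "..."}',
    '{"type": "tweet", "user": "carol", "text": "Big Data everywhere", "retweets": 3}',
]

# No up-front schema: we decide what a "record" looks like only when we ask
# the question, tolerating documents that do not fit the shape.
def mentions(term):
    for raw in documents:
        doc = json.loads(raw)
        text = doc.get("text") or doc.get("body") or ""
        if term.lower() in text.lower():
            yield doc

for doc in mentions("big data"):
    print(doc.get("user") or doc.get("author"), "-", doc["type"])
```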

Thus we arrive at the third era of data processing, and it’s very big and quite unstructured. All in all, it’s an exciting time to be in technology, specifically in the field of data processing, which is why I’m happy to be working at MarkLogic. We have some compelling technology that addresses the very real data friction problems of today. We have removed the distinction between database querying and text search by combining the two into one technology (absent this, you would own either an analytical friction point or an integration one). Our schema-agnostic capabilities defer the modeling barriers until the time you want and/or need to address them. We do all of this without sacrificing the things you’ve grown accustomed to and need, such as SQL and ACID transactions. And we’re not stopping there: we are not only improving on what we have today but continuing to add capabilities, incorporating more forward-thinking features such as machine learning and semantic search. And while it’s true that we have some friction points of our own (as with any technology), we are a driven group, always seeking to remove the obstacles between data and actionable information.
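
To illustrate the idea (and only the idea) of combining structured querying with text search in a single operation, here is a toy sketch. It is not MarkLogic’s API; every name in it is invented, and the point is simply that the field constraints and the word constraints are evaluated together rather than stitched across two systems.

```python
# A toy illustration of combining structured ("database") constraints with
# full-text constraints in a single query, instead of bolting a separate search
# engine onto a database. This is NOT MarkLogic's API; all names are invented.
docs = [
    {"doctype": "filing", "year": 2012, "body": "Quarterly results exceeded guidance."},
    {"doctype": "filing", "year": 2013, "body": "Guidance revised after currency headwinds."},
    {"doctype": "memo",   "year": 2013, "body": "Results meeting moved to Friday."},
]

def search(structured, words):
    """Evaluate field constraints and word constraints together, in one pass."""
    words = [w.lower() for w in words]
    for d in docs:
        if all(d.get(k) == v for k, v in structured.items()) and \
           all(w in d["body"].lower() for w in words):
            yield d

# "Give me 2013 filings that mention guidance" -- one query, no separate
# search-engine integration step.
for hit in search({"doctype": "filing", "year": 2013}, ["guidance"]):
    print(hit["body"])
```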

Whether you’re a bank, a publisher, a government agency, or Star Fleet Academy, data is driving your decisions like never before.

Ken Krupa