Enterprise Big Data: It’s Not About Size

May 02, 2013 Data & AI, MarkLogic

Data is at the center of most challenges facing our industry today, with business drivers such as new regulations, aggregated risk management, and deep customer insight all having critical data management implications. The term Big Data has become a common way to describe this, and while some of these challenges are associated with large volumes, it isn’t really the size of the data that’s at issue. I’d argue that at this point we know how to handle large volumes: use shared-nothing architectures that scale horizontally on commodity hardware. The trickier problem has to do with a different “V” of Big Data – variety – and it is that aspect that I’d like to focus on.

There are countless examples of business value locked up in data that does not fit neatly into rows and columns. The most frequently cited is social media, with its ability to offer deep customer insight and sentiment analysis. And there are many others within the company’s firewall as well: gleaning information from on-boarding documents for FATCA and AML compliance, getting a better handle on credit risk by analyzing ISDA agreements, lowering cost per trade by consolidating the processing of diverse asset classes with varied and complex structures, and so on.

How can we effectively handle all this information, which is either hidden in free-form text, or scattered across incompatible schemas? Hierarchical structures such as XML and JSON certainly come to mind, as they can accommodate various degrees of structure, organized in a way that mirrors intuitive human perception. Indeed, many organizations have been using XML to handle these business challenges and have reaped some benefits, but found themselves constrained by the underlying RDBMS platforms that actually managed the data.

The problem with the typical approach to handling hierarchical information is that data is “shredded” into tables: a customer, a derivative trade, or a legal document, with all its hierarchical attributes, is shoehorned into an ER model that satisfies referential integrity. Don’t get me wrong – I love relational modeling and I have spent years doing it, but 3rd Normal Form has its limitations when it comes to diverse data. Just consider the typical first step when analyzing normalized data: de-normalize it!
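To make the shredding round trip concrete, here is a minimal Python sketch. The trade document, its field names, and the `shred` and `denormalize` helpers are all hypothetical illustrations, not taken from any particular product: one hierarchical document is split into flat, table-like row lists, and then has to be joined back together before anyone can analyze it as a whole.

```python
# Hypothetical trade document; all field names are illustrative assumptions.
trade = {
    "trade_id": "T1",
    "counterparty": {"name": "Acme Bank", "country": "US"},
    "legs": [
        {"leg_id": 1, "type": "fixed", "rate": 0.025},
        {"leg_id": 2, "type": "floating", "index": "LIBOR-3M"},
    ],
}


def shred(doc):
    """Split one hierarchical document into flat, table-like lists of rows."""
    trade_id = doc["trade_id"]
    return {
        "trades": [{"trade_id": trade_id}],
        # Child rows carry the parent key, like a foreign key column.
        "counterparties": [{"trade_id": trade_id, **doc["counterparty"]}],
        "legs": [{"trade_id": trade_id, **leg} for leg in doc["legs"]],
    }


def without_key(row, key="trade_id"):
    """Drop the join key from a row when nesting it back under its parent."""
    return {k: v for k, v in row.items() if k != key}


def denormalize(tables, trade_id):
    """Reassemble the document by joining the row lists on trade_id."""
    counterparty = next(
        row for row in tables["counterparties"] if row["trade_id"] == trade_id
    )
    legs = [row for row in tables["legs"] if row["trade_id"] == trade_id]
    return {
        "trade_id": trade_id,
        "counterparty": without_key(counterparty),
        "legs": [without_key(leg) for leg in legs],
    }


tables = shred(trade)
rebuilt = denormalize(tables, "T1")
```

Note that the two legs do not even share the same attributes (`rate` versus `index`), which in a real relational schema would mean nullable columns or yet more tables. That is the kind of variety the hierarchical form absorbs without a redesign.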

There is an alternative to shredding though, in the form of NoSQL – a wide set of technologies that transcend the boundaries of relational schemas. The name is somewhat unfortunate since SQL is actually one of the best features associated with an RDBMS (some call it the most successful Domain Specific Language). The problem with RDBMSs is not SQL but the prerequisite of a schema definition for data ingestion and analysis, which hinders business agility. We’ve all seen cases where business needs were delayed while data models, transformations and analytical schemas were being developed. NoSQL databases free us from the reins of the schema to enable real business agility.

However, one factor has prevented a wide adoption of NoSQL technologies within the enterprise: the BASE architectural principle underlying most of them. It stands for Basically Available, Soft state, Eventual consistency – a play on ACID transactions (Atomicity, Consistency, Isolation, Durability), which are associated with relational databases. BASE has several advantages when it comes to non-transactional systems, as it relaxes consistency to allow the system to process requests even in an inconsistent state. Social media sites are a perfect example – no one would mind if their Facebook status or latest tweet were inconsistent within their social network for a short period of time; it’s much more important to get an immediate response than to have a consistent state of users’ information.
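The stale read that BASE tolerates can be shown with a toy model. The sketch below is an assumption-laden illustration in Python, not how any real NoSQL store is built: the class and method names are made up, there are only two replicas, and replication is triggered by hand rather than running in the background.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy two-replica store: writes hit the primary and replicate later."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = deque()  # changes not yet applied to the secondary

    def write(self, key, value):
        # The write is acknowledged immediately, before replication happens.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_secondary(self, key):
        # May return stale data (or nothing) until replicate() has run.
        return self.secondary.get(key)

    def replicate(self):
        # Stand-in for asynchronous background replication.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value


store = EventuallyConsistentStore()
store.write("status", "at the beach")
stale = store.read_from_secondary("status")  # replication has not run yet
store.replicate()
fresh = store.read_from_secondary("status")  # replicas have now converged
```

The window between `write` and `replicate` is harmless for a status update. It is exactly the window in which a post-trade system could read out-of-date reference data.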

Financial and other enterprise systems are a different matter though. Imagine, for instance, a corporate merger action occurring at the same time a firm is trading the affected instrument: the post-trade processing systems would certainly have to be consistent with the Reference Data system, or costly exceptions would ensue.

So how do we avoid schema woes without giving up ACID transactions, as well as other enterprise qualities such as fine-grained entitlements, point-in-time recovery, and high availability, all of which we’ve come to expect from mission-critical systems?

The answer lies within a different category of technology called Enterprise NoSQL, which has been designed and built with transactions and enterprise features from the ground up, just like relational databases. But unlike an RDBMS, an Enterprise NoSQL database models the data as hierarchical trees rather than rows and columns. These trees are aggressively indexed in-memory as soon as the data is ingested, and the indexes are then used for both element retrieval and full text search, unifying two concepts that have traditionally been separate – the database and the search engine.
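As a rough illustration of how one ingestion pass over a tree can serve both kinds of query, here is a small Python sketch. It is a simplification under stated assumptions, not a description of how any Enterprise NoSQL product actually builds its indexes: the `TreeIndex` class, the path syntax, and the sample documents are all hypothetical.

```python
from collections import defaultdict


def walk(node, path=""):
    """Yield (path, leaf_value) pairs for every leaf in a nested dict/list."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node


class TreeIndex:
    """Indexes each document at ingestion time, with no schema declared."""

    def __init__(self):
        self.by_path_value = defaultdict(set)  # (path, value) -> doc ids
        self.by_word = defaultdict(set)        # lowercase word -> doc ids

    def ingest(self, doc_id, doc):
        for path, value in walk(doc):
            self.by_path_value[(path, value)].add(doc_id)
            if isinstance(value, str):
                for word in value.lower().split():
                    self.by_word[word].add(doc_id)

    def element_query(self, path, value):
        """Database-style lookup: documents where this element has this value."""
        return set(self.by_path_value.get((path, value), set()))

    def text_search(self, word):
        """Search-engine-style lookup: documents containing this word anywhere."""
        return set(self.by_word.get(word.lower(), set()))


index = TreeIndex()
index.ingest("d1", {"agreement": {"type": "ISDA", "party": "Acme Bank",
                                  "clause": ["early termination on default"]}})
index.ingest("d2", {"customer": {"name": "Acme Corp",
                                 "notes": "onboarding documents pending"}})

isda_docs = index.element_query("/agreement/type", "ISDA")
acme_docs = index.text_search("acme")
```

The two documents have entirely different shapes, yet both are queryable the moment they are ingested: `isda_docs` is answered from the element index and `acme_docs` from the word index, both built in the same pass.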

An Enterprise NoSQL database also offers full SQL access, thus combining the benefits of both worlds – the business agility associated with NoSQL and search, and the data integrity and sophisticated querying associated with a traditional RDBMS.

In the next installment of this blog I will explore the mechanisms by which this is achieved.

Amir Halfon