Make your Big Data small: why you need standards-based connectivity for your Big Data and NoSQL data sources
In the past five years we've seen an explosion of data. According to Bernard Marr in his recent Forbes article "Big Data: 20 Mind-Boggling Facts Everyone Must Read":
- Data volumes are exploding; more data has been created in the past two years than in the entire previous history of the human race.
- Data is growing faster than ever before; by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet.
- By then, our accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes.
- Every second we create new data. For example, we perform 40,000 search queries every second (on Google alone), which adds up to 3.5 billion searches per day and 1.2 trillion searches per year.
During these same five years, the cost of storing data dropped tremendously. In the past, older data had to be archived or eventually deleted. Now it makes economic sense to capture more data and keep it available for analysis and business intelligence. Since no one knows what data will eventually be needed, it has become easier to simply keep it all.
Let's look at the data social media generates as an example. Every minute in 2014:
- Twitter users tweeted 300,000 times
- Instagram users posted 270,000 new photos
- YouTube users uploaded 72 hours of video content
- Facebook users shared 2.5 million pieces of content
Clearly, a lot of this data is not high value at first glance. The challenge of making Big Data manageable, or "small," is not storing the data; it is extracting granular information and insights that support better business decisions.
RDBMS Were Not Designed for Big Data
Relational database management systems (RDBMS) were not built to store huge amounts of "semi-structured" data. To meet these new requirements, new databases were created with better horizontal scaling and features like "schema on read," which lets companies store data easily without imposing a data model until later, when the data is read.
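To make the idea concrete, here is a minimal schema-on-read sketch in Java using the Jackson JSON library; the event fields and values are invented for illustration. Data is captured as raw text, and a structure is imposed only when it is read:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.List;

public class SchemaOnReadExample {
    public static void main(String[] args) throws Exception {
        // "Write side": events are stored as raw text, no schema declared up front.
        List<String> rawEvents = List.of(
            "{\"user\":\"alice\",\"action\":\"search\",\"query\":\"big data\"}",
            "{\"user\":\"bob\",\"action\":\"upload\",\"sizeMb\":120}");

        // "Read side": the schema, meaning the fields we care about, is imposed
        // now, at query time, not when the data was stored.
        ObjectMapper mapper = new ObjectMapper();
        for (String raw : rawEvents) {
            JsonNode event = mapper.readTree(raw);
            System.out.println(event.path("user").asText() + " -> "
                    + event.path("action").asText());
        }
    }
}
```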
These new databases presented problems for mainstream companies that needed to get critical business information and value out of this data. As with many new technologies, early adopters were willing to accept highly technical, less user-friendly interfaces. Many of these new databases had proprietary interfaces, very different from the standard interfaces business users were familiar with. For example, when NoSQL databases first came out, NoSQL meant exactly what it said: no SQL access to these databases. Only later was the term reinterpreted as "Not Only SQL."
Easy Access to Big Data and NoSQL Data Sources
To get the full value of Big Data, companies originally needed data scientists: people skilled in statistics, programming (to write algorithms), and business. Traditional business analysts at mainstream companies didn't have this skill set, and it was obvious early on that there was business demand for standard SQL access to Big Data and NoSQL. Hive provided SQL access to Hadoop, but its batch-oriented execution was too slow for interactive queries. When Cloudera announced Impala at Hadoop World 2013 in NYC, user interest was so great that the fire marshals wouldn't let more people into the large conference room. More forms of SQL access to Big Data sources came later, such as HAWQ and Spark SQL.
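As a quick illustration of what standard SQL access to Hadoop looks like in practice, here is a minimal sketch using the open-source Apache Hive JDBC driver; the host name, credentials, and clickstream table are placeholder assumptions, not a specific deployment:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSqlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 typically listens on port 10000; "default" is the database.
        // The Hive JDBC driver jar must be on the classpath.
        String url = "jdbc:hive2://hadoop-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // Ordinary SQL, even though the data lives in Hadoop.
             ResultSet rs = stmt.executeQuery(
                 "SELECT action, COUNT(*) AS events FROM clickstream GROUP BY action")) {
            while (rs.next()) {
                System.out.println(rs.getString("action") + ": " + rs.getLong("events"));
            }
        }
    }
}
```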
Standardized SQL Access Makes Data Integration Possible
There are major benefits to a familiar, standardized interface to these new Big Data/NoSQL data sources. It lets business analysts and power users keep working with the tools and applications they have used for years with relational databases. Tools like Cognos, Microsoft Excel, QlikView, SAP Crystal Reports, SAS and Tableau can provide huge value when processing Big Data and NoSQL data sources. Anything that makes these new data sources look and behave more like relational databases makes it easier for users to get business value out of them. For example, normalizing the complex document structures of MongoDB into parent-child table relationships makes MongoDB easy to use with tools that were originally built for relational databases, as the sketch below illustrates.
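As a sketch of that normalization, suppose a hypothetical SQL-over-MongoDB JDBC driver exposes a nested document as a parent table and a child table. The JDBC URL, table names, and sample document below are all illustrative assumptions, not a specific product's API:

```java
// Source document in the "orders" collection:
//   { "_id": 1, "customer": "Acme",
//     "items": [ { "sku": "A-1", "qty": 2 }, { "sku": "B-7", "qty": 1 } ] }
// Normalized as a parent table ORDERS(_id, customer) and a child table
// ORDERS_ITEMS(orders_id, sku, qty) keyed back to the parent.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MongoAsTablesExample {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL; the actual form depends on the driver in use.
        String url = "jdbc:example:mongodb://mongo-host:27017;DatabaseName=sales";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             // A plain relational join over what is really one nested document.
             ResultSet rs = stmt.executeQuery(
                 "SELECT o.customer, i.sku, i.qty FROM orders o "
                 + "JOIN orders_items i ON i.orders_id = o._id")) {
            while (rs.next()) {
                System.out.println(rs.getString("customer") + ": "
                        + rs.getString("sku") + " x" + rs.getInt("qty"));
            }
        }
    }
}
```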
At the same time, there is no value in a familiar, standardized interface to Big Data if performance suffers; this was the problem with Hive and its batch-oriented approach. Today, performance advantages come from pushing queries down to the underlying database and taking advantage of its native capabilities whenever possible. Coding to the "wire protocol" of each database allows performance and usability enhancements that are simply not possible with the client libraries the database provides. That is how, for example, a single wire protocol driver can connect to multiple distributions of Hadoop and still outperform drivers built on client libraries.
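The payoff of pushdown is visible even from application code. In this sketch, which reuses the placeholder Hive endpoint and clickstream table from above, writing the filter in SQL lets the driver and engine push it to the data source, whereas fetching everything and filtering client-side would drag the whole table across the network:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PushdownExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hadoop-host:10000/default"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // Pushed down: the WHERE clause travels with the query, so only
             // matching rows ever leave the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT user_id FROM clickstream WHERE action = 'search'")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id"));
            }
            // Anti-pattern (not shown): SELECT * and filtering rows in this
            // loop instead, which forces the full table across the network
            // before most of it is discarded.
        }
    }
}
```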
Software Vendors Must Adapt to Market Changes
Software vendors generally accept that the value of their product increases with the number of Big Data and NoSQL data sources it supports out of the box. In the past, some vendors preferred to let prospects and customers download and configure drivers to reach external data sources. This, too, is changing. Many vendors recognize a trend where their products are less likely to be installed and configured by corporate IT and more likely by the end user. These end users are less technical and more likely to struggle with finding and configuring a driver for Big Data and NoSQL data sources. Embedding drivers in the product makes installation and configuration much easier, which leads to shorter and more successful proofs of concept with prospective customers.
Software vendors are also well aware that this explosion of new data sources will not last forever. There will be winners and losers in the market, and the number of new databases will eventually stabilize. New databases that don't attract a critical mass of customers and developers will lose support and disappear; we are already seeing this start to happen. Most software vendors want to focus on their core competencies rather than try to pick winners and losers in the Big Data world. They see value in letting a best-of-breed data connectivity company become a "driver factory" for them.
A Data Connectivity Solution to Make Your Big Data Small
When they first came out, Big Data and NoSQL data sources had unfamiliar, proprietary interfaces and were difficult to integrate with the rest of enterprise data. Over the past couple of years this has changed, and it is now much easier to access this new data using familiar, standards-based connectivity. Big Data is much more manageable and can (almost) seem "small."
Progress® DataDirect® offers connectivity and integration solutions with a suite of more than 45 drivers that let you connect all your data wherever it is, even from sources like:
- RDBMS (Oracle, DB2, Microsoft SQL Server)
- NoSQL (MongoDB, Cassandra)
- SaaS (Salesforce, Oracle Service Cloud)
- Hadoop (Apache Hive, Hortonworks)
Get Small for Free
You can find free trials of all our drivers on our website. For more information, or if you have questions, contact us or leave a comment below.
Dennis Bennett
Dennis Bennett is a Principal Systems Engineer with Progress, working in Data Connectivity and Integration.