Data & AI

Bulk Data Movement to SQL Databases in Hybrid Cloud

by Saikrishna Teja Bobba Posted on March 13, 2017

Saikrishna recently experimented with bulk exporting huge datasets from Hadoop to SQL Server and measured the performance of various scenarios. Take a look at the results.

With the rise of big data in the cloud and demand for open analytics, we’re seeing bulk data movement as a common data integration pattern.

Big data platforms are migrating to the cloud (such as Amazon EMR, IBM BigInsights on Cloud, Microsoft Azure HDInsight and SAP Altiscale) to better handle elastic workloads that consume expanding cloud-resident data sets and streams. Advanced analytics techniques are then used to crunch these data sets into valuable insights.

But how can these insights be exported to enterprise on-premises business analytics platforms for distribution, such as Microsoft Power BI or Oracle Analytics?

This article looks into the performance of moving big data insights from the cloud to on-premises data warehouses or marts using a cloud-agnostic hybrid data pipeline.

The following are the observations recorded while performing Bulk Export of data to SQL Server using Apache Sqoop in various scenarios using Progress DataDirect Hybrid Connectivity solutions.

Environment Setup

This scenario was set up to check the Bulk Export capabilities to SQL Server using DataDirect Cloud. Here I have a Hadoop Single Node that was installed in Pseudo distributed mode on AWS EC2, where I had Sqoop and Hive installed alongside.

Behind the firewall, we have SQL Server 2016 installed on an on-premises Progress vCloud. A huge dataset was exported from Hadoop to SQL Server using Sqoop and the JDBC client from the DataDirect Cloud service.

Bulk Export SQL Server

Test Methodology

Dataset Used:

New York City Taxi Data 2016

Dataset Size:

11 million rows / 1.5914 GB
22 million rows / 3.1827 GB
44 million rows / 6.3655 GB

For each scenario and dataset size, I exported the data from Hadoop to SQL Server using the drivers’ Bulk Load feature being enabled and disabled on separate tests.

The goal of this test is to establish preliminary throughput numbers for hybrid cloud/ground environments with minimal tuning of batch sizes at the JDBC driver level.

To see more details on properties related to tuning, see BCP Performance on Sqoop EXPORT to SQL Server from Hadoop with additional details using our on-premises JDBC connector.

The recovery model for the database on SQL Server is set to BULK_LOGGING, which is the only change on the SQL Server end.

Results of Bulk Export from Hadoop to SQL Server

11 Million Rows

Bulk Load Enabled:

Bulk Load 11 Million Rows

Bulk Load Disabled:
Time taken to complete the load: 3,142.6827 seconds

Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 4,000: 164%
Maximum performance gain observed at batch size 500: 188%

22 Million Rows

Bulk Load Enabled:

Bulk Load 22 Million Rows

Bulk Load Disabled:
Time taken to complete the load: 6,980.4868 seconds

Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 32,000: 233%
Maximum performance gain observed at batch size 6,4000: 236%

44 Million Rows

Bulk Load Enabled:

Bulk Load 44 Million Rows

Bulk Load Disabled:
Time taken to complete the load: 102,208.5821 seconds

Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 16,000: 1,908%
Maximum performance gain observed at batch size 4,000: 2,366%

Sqoop Scripts Used in Experiment

sqoop export --connect

'jdbc:datadirect:ddcloud:database=SQLNonBulk;user=<user>;password=<password>; TransactionMode=ignore' --driver com.ddtek.jdbc.ddcloud.DDCloudDriver --table 'NYCTAXIDATA' --export-dir /user/hive/warehouse/datadirect.db/nyctaxidata --input-lines-terminated-by "\n" --input-fields-terminated-by ',' --batch -m 10

Note: The batch sizes in hybrid connectivity scenarios are configured in the DataDirect Cloud service, rather than the JDBC client.

Conclusion

The performance gains demonstrated in this experiment, up to 2,300%, prove that cloud-based Hadoop platforms make it possible to integrate data insights with enterprise analytics platforms for inclusion in intra-day reporting and analytics.There were discrepancies in throughput due in part to using non-production infrastructure, but the bulk export performance is consistently improved with DataDirect hybrid bulk export capabilities.

While the experiment is with a SQL Server database, similar bulk movement facilities are available with other popular DataDirect hybrid data sources such as Oracle and Salesforce.

As always with any data connectivity performance testing, Progress recommends testing in a production-like environment.

Try the experiment against your big data platform.

Saikrishna Teja Bobba

Saikrishna is a DataDirect Developer Evangelist at Progress. Prior to working at Progress, he worked as Software Engineer for 3 years after getting his undergraduate degree, and recently graduated from NC State University with Masters in Computer Science. His interests are in the areas of Data Connectivity, SaaS and Mobile App Development.

Related Tags

bulk data Cloud data connectivity SQL databases sqoop

Progress DataDirect Now Connects to Denodo

Progress DataDirect has added Denodo, a data virtualization software platform, to its catalog of connectors.

Data & AI DataDirect

Todd Wright April 07, 2023

Powerful Data Connectors Unlock Valuable Data within Progress OpenEdge

Only Progress DataDirect offers direct access to Progress OpenEdge data.

Application Development Data & AI DataDirect OpenEdge

Jessica Malakian January 31, 2023

Delivering Relevant Notifications When Monitoring Complex Systems and Applications

Corticon.js helps deliver relevant notifications in complex systems and applications monitoring.

Digital Experience

Thierry Ciot January 12, 2023

MarkLogic

Semaphore

OpenEdge

DataDirect

Sitefinity

Telerik

Kendo UI

Corticon

DataDirect

MOVEit

Chef

Flowmon

Kemp LoadMaster

WhatsUp Gold