Saikrishna recently experimented with bulk exporting huge datasets from Hadoop to SQL Server and measured the performance of various scenarios. Take a look at the results.
With the rise of big data in the cloud and demand for open analytics, we’re seeing bulk data movement as a common data integration pattern.
Big data platforms are migrating to the cloud (such as Amazon EMR, IBM BigInsights on Cloud, Microsoft Azure HDInsight and SAP Altiscale) to better handle elastic workloads that consume expanding cloud-resident data sets and streams. Advanced analytics techniques are then used to crunch these data sets into valuable insights.
But how can these insights be exported to enterprise on-premises business analytics platforms for distribution, such as Microsoft Power BI or Oracle Analytics?
This article looks into the performance of moving big data insights from the cloud to on-premises data warehouses or marts using a cloud-agnostic hybrid data pipeline.
The following are the observations recorded while performing Bulk Export of data to SQL Server using Apache Sqoop in various scenarios using Progress DataDirect Hybrid Connectivity solutions.
Environment Setup
This scenario was set up to check the Bulk Export capabilities to SQL Server using DataDirect Cloud. Here I have a Hadoop Single Node that was installed in Pseudo distributed mode on AWS EC2, where I had Sqoop and Hive installed alongside.
Behind the firewall, we have SQL Server 2016 installed on an on-premises Progress vCloud. A huge dataset was exported from Hadoop to SQL Server using Sqoop and the JDBC client from the DataDirect Cloud service.
Test Methodology
Dataset Used:
New York City Taxi Data 2016
Dataset Size:
- 11 million rows / 1.5914 GB
- 22 million rows / 3.1827 GB
- 44 million rows / 6.3655 GB
The goal of this test is to establish preliminary throughput numbers for hybrid cloud/ground environments with minimal tuning of batch sizes at the JDBC driver level.
To see more details on properties related to tuning, see BCP Performance on Sqoop EXPORT to SQL Server from Hadoop with additional details using our on-premises JDBC connector.
The recovery model for the database on SQL Server is set to BULK_LOGGING, which is the only change on the SQL Server end.
Results of Bulk Export from Hadoop to SQL Server
11 Million Rows
Bulk Load Enabled:Bulk Load Disabled:
Time taken to complete the load: 3,142.6827 seconds
Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 4,000: 164%
Maximum performance gain observed at batch size 500: 188%
22 Million Rows
Bulk Load Enabled:
Bulk Load Disabled:
Time taken to complete the load: 6,980.4868 seconds
Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 32,000: 233%
Maximum performance gain observed at batch size 6,4000: 236%
44 Million Rows
Bulk Load Enabled:Bulk Load Disabled:
Time taken to complete the load: 102,208.5821 seconds
Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 16,000: 1,908%
Maximum performance gain observed at batch size 4,000: 2,366%
Sqoop Scripts Used in Experiment
sqoop export --connect
'jdbc:datadirect:ddcloud:database=SQLNonBulk;user=<user>;password=<password>; TransactionMode=ignore' --driver com.ddtek.jdbc.ddcloud.DDCloudDriver --table 'NYCTAXIDATA' --export-dir /user/hive/warehouse/datadirect.db/nyctaxidata --input-lines-terminated-by "\n" --input-fields-terminated-by ',' --batch -m 10
Note: The batch sizes in hybrid connectivity scenarios are configured in the DataDirect Cloud service, rather than the JDBC client.
Conclusion
The performance gains demonstrated in this experiment, up to 2,300%, prove that cloud-based Hadoop platforms make it possible to integrate data insights with enterprise analytics platforms for inclusion in intra-day reporting and analytics.There were discrepancies in throughput due in part to using non-production infrastructure, but the bulk export performance is consistently improved with DataDirect hybrid bulk export capabilities.
While the experiment is with a SQL Server database, similar bulk movement facilities are available with other popular DataDirect hybrid data sources such as Oracle and Salesforce.
As always with any data connectivity performance testing, Progress recommends testing in a production-like environment.
Try the experiment against your big data platform.
Saikrishna Teja Bobba
Saikrishna is a DataDirect Developer Evangelist at Progress. Prior to working at Progress, he worked as Software Engineer for 3 years after getting his undergraduate degree, and recently graduated from NC State University with Masters in Computer Science. His interests are in the areas of Data Connectivity, SaaS and Mobile App Development.