Saikrishna recently experimented with bulk exporting huge datasets from Hadoop to SQL Server and measured the performance of various scenarios. Take a look at the results.
With the rise of big data in the cloud and demand for open analytics, we’re seeing bulk data movement as a common data integration pattern.
Big data platforms are migrating to the cloud (such as Amazon EMR, IBM BigInsights on Cloud, Microsoft Azure HDInsight and SAP Altiscale) to better handle elastic workloads that consume expanding cloud-resident data sets and streams. Advanced analytics techniques are then used to crunch these data sets into valuable insights.
But how can these insights be exported to enterprise on-premises business analytics platforms for distribution, such as Microsoft Power BI or Oracle Analytics?
This article looks into the performance of moving big data insights from the cloud to on-premises data warehouses or marts using a cloud-agnostic hybrid data pipeline.
The following are the observations recorded while performing Bulk Export of data to SQL Server using Apache Sqoop in various scenarios using Progress DataDirect Hybrid Connectivity solutions.
This scenario was set up to check the Bulk Export capabilities to SQL Server using DataDirect Cloud. Here I have a Hadoop Single Node that was installed in Pseudo distributed mode on AWS EC2, where I had Sqoop and Hive installed alongside.
Behind the firewall, we have SQL Server 2016 installed on an on-premises Progress vCloud. A huge dataset was exported from Hadoop to SQL Server using Sqoop and the JDBC client from the DataDirect Cloud service.
Dataset Used:
New York City Taxi Data 2016
Dataset Size:
sqoop export --connect
'jdbc:datadirect:ddcloud:database=SQLNonBulk;user=<user>;password=<password>; TransactionMode=ignore' --driver com.ddtek.jdbc.ddcloud.DDCloudDriver --table 'NYCTAXIDATA' --export-dir /user/hive/warehouse/datadirect.db/nyctaxidata --input-lines-terminated-by "\n" --input-fields-terminated-by ',' --batch -m 10
Note: The batch sizes in hybrid connectivity scenarios are configured in the DataDirect Cloud service, rather than the JDBC client.
The performance gains demonstrated in this experiment, up to 2,300%, prove that cloud-based Hadoop platforms make it possible to integrate data insights with enterprise analytics platforms for inclusion in intra-day reporting and analytics.There were discrepancies in throughput due in part to using non-production infrastructure, but the bulk export performance is consistently improved with DataDirect hybrid bulk export capabilities.
While the experiment is with a SQL Server database, similar bulk movement facilities are available with other popular DataDirect hybrid data sources such as Oracle and Salesforce.
As always with any data connectivity performance testing, Progress recommends testing in a production-like environment.
Try the experiment against your big data platform.
Subscribe to get all the news, info and tutorials you need to build better business apps and sites