Sumit Sarkar’s tutorial on exporting data out of Hadoop lists steps for boosting throughput using the DataDirect SQL Server JDBC driver and Apache Sqoop.
We've gotten everyone connected to SQL Server using Progress DataDirect's exclusive support for both NTLM and Kerberos authentication from Linux with Sqoop. Now, we plan to blow your minds with high flying bulk insert performance into SQL Server using Sqoop's Generic JDBC Connector. Linux clients will get similar throughput as the Microsoft BCP tool.
So far, Cloudera and HortonWorks have been pointing shops to the high performance DataDirect SQL Server JDBC driver to help load data volumes anywhere from 10GB to 1TB into SQL Server Data Marts and Warehouses. It's common for the DataDirect SQL Server JDBC driver to speed up load times by 15-20X; and Sqoop will see similar improvement since it leverages JDBC batches that we transparently convert into SQL Server's native bulk load protocol. Moving data out of Hadoop and into external JDBC sources are exciting projects that represent the democratization of big data for downstream application consumers. You're definitely doing something -fr -pl3 -pb3 if you are ready to read on!
1- Download the DataDirect Connect for JDBC drivers for and follow the quick start guides supplied with the download.
2- Copy the sqlserver.jar file to $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib
if you installed from an RPM or Debian package). The JDBC driver needs to be installed only on the machine where Sqoop is executed; and not on each node in your Hadoop cluster.
3- Verify the Database’s recovery mode per the msdn article on Considerations for Switching from the Full or Bulk-Logged Recovery Model. To verify the recovery mode, the database user can run the following query:
SELECT name, recovery_model_desc FROM sys.databases WHERE name = ‘database_name’ ;
Note the recovery_model_desc returned by this query (expect to return,’BULK_LOGGED’).
4- From command line, run the Sqoop export command using similar properties as below. Or specify the equivalent using the Hue web UI for Sqoop jobs.
sqoop export --connect 'jdbc:datadirect:sqlserver://nc-sqlserver:1433;database=test;user=test01;password=test01;EnableBulkLoad=true;BulkLoadBatchSize=1024;BulkLoadOptions=0' --driver com.ddtek.jdbc.sqlserver.SQLServerDriver --table 'blah_1024MB' --export-dir /user/hdfs/blah_1024MB/ --input-lines-terminated-by "n" --input-fields-terminated-by ',' --batch -m 10
Notes:
The Sqoop option sqoop.export.records.per.statement controls the number of JDBC addBatch calls that sqoop makes before calling execute batch. Once executeBatch is called, the driver then uses the BulkLoadBatchSize value to determine how many round trips to make to the server in order to execute the batch insert. If someone were using Sqoop with sqoop.export.records.per.statement=100000, they are executing 100k row batches. The driver’s default BulkLoadBatchSize is 1024, so the driver would effectively split the 100k row batch into ~98 round trips to the server which is inefficient. Reducing the number of round trips is a means to improve performance, so you would want to increase the BulkLoadBatchSize to match the sqoop.export.records.per.statement value in order for the executeBatch to only use one round trip. This will require more memory, so there is likely a sweet spot value to set both options to in order to execute a large batch of insert values using a single round trip to the server while also not blowing out the JVM heap. The width of the row data will have an impact on the sweet spot, so this will require some tuning.
INSERT INTO films (code, title, did, date_prod, kind) VALUES ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'), ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
View the Sqoop user's guide for complete reference.
Special thanks to Mike Spinak, Principal Software Engineer and Danh Huynh, Systems Administrator with their help with setup and testing in the 6 node Cloudera CDH5.2.0-1.cdh5.2.0.p0.36 cluster to export data into SQL Server 2008 R2.
Results are still coming in from several shops and the best to date is a load of 40 GB into SQL Server within 10 minutes. In the above system, we were loading 37 GB in 18.5 minutes. There are several properties across Hadoop, Sqoop, JDBC driver and SQL Server you can tune to improve performance even further.
Tweet your load times to @SAsInSumit. If you have questions, please contact us; or call me at 1-800-876-3101.
Subscribe to get all the news, info and tutorials you need to build better business apps and sites