Connect Apache Spark SQL to Node.js on Linux via ODBC Driver

August 02, 2016

This tutorial explains how to access Apache Spark SQL data from a Node.js application using the DataDirect Apache Spark SQL ODBC driver on a Linux machine/server.

Apache Spark is changing the way Big Data is accessed and processed. While MapReduce was a good implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster, it was not optimized for interactive data analysis that involves iterative algorithms. Spark was designed to overcome this shortcoming.

As you implement Apache Spark in your organization, we understand that you need ways to connect Apache Spark SQL to your existing ODBC applications. With the DataDirect Spark SQL ODBC driver, any ODBC-capable application can query Spark SQL as a data source. We put together this tutorial to explain how you can connect to Spark SQL from a Node.js application on Linux using the driver.

If you are looking to connect from a Node.js application using the Spark SQL JDBC driver instead, visit this tutorial.

Before You Start

  1. Make sure that you have Java installed on your machine. You can check this by running the command java -version in your terminal.
  2. If Java is installed, you will see the current version of Java displayed on your screen. Otherwise, install Java before proceeding to the next steps.
  3. Make sure you have Apache Spark installed and ready to use on your machine. If you haven’t done so, download Apache Spark pre-built for Hadoop from here.

Install the DataDirect Spark SQL ODBC Driver

  1. Download the DataDirect Spark SQL ODBC driver from here.
  2. Extract the contents from the downloaded package by opening a terminal at the downloaded location and running the following command:

    tar xvf PROGRESS_DATADIRECT_ODBC_SPARKSQL_x.x.x_LINUX_yy.tgz

     

  3. To install the driver, execute the downloaded .bin package by running the following command in the terminal:

    ./PROGRESS_DATADIRECT_ODBC_x.0_LINUX_yy_INSTALL.bin

     

  4. This launches an interactive Java-based installer, which you can use to install the Spark SQL ODBC driver to your desired location (or the default directory) as either a licensed or an evaluation installation.
  5. Go to the ODBC driver install directory; it will be something similar to /install/directory/Progress/ODBC_XX_YYbit/.
  6. For applications to be able to use the DSN and locate the drivers, we have to configure the environment variables ODBCINI, ODBCINST and LD_LIBRARY_PATH. This can be done by sourcing a shell script that you can find in the install directory. For Korn, Bourne and equivalent shells, source the odbc.sh script file; for C shell, use the odbc.csh script file. An example is shown below.
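
For example, from a Bourne- or Korn-compatible shell the script could be sourced as follows. This is only a sketch: the directory is the placeholder install path from the previous step, so substitute your actual install location.

    cd /install/directory/Progress/ODBC_XX_YYbit/
    . ./odbc.sh                     # from a C shell, use: source odbc.csh
    # Confirm that the variables are now set
    echo $ODBCINI $ODBCINST $LD_LIBRARY_PATH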

Load Data into Spark SQL

  1. For the purpose of this tutorial, I will be loading the data from a CSV file that can be found here.
  2. Start the Spark shell using the following command, which is configured to run the Thrift server in single-session mode, as I am only going to register the imported data as a temporary table. I am also including the spark-csv package, which is needed to import the data from the CSV file because CSV is not supported natively:

    spark-shell --conf spark.sql.hive.thriftServer.singleSession=true --packages com.databricks:spark-csv_2.11:1.4.0

     

  3. Once the Spark shell starts successfully, run the following commands to import the data from the CSV file and register it as a temporary table.

    import org.apache.spark.sql._
    import org.apache.spark.sql.hive._
    import org.apache.spark.sql.hive.thriftserver._
    //Read from CSV
    val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema","true").option("header","true").load("/path/to/InsuranceData.csv")
    //Check if CSV was imported successfully
    df.printSchema()
    df.count()
    //Register Temp Table
    df.registerTempTable("InsuranceData")
    sqlContext.sql("select count(*) from InsuranceData").show()
    val hc = sqlContext.asInstanceOf[HiveContext]
    HiveThriftServer2.startWithContext(hc)
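
At this point, HiveThriftServer2 should be listening on port 10000. As an optional sanity check (not part of the original steps, and assuming the server is running on the local machine on the default port), you can connect with the beeline client that ships with Spark and confirm that the temporary table is visible:

    beeline -u jdbc:hive2://localhost:10000 -e "show tables;"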

     

Configuring the Data Source

  1. To configure the data source for the Spark SQL server, go to /install/directory/Progress/ODBC_XX_YYbit/, where you will find the default odbc.ini file in which DSN definitions are created. For your reference, the odbc.ini in the installation directory contains a template entry, which you can use as the starting point for your DSN definition.
  2. Open odbc.ini in an editor and, under the section [Apache Spark SQL] (which will be the name of the DSN), set the hostname to your Apache Spark SQL server address, the database to "default" and the port to 10000. Save odbc.ini and close the editor. A sample entry is sketched below.
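
As a rough illustration, a minimal DSN entry might look like the following. Every value here is a placeholder: the driver library name, the install path and the exact attribute names should all be copied from the template entry shipped in your own odbc.ini rather than from this sketch.

    [ODBC Data Sources]
    Apache Spark SQL=DataDirect Apache Spark SQL ODBC Driver

    [Apache Spark SQL]
    Driver=/install/directory/Progress/ODBC_XX_YYbit/lib/<spark sql driver>.so
    Description=DataDirect Apache Spark SQL ODBC Driver
    HostName=<your Spark SQL Thrift server address>
    PortNumber=10000
    Database=default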

Connect to Your Data from Node.js

  1. In your Node.js application, install the node-odbc module using npm (sample commands are shown after these steps). Read this page for a guide on how to install the node-odbc module.
  2. You can now access the data from Spark SQL through the DataDirect Spark SQL ODBC driver by loading the ODBC module in your code. The following code snippet demonstrates how to do it:

    //Connection Parameters configuration
    var db = require('odbc')(),
        cn = "Dsn=Apache Spark SQL;UID=<username>;PWD=<password>;db=default";
    //open the connection
    db.open(cn, function (err) {
        if (err) {
            return console.log(err);
        }
        // Run a sample query
        db.query("select * from InsuranceData", function (err, rows, moreResultSets) {
            if (err) {
                return console.log(err);
            }
            console.log(rows);
        });
    });
  3. When you run the above code, it should print all the records in your temporary table to the console.
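
For reference, the commands below show one typical way to install the module and run the snippet. The script filename is only an example; use whatever name you saved the code under.

    npm install odbc            # installs the node-odbc module
    node sparksql-odbc.js       # sparksql-odbc.js is a hypothetical filename for the snippet above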

Next Steps

The DataDirect Spark SQL ODBC driver is a best-in-class certified connectivity solution for Apache Spark SQL. For additional information about our other solutions for Apache and other Big Data frameworks, check here. To learn more about the Spark SQL ODBC driver, visit our Spark SQL ODBC driver page and try it free for 15 days. Please subscribe to our blog via email or RSS feed for more tutorials.



Saikrishna Teja Bobba

Saikrishna is a DataDirect Developer Evangelist at Progress. Prior to working at Progress, he worked as a Software Engineer for 3 years after getting his undergraduate degree, and he recently graduated from NC State University with a Master's in Computer Science. His interests are in the areas of Data Connectivity, SaaS and Mobile App Development.