This tutorial explains how to access Apache Spark SQL data from a Node.js application using the DataDirect Apache Spark SQL ODBC driver on a Linux machine/server.
Apache Spark is changing the way Big Data is accessed and processed. While MapReduce was a good implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster, it was not optimized for interactive data analysis that involves iterative algorithms. Spark was designed to overcome this shortcoming.
As you implement Apache Spark in your organization, you will need ways to connect it to your ODBC applications. With an ODBC driver, Spark SQL can be queried from any ODBC-capable application. This tutorial shows how to query Spark SQL from a Node.js application on Linux using a Spark SQL ODBC driver.
If you want to connect a Node.js application through JDBC using a Spark SQL JDBC driver instead, visit this tutorial.
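Extract the downloaded driver package and run the installer, replacing the x and yy placeholders with the values in the file names you downloaded:
tar xvf PROGRESS_DATADIRECT_ODBC_SPARKSQL_x.x.x_LINUX_yy.tgz
./PROGRESS_DATADIRECT_ODBC_x.0_LINUX_yy_INSTALL.bin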
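Start the Spark shell with the spark-csv package. The singleSession setting makes the Thrift server share one session, so a temporary table registered in the shell is visible to ODBC clients:
spark-shell --conf spark.sql.hive.thriftServer.singleSession=true --packages com.databricks:spark-csv_2.11:1.4.0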
Once the Spark shell starts successfully, run the following commands to import the data from the CSV file and register it as a temporary table.
import org.apache.spark.sql._
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.thriftserver._
//Read from CSV
val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema","true").option("header","true").load("/path/to/InsuranceData.csv")
//Check if CSV was imported successfully
df.printSchema()
df.count()
//Register Temp Table
df.registerTempTable("InsuranceData")
sqlContext.sql("select count(*) from InsuranceData").show()
//Start the Thrift server with this context so ODBC clients can query the temp table
val hc = sqlContext.asInstanceOf[HiveContext]
HiveThriftServer2.startWithContext(hc)
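The Thrift server started above is what the ODBC driver connects to. As a quick sanity check before moving to the Node.js script, the sketch below tests whether a TCP port accepts connections. The function name probePort, the host, and the port are assumptions for illustration; 10000 is the usual HiveServer2 Thrift default, but your configuration may differ.

```javascript
const net = require('net');

// Hypothetical helper: resolves to true if a TCP connection to host:port
// succeeds within timeoutMs, and to false on error or timeout.
// It only checks that something is listening, not that it is Spark.
function probePort(host, port, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (reachable) => {
      socket.destroy();
      resolve(reachable);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Example call (host and port are assumptions; adjust to your setup):
// probePort('localhost', 10000).then((ok) => console.log(ok));
```

A result of false usually means the Thrift server is not running or is listening on a different port, which is worth ruling out before troubleshooting the ODBC data source itself.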
//Connection Parameters configuration
var db = require('odbc')()
  , cn = "Dsn=Apache Spark SQL;UID=<username>;PWD=<password>;db=default";
//open the connection
db.open(cn, function (err) {
  if (err) {
    return console.log(err);
  }
  // Run a sample query
  db.query("select * from InsuranceData", function (err, rows, moreResultSets) {
    if (err) {
      return console.log(err);
    }
    console.log(rows);
  });
});
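The script above hard-codes credentials and never closes the connection. One possible variation, shown here as a sketch rather than the driver's recommended pattern, builds the connection string from separate values and wraps the open/query calls in a Promise that always closes the connection. The helper names buildConnectionString and queryOnce and the environment variable names are hypothetical, and db.close(callback) is assumed to be available on the same callback-style odbc object the script uses.

```javascript
// Build an ODBC connection string from key/value pairs,
// e.g. { Dsn: 'Apache Spark SQL', UID: 'user' } -> "Dsn=Apache Spark SQL;UID=user".
function buildConnectionString(params) {
  return Object.entries(params)
    .map(([key, value]) => {
      if (value === undefined || value === null) {
        throw new Error(`Missing value for ${key}`);
      }
      const text = String(value);
      // A raw ';' would split the value into a new key/value pair.
      // Escaping rules vary by driver, so reject it here instead of guessing.
      if (text.includes(';')) {
        throw new Error(`Value for ${key} must not contain ';'`);
      }
      return `${key}=${text}`;
    })
    .join(';');
}

// Open a connection, run one query, and close the connection whether
// the query succeeded or failed. `db` must expose callback-style
// open/query/close methods (close is an assumption, see the lead-in).
function queryOnce(db, connectionString, sql) {
  return new Promise((resolve, reject) => {
    db.open(connectionString, (openErr) => {
      if (openErr) {
        return reject(openErr);
      }
      db.query(sql, (queryErr, rows) => {
        db.close(() => (queryErr ? reject(queryErr) : resolve(rows)));
      });
    });
  });
}

// Example wiring (needs the odbc package and a configured DSN;
// SPARK_UID and SPARK_PWD are hypothetical variable names):
// const db = require('odbc')();
// const cn = buildConnectionString({
//   Dsn: 'Apache Spark SQL',
//   UID: process.env.SPARK_UID,
//   PWD: process.env.SPARK_PWD,
//   db: 'default',
// });
// queryOnce(db, cn, 'select * from InsuranceData')
//   .then((rows) => console.log(rows))
//   .catch((err) => console.error(err));
```

Passing db in as a parameter keeps the helper independent of how the odbc object is created, which also makes it easy to exercise with a stub object before pointing it at a real data source.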
The DataDirect SparkSQL ODBC driver is a best-in-class certified connectivity solution for Apache Spark SQL. For additional information about our other solutions for Apache and other Big Data frameworks, check here. To learn more about the Spark SQL ODBC driver, visit our Spark SQL ODBC driver page and try it free for 15 days. Please subscribe to our blog via email or RSS feed for more tutorials.
Saikrishna is a DataDirect Developer Evangelist at Progress. Prior to working at Progress, he worked as a Software Engineer for 3 years after getting his undergraduate degree, and he recently graduated from NC State University with a Master's in Computer Science. His interests are in the areas of Data Connectivity, SaaS and Mobile App Development.