Today, we are excited to announce the availability of the MarkLogic Connector for Apache Spark. Apache Spark has gained significant user adoption and is an important tool for complex data processing and analytics, especially when it involves machine learning and AI. By combining Spark with MarkLogic’s data persistence and governance capabilities, organizations can build a modern integration hub that is more consistent, powerful, and well-governed than Spark alone can provide.
To get started, users can download the MarkLogic Connector for Apache Spark here.
Apache Spark is an in-memory, distributed data processing engine for analytical applications, including machine learning, SQL, streaming, and graph processing. As a unified analytical tool, it is widely used by developers to build scalable data pipelines that span diverse data sources, including relational databases and NoSQL systems. Spark supports a variety of programming languages (such as Scala, Java, and Python), making it a tool of choice for data engineering and data science tasks.
While Apache Spark is widely used for analytical processing at scale, it does not include its own distributed data persistence layer. This is where MarkLogic Data Hub shines as a unified operational and analytical platform for integrating and managing heterogeneous data from multiple systems.
The combination of Apache Spark and MarkLogic enables organizations to modernize their data analytics infrastructure for faster time-to-insights while reducing cost and risk. Using the MarkLogic Connector for Apache Spark, developers can run Spark jobs for advanced analytics and machine learning directly on data in MarkLogic. This removes the ETL overhead that would otherwise be required when moving and wrangling data between separate operational and analytics systems. Instead, organizations can achieve a simpler architecture and speed up delivery of analytical applications that rely on durable data assets managed in a MarkLogic data hub.
Below are a few use cases for Spark with MarkLogic:
The MarkLogic Connector for Spark is compatible with Spark’s DataSource API, providing a seamless developer experience. The connector returns data from MarkLogic as a Spark DataFrame that can be processed right away using Spark SQL and other Spark APIs. Developers can apply their existing skills, using Spark’s native libraries (for SQL, machine learning, and more) in a variety of programming languages (such as Java, Scala, and Python) to build sophisticated analytics on top of MarkLogic.
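To make the DataSource flow described above a little more concrete, here is a minimal PySpark sketch of reading MarkLogic data into a DataFrame and querying it with Spark SQL. Treat it as an illustration under stated assumptions, not as the connector's documented usage: the format name `marklogic`, the two option keys, the port, the credentials, and the `example`/`employee` view are all placeholders that should be checked against the connector documentation for your version. The helper names (`marklogic_reader_options`, `load_dataframe`, `run_example`) are hypothetical.

```python
from typing import Dict


def marklogic_reader_options(host: str, port: int, user: str,
                             password: str, optic_query: str) -> Dict[str, str]:
    """Build the option map handed to Spark's DataFrameReader.

    The option keys are assumptions for illustration; confirm them in the
    connector documentation before relying on them.
    """
    return {
        "spark.marklogic.client.uri": f"{user}:{password}@{host}:{port}",
        "spark.marklogic.read.opticQuery": optic_query,
    }


def load_dataframe(spark, options: Dict[str, str], source_format: str = "marklogic"):
    """Read through a Spark DataSource by format name and return a DataFrame."""
    # format(), options(), and load() are standard DataFrameReader methods;
    # only the format name and option keys are connector-specific assumptions.
    return spark.read.format(source_format).options(**options).load()


def run_example() -> None:
    """Not called on import: needs pyspark, the connector JAR on the
    classpath, and a reachable MarkLogic server."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("marklogic-sketch").getOrCreate()
    options = marklogic_reader_options(
        "localhost", 8003, "example-user", "example-password",
        "op.fromView('example', 'employee')",  # hypothetical schema and view
    )
    employees = load_dataframe(spark, options)

    # Once loaded, the data is an ordinary DataFrame, so Spark SQL applies.
    employees.createOrReplaceTempView("employee")
    spark.sql("SELECT COUNT(*) AS total FROM employee").show()
    spark.stop()
```

Keeping the option map in its own function means the connection details can be reviewed and tested without a Spark session, while `load_dataframe` relies only on the generic reader interface that any DataSource-compatible connector plugs into.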
Together, MarkLogic and Spark provide substantial benefits for building intelligent analytical applications. The MarkLogic Connector for Spark lets organizations get the most from MarkLogic as the trusted source of durable data assets and from Spark as the high-performance analytical framework.
To get started, follow along with the hands-on, step-by-step tutorial. To learn more about how you can configure the MarkLogic Connector for Apache Spark, please check out the documentation here. Apache Spark documentation is available here.