We are excited to announce the availability of the MarkLogic Connector for AWS Glue. AWS Glue is a serverless ETL tool provided as a managed service in the AWS cloud ecosystem. When connected to MarkLogic, AWS Glue provides a simple way to build data pipelines for moving data in and out of MarkLogic using visual and code-based interfaces. To get started, subscribe to the MarkLogic Connector for AWS Glue on the AWS Marketplace.
What Is AWS Glue?
AWS Glue provides a fully managed, serverless Apache Spark infrastructure to graphically create, run, and monitor ETL pipelines. Its graphical interface, Glue Studio, automatically generates code, sparing developers the time and effort of hand-coding and optimizing Spark jobs.
Using AWS Glue, developers can build ETL pipelines using readily available connectors for AWS services like Aurora, RDS, S3, Redshift, Kinesis, and DynamoDB as well as third-party databases like Oracle or Snowflake. It provides a data catalog and a rich library of out-of-the-box data transformations (such as filters and joins) to easily model ETL pipelines in Glue Studio. Additionally, developers can choose to code a data pipeline in either Scala or Python.
Note that for those who want to use Apache Spark with MarkLogic but are not using the AWS Glue service, we have also released a MarkLogic Connector for Apache Spark.
Using AWS Glue with MarkLogic
MarkLogic customers can now use AWS Glue to implement Spark ETL pipelines for fast data ingestion and data export.
High-performance Data Ingestion
The MarkLogic Connector for AWS Glue makes it simple to bulk load or stream relational and non-relational data as is into MarkLogic. Additionally, it provides the flexibility of using Glue’s data transformation capabilities to combine and transform tabular data from multiple sources into hierarchical data formats like JSON before loading into MarkLogic.
As an example, users can easily use the new Glue connector to build a batch or a change data capture pipeline to load complex data (or source entities) into MarkLogic Data Hub Service. Once loaded, Data Hub Service has the necessary capabilities to integrate source data into durable data assets for later use in operational and analytical applications.
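To make the tabular-to-hierarchical step concrete, here is a minimal plain-Python sketch of combining rows from two tabular sources into one JSON document per entity. It is an illustration only, not the connector's API: the `customers` and `orders` rows, their field names, and the `nest_orders` helper are all hypothetical, and a real Glue job would apply the same idea to Spark data rather than Python lists.

```python
import json
from collections import defaultdict

# Hypothetical rows, as they might arrive from two relational sources.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.5},
    {"order_id": 12, "customer_id": 2, "total": 12.0},
]


def nest_orders(customers, orders):
    """Return one hierarchical document per customer, with its orders nested inside."""
    orders_by_customer = defaultdict(list)
    for order in orders:
        # Drop the foreign key: the nesting itself now expresses the relationship.
        child = {k: v for k, v in order.items() if k != "customer_id"}
        orders_by_customer[order["customer_id"]].append(child)
    return [
        {**customer, "orders": orders_by_customer.get(customer["customer_id"], [])}
        for customer in customers
    ]


documents = nest_orders(customers, orders)

# Each document is now a self-contained JSON entity ready to be loaded.
print(json.dumps(documents[0], indent=2))
```

In a Spark-based pipeline, the equivalent reshaping would typically be a join followed by a grouping step that collects child rows into an array, with the resulting documents then handed to the connector for loading.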
Secure Data Sharing
The MarkLogic Connector for AWS Glue also makes it easy to consume data from MarkLogic with complete security and governance. Users can build scalable data pipelines for complex analytical processing using Spark libraries (such as machine learning and SQL) on clean, curated, and governed data in MarkLogic. Additionally, users can leverage MarkLogic’s multi-model querying capabilities to securely share fit-for-purpose data with AWS services like SageMaker, Redshift, and S3, and with third-party data stores like Snowflake.
Get Started
To use the MarkLogic Connector for AWS Glue, simply subscribe to the connector in the AWS Marketplace. Once you have subscribed, the MarkLogic connector will appear in your AWS Glue Studio, where you can graphically build data pipelines.
To get started, follow along with the hands-on, step-by-step tutorial. To learn more about configuring the MarkLogic Connector for AWS Glue, see the connector documentation. For the service itself, see the AWS Glue documentation.