Streaming Data into MarkLogic with the Kafka-MarkLogic Connector

Why use Kafka with MarkLogic?

The amount of data flowing into and between systems continues to grow every day. With these ever-increasing volumes, system integrators are turning to tools like Apache Kafka to provide a central routing service for streaming that data. Databases like MarkLogic are among the primary consumers of that data.

However, to subscribe to Kafka topics, retrieve the messages, and load them into a MarkLogic database, we need an efficient and reliable tool to act as the bridge: the Kafka-MarkLogic-Connector.

This tool is intended for anyone interested in using Kafka to stream data to MarkLogic. That could be a solutions engineer working with a potential customer who is considering Kafka, or a consultant working with an existing customer to design a solution around Kafka and MarkLogic. It could also simply be an experimenter: somebody who is trying out different technologies for learning or for fun.

How does the Kafka-MarkLogic-Connector work?

The Kafka-MarkLogic-Connector, written in Java, uses the standard Kafka APIs and libraries to subscribe to Kafka topics and consume messages. The connector then uses the MarkLogic Data Movement SDK (DMSDK) to efficiently store those messages in a MarkLogic database. As messages stream onto the Kafka topic, the threads of the DMSDK aggregate the messages into batches and push each batch into the database based on a configured batch size and time-out threshold.
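The batching behavior described above can be sketched in plain Java. This is only an illustration of flushing on a batch-size or time-out threshold, not the DMSDK or the connector's actual code: the class BatchBuffer, its methods, and the writer callback are all hypothetical names, and the clock is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical sketch: collects messages and hands them to a writer
 * when either the batch is full or the oldest message has waited too long.
 * This is NOT the DMSDK API, only an illustration of the idea.
 */
final class BatchBuffer {
    private final int batchSize;            // flush when this many messages are pending
    private final long timeoutMillis;       // flush when the oldest message is this old
    private final Consumer<List<String>> writer; // stands in for the database write
    private final List<String> pending = new ArrayList<>();
    private long oldestMillis;              // arrival time of the first pending message

    BatchBuffer(int batchSize, long timeoutMillis, Consumer<List<String>> writer) {
        this.batchSize = batchSize;
        this.timeoutMillis = timeoutMillis;
        this.writer = writer;
    }

    /** Adds one message; flushes immediately if the batch is now full. */
    synchronized void add(String message, long nowMillis) {
        if (pending.isEmpty()) {
            oldestMillis = nowMillis;
        }
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Called periodically; flushes a partial batch that has waited too long. */
    synchronized void tick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - oldestMillis >= timeoutMillis) {
            flush();
        }
    }

    /** Writes whatever is pending as one batch, then clears the buffer. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        writer.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

With a batch size of 3, the third call to add triggers a write on its own, while a lone message is written only once tick sees that the time-out has elapsed. The time-out matters on quiet topics, where a partial batch would otherwise sit in memory indefinitely.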

All three components of the system (Kafka, MarkLogic, and the Kafka-MarkLogic-Connector) are designed so that new servers can easily be added. New Kafka nodes can be used for redundancy to prevent data loss; combined with MarkLogic’s ACID transactions, this gives the system extremely high reliability. New server nodes can also quickly and dynamically increase available bandwidth. As resources are maxed out, each of the three components may be expanded independently to meet data flow requirements.

What are the advantages of using the tool?

Scalability: A system made up of Kafka, the Kafka-MarkLogic-Connector, and MarkLogic is broadly scalable and reliable, and each of these components can scale independently. As the demands on Kafka increase, additional connectors may be added to monitor the topic. With the MarkLogic cluster behind a load balancer, a properly configured system is capable of processing a very large number of messages per minute.
No Code: The Kafka-MarkLogic-Connector is convenient and simple to use. Each component of the system described above may be set up and integrated without writing any code. All that is required is configuring the connector so that it is aware of the Kafka cluster and is connected to the MarkLogic cluster. Once those parameters are set properly, you are ready to stream messages from a Kafka topic to a MarkLogic database.
AWS-Ready: All of these components are compatible with AWS cloud computing services, so all the advantages of AWS are available as well. We can design and deploy our system using tools such as CloudFormation and monitor it using CloudWatch. Additionally, since each of the three main components is scalable, we can take advantage of AWS auto-scaling to automatically grow and shrink each component as needs dictate.

What can be done with this tool in MarkLogic?

To summarize, this tool is used primarily for streaming large amounts of data into MarkLogic. Kafka is a message streaming system capable of handling very high message volumes. Those messages often need to be stored somewhere, and that somewhere is MarkLogic. Using just a single MarkLogic server on an AWS t2.xlarge instance, the connector can retrieve and store approximately 4000 messages per second.
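To put the figure of approximately 4000 messages per second in perspective, the short sketch below converts it to per-minute and per-day totals. The class and method names are hypothetical, and nodesFor assumes that throughput scales linearly with the number of servers, which is a simplifying assumption for rough sizing and not a measured result.

```java
/** Hypothetical helper for back-of-the-envelope throughput estimates. */
final class Throughput {
    /** Approximate single-server rate quoted in the article. */
    static final long MESSAGES_PER_SECOND = 4_000L;

    /** Messages handled in one minute at the given per-second rate. */
    static long perMinute(long perSecond) {
        return perSecond * 60;
    }

    /** Messages handled in one day at the given per-second rate. */
    static long perDay(long perSecond) {
        return perSecond * 60 * 60 * 24;
    }

    /**
     * Servers needed for a target rate, rounded up.
     * ASSUMPTION: throughput scales linearly with server count.
     */
    static long nodesFor(long targetPerSecond, long perNodePerSecond) {
        return (targetPerSecond + perNodePerSecond - 1) / perNodePerSecond;
    }
}
```

At the quoted rate, perMinute(MESSAGES_PER_SECOND) is 240,000 messages per minute and perDay(MESSAGES_PER_SECOND) is 345,600,000 messages per day. Under the linear-scaling assumption, a target of 10,000 messages per second would call for nodesFor(10_000, 4_000) = 3 servers; real clusters should be load-tested instead of sized from this estimate alone.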

Thus, this system has the potential to work with high-bandwidth data sources, such as IoT sensors, satellite constellations, or internet traffic data. Ultimately, the speed of each component means that the system can keep up with these sources and store their data reliably, which is valuable for almost any application.

If you’d like some hands-on experience with the tool, read the Quickstart with the Kafka-MarkLogic-Connector in AWS to get a basic version of this system set up.

MarkLogic

Phil Barber

Phil has been building solutions using MarkLogic for nearly eight years, including the last five as a MarkLogic consultant. He has more than 30 years of experience in the software industry and loves solving problems. Phil lives with his wife and family in Fredericksburg, Va., and enjoys games and learning new skills.