The amount of data flowing into and between systems continues to grow every day. With these ever-increasing volumes of data, system integrators are turning to tools like Apache Kafka to provide a central routing service for streaming that data. Databases like MarkLogic are among the primary consumers of that data.
However, to subscribe to Kafka topics, retrieve the messages, and load them into a MarkLogic database, we need an efficient and reliable tool to act as the bridge: the Kafka-MarkLogic-Connector.
This tool is intended for anyone interested in using Kafka to stream data to MarkLogic. For instance, that could be a solutions engineer working with a potential customer who is considering Kafka, or a consultant working with an existing customer to design a solution around Kafka and MarkLogic. It could also simply be an experimenter, somebody who is trying out different technologies for learning or for fun.
The Kafka-MarkLogic-Connector, written in Java, uses the standard Kafka APIs and libraries to subscribe to Kafka topics and consume messages. The connector then uses the MarkLogic Data Movement SDK (DMSDK) to efficiently store those messages in a MarkLogic database. As messages stream onto the Kafka topic, the DMSDK threads aggregate them into batches and push each batch into the database based on a configured batch size and time-out threshold.
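To make the batching idea concrete, here is a minimal sketch in plain Java of a buffer that flushes either when it fills up or when its oldest message has waited too long. This is not the connector's code and it does not use the DMSDK: the class name SizeOrTimeBatcher, its methods, and the explicit clock argument are all hypothetical, chosen only to illustrate the "batch size or time-out" rule described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only: this class is not part of the connector or the DMSDK.
class SizeOrTimeBatcher {
    private final int batchSize;               // flush once this many messages are buffered
    private final long maxWaitMillis;          // flush once the oldest buffered message is this old
    private final Consumer<List<String>> sink; // stands in for "write the batch to the database"
    private final List<String> buffer = new ArrayList<>();
    private long oldestMillis;                 // arrival time of the first message in the buffer

    SizeOrTimeBatcher(int batchSize, long maxWaitMillis, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.maxWaitMillis = maxWaitMillis;
        this.sink = sink;
    }

    // Buffer one message; flush immediately if the batch is now full.
    synchronized void add(String message, long nowMillis) {
        if (buffer.isEmpty()) {
            oldestMillis = nowMillis;
        }
        buffer.add(message);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Called periodically; flushes a partial batch that has waited long enough.
    synchronized void tick(long nowMillis) {
        if (!buffer.isEmpty() && nowMillis - oldestMillis >= maxWaitMillis) {
            flush();
        }
    }

    // Hand a copy of the buffered messages to the sink and start a new batch.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        SizeOrTimeBatcher batcher = new SizeOrTimeBatcher(3, 1000,
                batch -> System.out.println("writing batch: " + batch));
        batcher.add("m1", 0);
        batcher.add("m2", 10);
        batcher.add("m3", 20);  // third message reaches the batch size, so the batch is flushed
        batcher.add("m4", 100);
        batcher.tick(500);      // too early: m4 has waited only 400 ms
        batcher.tick(1100);     // m4 has waited 1000 ms, so it is flushed on its own
    }
}
```

Passing the current time in as an argument keeps the sketch deterministic and easy to test. A real implementation would more likely read a clock and run the time-out check on a scheduled background thread.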
All three components of the system (Kafka, MarkLogic, and the Kafka-MarkLogic-Connector) are designed so that new servers can easily be added. Additional Kafka nodes can be used for redundancy to prevent data loss. Combined with MarkLogic’s ACID transactions, this gives the system extremely high reliability. Additional server nodes can also quickly and dynamically increase available bandwidth. As resources are maxed out, each of the three components may be expanded independently to meet data flow requirements.
To summarize, this tool is used primarily for streaming large amounts of data into MarkLogic. Kafka is a message streaming system capable of handling very high message volumes. Those messages may need to be stored somewhere, and that somewhere is MarkLogic. Using just a single MarkLogic server on an AWS t2.xlarge instance, the connector can retrieve and store approximately 4,000 messages per second.
Thus, this system has the potential to work with high-bandwidth data sources, such as IoT sensors, satellite constellations, or internet traffic data. Ultimately, the speed of each of the components means that even fast-moving data can be stored reliably, which has universal value.
If you’d like some hands-on experience with the tool, read the Quickstart with the Kafka-MarkLogic-Connector in AWS to get the basic version of this system set up.
Phil has been building solutions using MarkLogic for nearly eight years, including the last five as a MarkLogic consultant. He has more than 30 years of experience in the software industry and loves solving problems. Phil lives with his wife and family in Fredericksburg, Va., and enjoys games and learning new skills.