We are excited to announce support for using Apache NiFi to ingest data into MarkLogic. Apache NiFi is an open source tool for distributing and processing data. When used alongside MarkLogic, it’s a great tool for building ingestion pipelines. NiFi has an intuitive drag-and-drop UI and over a decade of development behind it, with a big focus on security and governance.
One of the historical challenges to adopting new NoSQL databases is getting legacy relational data migrated over. Relational databases store data in rows and columns in a highly normalized form. MarkLogic, a multi-model NoSQL database, stores data as JSON and XML documents and RDF triples. Typically, you group data into natural “entities” that are modeled as documents, and you add RDF triples to capture meaningful relationships among the entities.
NiFi helps to naturally group your data by either converting relational rows to small documents or joining groups of rows together into hierarchical structures using primary/foreign key relationships. With new MarkLogic processors, this data then moves quickly into MarkLogic with minimal configuration and high performance.
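To make the second option concrete, here is a minimal sketch of joining child rows under their parent row through a shared key. It is plain Python, not NiFi configuration, and the table contents, column names and the `rows_to_documents` helper are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical rows, as they might come back from two relational tables.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
    {"order_id": 12, "customer_id": 2, "total": 15.5},
]


def rows_to_documents(parents, children, key, child_field):
    """Nest child rows under their parent row using a shared key column."""
    grouped = defaultdict(list)
    for child in children:
        # Drop the foreign key from the nested copy; the parent already carries it.
        grouped[child[key]].append({k: v for k, v in child.items() if k != key})
    return [{**parent, child_field: grouped.get(parent[key], [])} for parent in parents]


documents = rows_to_documents(customers, orders, "customer_id", "orders")
print(json.dumps(documents[0], indent=2))
```

Each resulting document is one "entity": a customer together with its orders, ready to be stored as a single JSON document.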
The NiFi approach uses the data model that already exists in the relational database to the extent possible, avoiding costly, fragile and slow ETL jobs. Existing approaches such as MarkLogic Content Pump (mlcp) still work well for getting data into MarkLogic. However, NiFi makes the whole process of ingesting relational data into MarkLogic faster and easier, and you don’t need to buy a separate ETL tool.
If you are interested and want to become an expert, read the white paper that discusses why you should Rethink Data Modeling, or watch the presentation on Becoming a Document Modeling Guru.
Here is an example:
The above screenshot shows a simple process for getting relational data into MarkLogic. An SQL query is executed to get data out of a relational system. Then, a NiFi processor converts the resulting Avro-serialized data to JSON, and the JSON data is put into MarkLogic. Watch this five-minute demo that shows how to get relational data ingested into MarkLogic using NiFi.
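The shape of that three-stage flow can be sketched outside NiFi as well. The following is an illustration only, under several assumptions: an in-memory SQLite database stands in for the relational system, the Avro step is skipped (rows go straight to JSON), a plain dictionary stands in for MarkLogic, and the function names and URI pattern are hypothetical.

```python
import json
import sqlite3


def run_query(conn, sql):
    """Stage 1: execute a SQL query and return each row as a dict."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(sql)]


def to_json_documents(rows, uri_template, key):
    """Stage 2: serialize each row as JSON and assign it a document URI."""
    return {uri_template.format(row[key]): json.dumps(row) for row in rows}


def put_documents(store, documents):
    """Stage 3: write documents to the target (a dict standing in for MarkLogic)."""
    store.update(documents)
    return len(documents)


# Assumed sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

store = {}
rows = run_query(conn, "SELECT id, name FROM employee ORDER BY id")
written = put_documents(store, to_json_documents(rows, "/employee/{}.json", "id"))
print(written, sorted(store))
```

In NiFi, each of these functions corresponds to a processor on the canvas, and the hand-offs between them are connections rather than function calls.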
NiFi is designed and built to handle real-time data flows at scale. However, NiFi is not advertised as an ETL tool, and we don’t think it should be used for traditional ETL. The sweet spot for NiFi is handling the “E” in ETL. It extracts data easily and efficiently. If necessary, it can do some minimal transformation work along the way. We think it’s better to let the database (i.e., MarkLogic) take care of the data transformation and harmonization.
The main benefits of NiFi include the following:
The main concepts to understand when using NiFi are dataflows, processors and connections. You create a dataflow by wiring together processors with connections. A dataflow can be saved as a template, and these templates can be combined into more complex flows and reused or replicated across servers.
The following table from Hortonworks provides a very nice summary of the individual components and how they map to dataflow programming:
| NiFi Term | FBP Term | Description |
| --- | --- | --- |
| FlowFile | Information Packet | A FlowFile represents each object moving through the system, and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes. |
| FlowFile Processor | Black Box | Processors actually perform the work. In enterprise integration terms, a processor is doing some combination of data routing, transformation or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or roll back. |
| Connection | Bounded Buffer | Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues then can be prioritized dynamically and can have upper bounds on load, which enable back pressure. |
| Flow Controller | Scheduler | The Flow Controller maintains the knowledge of how processes actually connect and manages the threads and allocations that all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors. |
| Process Group | Subnet | A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow for the creation of entirely new components through the composition of other components. |
Source: Hortonworks
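To show how these terms fit together, here is a small single-threaded model of a FlowFile, a bounded connection and a processor that respects back pressure. It is a teaching sketch, not NiFi's implementation or API: the class and function names are hypothetical, and real NiFi schedules processors on threads managed by the Flow Controller.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Content bytes plus a map of string attributes."""
    content: bytes = b""
    attributes: dict = field(default_factory=dict)


class Connection:
    """A bounded queue that links two processors."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()

    def __len__(self):
        return len(self._items)

    def is_full(self):
        return len(self._items) >= self.max_size

    def offer(self, flow_file):
        """Enqueue a FlowFile; refuse (return False) when the bound is reached."""
        if self.is_full():
            return False
        self._items.append(flow_file)
        return True

    def poll(self):
        """Dequeue the oldest FlowFile, or return None when the queue is empty."""
        return self._items.popleft() if self._items else None


def run_uppercase_processor(inbound, outbound):
    """Transform FlowFiles until input runs out or the outbound queue is full."""
    processed = 0
    while not outbound.is_full():  # back pressure: stop while downstream is full
        flow_file = inbound.poll()
        if flow_file is None:
            break
        attributes = {**flow_file.attributes, "transformed": "true"}
        outbound.offer(FlowFile(flow_file.content.upper(), attributes))
        processed += 1
    return processed


inbound = Connection(max_size=5)
outbound = Connection(max_size=2)
for i in range(5):
    inbound.offer(FlowFile(f"row {i}".encode(), {"index": str(i)}))

first_pass = run_uppercase_processor(inbound, outbound)   # stops once outbound fills
drained = outbound.poll()                                 # a downstream consumer takes one
second_pass = run_uppercase_processor(inbound, outbound)  # one free slot, one more FlowFile
print(first_pass, len(inbound), drained.content, second_pass)
```

The upper bound on the outbound connection is what lets a slow consumer throttle a fast producer: the processor leaves work sitting in its inbound queue rather than overrunning the next stage.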
Using NiFi with MarkLogic is similar to using NiFi with any other database: you just need to use the processors specifically built for getting data in and out of MarkLogic.
There are currently two processors built for MarkLogic: the PutMarkLogic processor for ingesting data into MarkLogic and the QueryMarkLogic processor for querying documents in MarkLogic. Both of these processors are built on top of MarkLogic’s Data Movement SDK.
The following list of capabilities gives a general idea of what each processor can do.
The steps below illustrate how fast and easy it is to get started using NiFi with MarkLogic.
Download the NiFi binaries from http://nifi.apache.org/download.html. Make sure you’re on the latest release of NiFi (1.7). Unpack (i.e., unzip) the tar or zip file into a directory of your choice (for example: /abc).
Clone the MarkLogic/nifi-nars repository to get the MarkLogic-specific processors located in the GitHub repository.
Place the MarkLogic-specific processor files in the correct directory. To do this, copy the two .nar files provided by MarkLogic in the zip folder into the lib folder (nifi-1.7.0/lib) of the unpacked NiFi distribution.
Go to the Apache NiFi Development Quickstart and find the Decompress and Launch sections. You can skip the decompress instructions, because you already unpacked NiFi in the download step. Make sure that you are in the directory of your NiFi installation; if not, change to it (e.g., “cd /abc/nifi-1.7.0”). Then follow the launch instructions provided in the Apache NiFi Development Quickstart for your particular environment.
Now, you’re ready to run NiFi. Point a web browser at http://localhost:8080/nifi/ to open the NiFi UI. Make sure you are running MarkLogic version 9.0+.
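If you set up NiFi on more than one machine, the step that copies the .nar files into the lib folder can be scripted. The sketch below is one possible way to do it; the `install_nars` helper is hypothetical, and the demonstration runs against a throwaway directory layout with placeholder file names rather than the real MarkLogic .nar files.

```python
import shutil
import tempfile
from pathlib import Path


def install_nars(source_dir, nifi_home):
    """Copy every .nar file from source_dir into <nifi_home>/lib; return the names copied."""
    lib_dir = Path(nifi_home) / "lib"
    if not lib_dir.is_dir():
        raise FileNotFoundError(f"No lib directory under {nifi_home}; is NiFi unpacked there?")
    copied = []
    for nar in sorted(Path(source_dir).glob("*.nar")):
        shutil.copy2(nar, lib_dir / nar.name)
        copied.append(nar.name)
    return copied


# Demonstration against a temporary layout; the .nar names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    nars = Path(tmp) / "nifi-nars"
    home = Path(tmp) / "nifi-1.7.0"
    nars.mkdir()
    (home / "lib").mkdir(parents=True)
    for name in ("example-a.nar", "example-b.nar"):
        (nars / name).write_bytes(b"")
    installed = install_nars(nars, home)
    print(installed)
```

For a real installation you would call the helper with the folder that holds the MarkLogic .nar files and your NiFi directory (for example, /abc/nifi-1.7.0), then restart NiFi so it picks up the new processors.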
Matt Allen is a VP of Product Marketing Manager responsible for marketing all the features and benefits of MarkLogic across all verticals. In this role, Matt interfaces with the product and engineering team and with sales and marketing to create content and events that educate and inspire adoption of the technology. Matt is based at MarkLogic headquarters in San Carlos, CA and in his free time he is an artist who specializes in large oil paintings.