How to Bolster Data and AI Projects with MarkLogic Flux

November 04, 2024 Data & AI, MarkLogic

Organizations house information in many different systems, workflows and formats, yet that data must be used together to form a holistic view. Unifying data and creating relationships between the data points yields a deeper level of insight.

However, these systems don’t usually speak the same language, and managing and accessing data for operational or analytics purposes has become increasingly complex. This complexity stems largely from the explosion of data volumes and types, with organizations accumulating overwhelming amounts of unstructured and semi-structured data. As data collection continues to grow, a large portion of that data remains unused for analytics or decision-making.

Organizations that want to consolidate multiple data sources across distributed environments into a low-latency, scalable, high-performance data hub need a straightforward solution to get data in and out of their data platform. They also need the ability to easily transform or repurpose that data once it’s in MarkLogic to support new business requirements.

The Challenges of Effective Production Data Flows

The Progress MarkLogic team has been helping organizations integrate data of every shape, size and format for many years. The common dilemma they all face: disorganized data. In addition to dealing with the volume, organizations must handle the velocity, variety and veracity of data necessary to fuel new projects.

Building effective production data pipelines to import, transform and load information, often called plumbing, usually involves intricate schema designs to reconcile siloed data, sophisticated transformation logic and complex aggregation and grouping. This results in:

  1. Repetitive overhead that creates significant inefficiencies for data and dev teams
  2. Prolonged project timeframes that stall the delivery of projects
  3. Scalability and performance issues that drive up integration and operational costs

Supporting transformations can be highly computationally demanding, so you need a high level of agility built into your data tooling to support just-in-time, high-volume data jobs. Luckily, MarkLogic Flux simplifies all your data movement use cases, so you can easily support any analytics, operational or AI workflow. This includes transformations like splitting documents and adding vector embeddings to documents stored in MarkLogic Server on demand.

MarkLogic Flux for Scalable Data Movement

Both data and IT teams are responsible for managing the movement, preparation and security of enterprise information to generate business value. They are often overwhelmed with requests to fetch specific data subsets, add or update data sources and create new reports and dashboards for business intelligence.

This is why the Progress MarkLogic team released MarkLogic Flux, a single, extensible application that helps developers and data engineers more easily connect data sources to MarkLogic Server. With MarkLogic Flux, teams can import, export, copy and process large volumes of structured and unstructured data via an easy-to-use command-line interface. Flux can also be embedded in your enterprise applications to support any data flow to and from MarkLogic.

MarkLogic Flux adds improved data connectivity, data transformation and access capabilities to the core functionality of the MarkLogic platform, now supporting vector data, significantly more data sources—such as relational databases and cloud storage systems—and better performance through faster data ingestion. The MarkLogic connector for Apache Spark and Progress DataDirect JDBC driver integration tap into the vast relational data ecosystem of SQL-based sources, providing seamless data access for analytics and business intelligence.

MarkLogic Flux allows you to leverage all your data, unstructured types and relational formats, to provide a comprehensive perspective for enhanced decision-making.

Data Access for Analytics and BI

Let’s take an example from a common use case: a company merger. Companies grow through mergers and acquisitions constantly and this often leads to technical debt and poor interoperability.

Imagine a major health insurance provider that had grown at an astonishing rate through multiple acquisitions. With nearly 60 ERP systems, the organization faced a major challenge with schema design.

Because traditional relational stores require you to model all data upfront, analyzing all of these systems for schema design alone would have taken over two years.

Taking an agile data approach, the customer loaded data as-is into MarkLogic and followed the envelope pattern to harmonize the data they require for analytics and APIs. This approach allowed them to accomplish the task in months instead of years.

Let’s now explore how MarkLogic Flux makes these types of projects much easier. These systems have varying underlying databases and data structures. MarkLogic Flux will allow you to connect to these databases directly and load a variety of data.

Ingesting and exporting clean and traceable data are now significantly easier.

Importing Data

MarkLogic Flux simplifies large-scale data ingestion operations by importing structured and unstructured data formats from various sources, including databases, local filesystems and cloud storage like Amazon S3.

Importing from Files

Constructing rich search applications, semantic knowledge graphs, data hubs and more requires data ingestion. MarkLogic Flux supports ingesting many common file formats, such as delimited text, XML and Aggregate XML, JSON and Line Delimited JSON, Avro, RDF, ORC and Parquet.

Importing files such as these can be easily handled with a single command.

./bin/flux import-files \
    --path "../data/claims/" \
    --document-type XML \
    --connection-string "user:password@localhost:8010" \
    --permissions rest-reader,read,rest-writer,update \
    --collections https://example.com/insurance/auto/claim \
    --uri-prefix "/data/claims/" \
    --uri-replace ".*/data/claims/,''"

This command will ingest content into MarkLogic with permissions and a collection that groups the content together for easy retrieval.
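
To see how --uri-replace and --uri-prefix shape the document URIs in the command above, the following shell sketch mimics the rewrite with sed. It assumes Flux starts from each file's absolute path, applies the replacement first and then prepends the prefix. The file path is hypothetical, and the sketch only approximates this behavior rather than calling Flux itself.

```shell
# Hypothetical absolute path of one file found under --path "../data/claims/"
file_path="/home/user/project/data/claims/claim-001.xml"

# Approximates --uri-replace ".*/data/claims/,''":
# strip everything up to and including /data/claims/
relative=$(printf '%s' "$file_path" | sed -E 's#.*/data/claims/##')

# Approximates --uri-prefix "/data/claims/": prepend the prefix to what remains
uri="/data/claims/${relative}"

echo "$uri"
```

Under these assumptions, the document lands at /data/claims/claim-001.xml regardless of where the source directory sits on the local filesystem, which keeps URIs stable across machines.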

Importing from JDBC

Organizations store a great deal of content in relational systems. You can now pull directly from these upstream systems and import their data into MarkLogic. When connecting, you provide a JDBC driver that handles communication with the RDBMS or any other system that supports JDBC connections. Check out Progress DataDirect for a full library of high-performance drivers.

./bin/flux import-jdbc \
    --query "SELECT * FROM collisions" \
    --jdbc-url "jdbc:datadirect:postgresql://localhost:5332;DatabaseName=datagov" \
    --jdbc-driver com.ddtek.jdbc.postgresql.PostgreSQLDriver \
    --jdbc-user user \
    --jdbc-password password \
    --connection-string "user:password@localhost:8010" \
    --permissions rest-reader,read,rest-writer,update \
    --collections https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes \
    --uri-template "/data/collisions/{collision_id}.json" 
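
The --uri-template option above builds each document URI from column values in the row being imported. As a rough illustration only, the shell sketch below substitutes a hypothetical collision_id value into the same template. The ID is made up, and Flux performs the equivalent substitution internally for every row the query returns.

```shell
# Template taken from the import-jdbc command above
template="/data/collisions/{collision_id}.json"

# Hypothetical value of the collision_id column for one row
collision_id="1000001"

# Replace the {collision_id} placeholder with the row's value
uri=$(printf '%s' "$template" | sed "s/{collision_id}/${collision_id}/")

echo "$uri"
```

Because the URI is derived from a column value, choosing a column that is unique per row (such as a primary key) avoids two rows overwriting the same document.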

 

Transforming Records

While data can easily be imported into MarkLogic Server with Flux, it is not always neat in the upstream systems. You can supply data transformations along with the import statement. This will allow you to run a server-side transform in MarkLogic.

You can also run Data Hub transformations as part of the ingestion:

--transform mlRunIngest \
--transform-params flow-name,DataGovFlow,step,1

By supplying a transform, you can start to build better quality, governed data.

Exporting

MarkLogic Flux gives you multiple options for exporting data, including archives, files and remote JDBC connections. Using MarkLogic to integrate and curate data is a common pattern, since data is often inconsistent across systems or gives only a partial view. Once the data has been curated, you can use it in other systems as needed.

Through Flux, you can seamlessly integrate with your larger data and tech ecosystem and enable data services and applications to consume your high-quality, curated data by easily exporting as rows or documents to a variety of destinations.

Exporting to Files

Some downstream systems accept XML or JSON files. For this use case, you can export documents directly to files, optionally compressed into an archive. You can supply a collection name or a MarkLogic CTS query to select the documents. Note that the transform pattern shown above can also be applied to exports.

./bin/flux export-files \
    --connection-string user:password@localhost:8011 \
    --collections Claim \
    --path exports --compression zip


Exporting to Remote RDBMS

You can also export row-based data directly to a remote RDBMS using the JDBC connection and MarkLogic Optic queries. This row data can come from curated views within your MarkLogic database.

./bin/flux export-jdbc \
    --connection-string user:password@localhost:8011 \
    --query "op.fromView('Claim', 'Claim', '')" \
    --jdbc-url "jdbc:datadirect:postgresql://localhost:5332;DatabaseName=insurance" \
    --jdbc-driver com.ddtek.jdbc.postgresql.PostgreSQLDriver \
    --jdbc-user user \
    --jdbc-password password \
    --table "claims" \
    --mode APPEND

 

Conclusion

It is important to work in a connected enterprise ecosystem and dismantle silos. MarkLogic Flux allows you to easily participate in the larger enterprise architecture and collaborate on projects across teams by joining all your data sources and environments together. With Flux, you can direct the data you need anywhere in a scalable and flexible way to fuel any data consumption pattern and business initiative, driving efficient data engineering and simplified data access.

Visit the Flux Getting Started guide or watch our launch webinar for a live demo of MarkLogic Flux.

Drew Wanczowski

Drew Wanczowski is a Principal Solutions Engineer at MarkLogic in North America. He has worked on leading industry solutions in Media & Entertainment, Publishing, and Research. His primary specialties surround Content Management, Metadata Standards, and Search Applications.
