Using Python to Interact with Data in Progress MarkLogic

August 07, 2023 Data & AI, MarkLogic

Introduction

Many data scientists use Python and other adjacent tools (Jupyter Notebook, pandas, etc.) to manipulate and view data in their work. Python integrates with MarkLogic to provide unique ways to access and manage data through MarkLogic Server. MarkLogic Server is a trusted multi-model database with built-in search functionality for creating value with complex data. MarkLogic provides a Client REST API to start using MarkLogic with little prior knowledge. By utilizing a few familiar tools, MarkLogic’s versatile technology makes it easy to use MarkLogic Server with any Python application.

In this post, we will learn how easy it is to interact with data in MarkLogic Server using Python scripts in Jupyter Notebook. We will also learn how to convert documents into rows to view data in a pandas DataFrame. Find the examples below in our GitHub repository for further reference.

Getting Started

Before you can begin, you must install:

Create a New Notebook

Start a new notebook with the jupyter notebook command from a terminal window. Open a new file:

In the first cell, run the following script:

# Change this if needed to point to your MarkLogic instance.
base_url = "http://localhost"
 
# The admin account that can be used to create a new user and role.
adminUser = "admin"
adminPassword = "admin"
 
# The user to create which will be used for all of the examples.
user = "python-blog-user"
password = "pyth0n"
role_name = "python-blog-role"  
 
import json
import requests
from requests.auth import HTTPDigestAuth
 
admin_session = requests.Session()
admin_session.auth = HTTPDigestAuth(adminUser, adminPassword)
 
user_session = requests.Session()
user_session.auth = HTTPDigestAuth(user, password)

 

This script initializes a connection to the Server using the Python requests HTTP library. We will use this library to send several simple HTTP requests to the MarkLogic Manage REST API and the MarkLogic Client REST API. The ability to receive HTTP requests is a significant aspect of integrating Python development with MarkLogic.

Note: MarkLogic recommends changing the "base_url" and "adminPassword" values depending on the configuration of your Server.

Create a MarkLogic Role

Next, run the following script:

# Create a MarkLogic role that allows a user to run all of the examples below.
# See https://docs.marklogic.com/guide/security/intro for more information on MarkLogic security.
 
admin_session.post(
    f"{base_url}:8002/manage/v2/roles",
    headers={"Content-type": "application/json"},
    data=json.dumps({
        "role-name":role_name,
        "role":["tde-admin", "rest-writer", "qconsole-user"],
        "privilege":[
            {"privilege-name":"xdmp:document-get", "action":"http://marklogic.com/xdmp/privileges/xdmp-document-get", "kind":"execute"},
            {"privilege-name":"unprotected-collections", "action":"http://marklogic.com/xdmp/privileges/unprotected-collections", "kind":"execute"},
            {"privilege-name":"any-uri", "action":"http://marklogic.com/xdmp/privileges/any-uri", "kind":"execute"}
        ]
    })
)

 

This script sends a JSON message from the admin user to the Manage REST API. This creates a new role named "python-blog-role" which inherits several configurations necessary for the subsequent scripts in the procedure. While you can use the MarkLogic "admin" user, we encourage creating a user with fewer privileges for this kind of role.

Create a MarkLogic User

Then, run the following script:

# Create a new MarkLogic user with the role that was just created.
 
admin_session.post(
    f'{base_url}:8002/manage/v2/users',
    headers={'Content-type': 'application/json'},
    data=json.dumps({"user-name": user, "password": password, "role":[role_name]}),
)

 

This script creates a new MarkLogic user named "python-blog-user."

Load Data

To start interacting with the data, we must first add data.

Run this script:

# Load 500 JSON employee documents into the Documents database.
 
r = requests.get("https://raw.githubusercontent.com/marklogic/marklogic-spark-connector/master/src/test/resources/500-employees.json")
 
for employee in json.loads(r.text):
    user_session.put(
        f'{base_url}:8000/v1/documents',
        params={'uri': employee['uri'], 'collection': 'employee'},
        data=json.dumps(employee['value'])
    )
 
print("Finished loading employees.")

 

This script downloads a JSON file that contains 500 employee objects. Then, it loads each object as a new document in the database using the MarkLogic documents REST Service, a Client API for documents. The database collects these documents inside a MarkLogic collection, named "employee." Navigate to the Query Console to view the loaded documents. Switch the database field to "Documents" and click "Explore." The documents appear in the "Explorer" list.

Search Employee Documents

MarkLogic Server can both store and query data. When MarkLogic stores new documents, it creates an index for immediate search. In a new cell, run the following query:

# Search all employee documents, printing out the response from MarkLogic.
 
r = user_session.get(
    f'{base_url}:8000/v1/search',
    headers={'Accept': 'application/json'},
    params={'collection': 'employee'}
)
print(json.dumps(json.loads(r.text), indent=2))

 

This query sends a search request for the collection "employee" to the MarkLogic search REST service (a Client API for queries), which returns a JSON result containing snippets of the first ten documents.

To make the query more specific, enter the following script:

# Search employee documents containing "San Jose" and verify only 1 result is returned.
 
r = user_session.get(
    f'{base_url}:8000/v1/search',
    headers={'Accept': 'application/json'},
    params={'collection': 'employee', 'q': '"San Jose"'}
)
results = json.loads(r.text)
assert results['total'] == 1
print(json.dumps(results, indent=2))

 

This query includes a search string with the "q" parameter and a value of "San Jose." This informs MarkLogic to only return results that contain the phrase "San Jose." This task does not require the user to configure anything in the Server itself. The ability to search for any text is part of the MarkLogic universal index, which is immediately available after loading data into the database.

Update The Data

The MarkLogic documents REST service allows the user to modify documents. Run the following script to alter the contents of a document:

# Update an employee to contain the phrase "San Jose".
 
url = f'{base_url}:8000/v1/documents'
uri = '/data/employees/093caccf-f7ed-4572-a8fa-6390caf4d20e.json'
 
r = user_session.get(url, params={'uri': uri})
doc = json.loads(r.text)
doc['City'] = 'San Jose'
doc['State'] = 'CA'
user_session.put(url, params={'uri':uri}, data=json.dumps(doc))

 

The document now contains the phrase "San Jose."

In a new cell, run the following script to search for "San Jose" to verify that the previous script modified the document:

# Search for San Jose again and verify that 2 results are now returned.
 
r = user_session.get(
    f'{base_url}:8000/v1/search',
    headers={'Accept': 'application/json'},
    params={'collection': 'employee', 'q': '"San Jose"'}
)
results = json.loads(r.text)
assert results['total'] == 2
print(json.dumps(results, indent=2))

 

The query now returns two results with the phrase "San Jose."

Create a View

While MarkLogic stores data as documents and the examples above interact with data as documents, many data analytics tools require data to be in rows. To utilize those tools, we must convert the documents into rows. MarkLogic’s Optic API supports this conversion by retrieving data as rows.

A common method in MarkLogic for projecting rows from documents is to create a view, which effectively creates an indexed table that the user can query via Optic or SQL. You can create a view using MarkLogic’s TDE Feature, which utilizes templates to model how the view should appear.

Run the following script to access this feature:

# Load a TDE template into the Schemas database.
# This creates a view projecting rows from the employee documents.
# See https://docs.marklogic.com/guide/app-dev/TDE for more information.
 
template = {
    "template": {
        "context": "/",
        "collections": ["employee"],
        "rows": [
            {
                "schemaName": "example",
                "viewName": "employee",
                "columns": [
                    {"name": "GivenName", "scalarType": "string", "val": "GivenName"},
                    {"name": "Surname", "scalarType": "string", "val": "Surname"},
                    {"name": "BaseSalary", "scalarType": "double", "val": "BaseSalary"},
                    {"name": "City", "scalarType": "string", "val": "City"},
                    {"name": "Department", "scalarType": "string", "val": "Department"},
                ],
            }
        ],
    }
}
 
admin_session.put(
    f'{base_url}:8000/v1/documents',
    data=json.dumps(template),
    headers={'Content-Type': 'application/json'},
    params={
        'uri': '/employee-template.json',
        'collection': 'http://marklogic.com/xdmp/tde',
        'database': 'Schemas'
    }
)

 

This script also uses the MarkLogic documents REST Service to load a JSON document, but this time, the document appears in the default "Schemas" database. Additionally, the document appears in a specific collection that informs MarkLogic that it is a TDE template that should be used to construct a view. Switch to the "Schemas" database to view the template:

Query for Rows

Now that we have a view, we can query for rows. Run the following:

# Retrieve employees as JSON rows using an Optic query, printing the response from MarkLogic.
# See https://docs.marklogic.com/11.0/guide/optic-guide/en/getting-started-with-optic.html for more information.
 
optic_query = 'op.fromView("example", "employee").where(op.eq(op.col("City"), "San Jose"))'
r = user_session.post(
    f'{base_url}:8000/v1/rows?column-types=header',
    headers={
        'Content-type': 'application/vnd.marklogic.querydsl+javascript',
        'Accept': 'application/json'
    },
    data=optic_query
)
results = json.loads(r.text)
print(json.dumps(results, indent=2))

 

This script sends an Optic query to the MarkLogic rows REST Service (a Client API for rows) to select rows from the view where the "City" column in each row has a value of "San Jose."

Create a pandas DataFrame

The Python pandas library is a common tool for analyzing and manipulating data. The MarkLogic rows REST Service returns rows as a CSV, which is useful for creating a pandas DataFrame.

Run the following script:

# Retrieve employees as CSV data and create a pandas DataFrame.
!pip install pandas
import pandas as pd
import io
 
query = 'op.fromView("example", "employee", "").where(op.eq(op.col("City"), "San Jose"))'
r = user_session.post(
    f'{base_url}:8000/v1/rows',
    headers={
        'Content-type': 'application/vnd.marklogic.querydsl+javascript',
        'Accept': 'text/csv'
    },
    data=query
)
 
df = pd.read_csv(io.StringIO(r.text))
print(df.head())

 

This script generates a DataFrame which appears below the cell. After the rows convert into a CSV, the pandas read_csv function creates a pandas DataFrame.

Conclusion

The sample task above demonstrates how easy it is to use Python to interact with data in MarkLogic. Using familiar tools like Jupyter Notebook and pandas show how the user can build any Python application with MarkLogic. Developers and data scientists can use the Python HTTP requests library to load and access data in MarkLogic with ease. They can also convert rows into CSV data to utilize the Python pandas library for any project. With the MarkLogic REST API to simplify loading and updating data, users experience a more efficient process overall and integrate any Python application with MarkLogic's data platform.

Rekiah Hinton

Rekiah Hinton is an Associate Information Developer for Progress MarkLogic. She enjoys reading and playing chess.

Read next Better Data: Progress’ Acquisition of MarkLogic Is a Win for Customers