Handling Whitespace in URIs

April 10, 2016 Data & AI, MarkLogic

I am frequently asked about using MLCP (MarkLogic Content Pump) to load documents with whitespace in their filenames or paths, since well-formed database document URIs may not contain whitespace. Let’s discuss the impact of loading such documents and the ways to handle the whitespace.

The Issue with Whitespace in URIs

Let’s suppose we have a directory on the filesystem called “white space dir” and we want to load files in that directory into MarkLogic. This is the example setup:

/
  tmp/
    blog/
      white space dir/
        sample.json

I can load that with MLCP:

$ ~/software/mlcp-8.0-4/bin/mlcp.sh import -username admin -password admin -host localhost -port 8000 -input_file_path "/tmp/blog"

Loading the example document using MLCP will result in the following document URI: “/tmp/blog/white%20space%20dir/sample.json”. Why? According to the section on Character Encoding of URIs in the MLCP User Guide, well-formed database document URIs may not contain whitespace, so MLCP automatically encodes each illegal space as %20.
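
To preview the URI that a given path is likely to produce, the space encoding can be approximated in the shell. This is only a minimal sketch: it handles spaces and nothing else, whereas MLCP may also encode other special characters, and the function name encode_spaces is hypothetical.

```shell
# Hypothetical helper: percent-encode spaces only, approximating what MLCP
# does to a filesystem path that contains spaces.
encode_spaces() {
  printf '%s\n' "$1" | sed 's/ /%20/g'
}

encode_spaces "/tmp/blog/white space dir/sample.json"
# prints /tmp/blog/white%20space%20dir/sample.json
```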

Let’s think about this issue some more. If we load the example document as is and later try to retrieve it, we might build an application that expects to find documents by their filesystem paths, using a call like this:

fn:doc('/tmp/blog/white space dir/sample.json')

The URI based on the filesystem doesn’t match any URI in the database, so we won’t get any results.

What if we build an application that tries to access our document through the REST API? The following call will return a 404 response:

http://localhost:8000/v1/documents?uri=/tmp/blog/white space dir/sample.json

<error-response xmlns="http://marklogic.com/xdmp/error">
  <status-code>404</status-code>
  <status>Not Found</status>
  <message-code>RESTAPI-NODOCUMENT</message-code>
  <message>RESTAPI-NODOCUMENT: (err:FOER0000) Resource or document does not exist: category: content message: /tmp/blog/white space dir/sample.json</message>
</error-response>

Handling the Whitespace in URIs

There are several things to consider when deciding how to handle the whitespace. Are you able to change the paths on the filesystem? Is it better for your application to use another character, like a dash, or to use the encoded whitespace? If you leave the whitespace encoded, is there a way to work around the encoding? Read on for the various methods.

Change the Path on the Filesystem

If changing the filesystem is an option, the easiest solution is simply to rename the paths so that they contain no whitespace. MLCP then has nothing to encode, and the database URIs match the filesystem paths.
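
One way to do the renaming in bulk is sketched below. The function name dash_spaces is hypothetical, and the sketch assumes that no name contains a newline and that no dash-separated name already exists at the target location (mv would otherwise overwrite it or move into it).

```shell
# Hypothetical helper: rename every file or directory under the given root
# whose name contains a space, replacing each run of spaces with a dash.
# -depth lists children before their parents, so a child is always renamed
# while its parent directory still has its original name.
dash_spaces() {
  find "$1" -depth -name '* *' | while IFS= read -r path; do
    dir=$(dirname "$path")
    base=$(basename "$path")
    new=$(printf '%s\n' "$base" | sed 's/  */-/g')
    mv -- "$path" "$dir/$new"
  done
}

# Usage, assuming the example tree from earlier in this post:
# dash_spaces /tmp/blog
```

After running it on the example tree, the directory would be named “white-space-dir”, and an unmodified MLCP import would produce URIs without any encoding.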

Transform the Paths During Load

Another way to adapt to whitespace in the filesystem paths is to transform it during the load. MLCP lets us execute a write transform. We’ll start by writing an MLCP transform that converts the spaces into dashes. Note that by the time the transform runs, the spaces have already been encoded as “%20”.

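xquery version "1.0-ml";

module namespace space = "http://marklogic.com/transform/space-to-dash";

(: MLCP transform: replace each run of encoded spaces in the URI with a dash. :)
declare function space:transform(
  $content as map:map,
  $context as map:map)
as map:map*
{
  (: The pattern is grouped so that + repeats the whole "%20" sequence;
     an ungrouped "%20+" would match "%2" followed by one or more zeros. :)
  map:put(
    $content,
    "uri", fn:replace(map:get($content, "uri"), "(%20)+", "-")),
  $content
};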

There are multiple ways to deploy our transform. We will do it through the REST API.

$ curl --anyauth --user admin:admin -X PUT -i \
    --data-binary @"./space-to-dash.xqy" \
    -H "Content-type: application/xquery" \
    'http://localhost:8000/v1/ext/mlcp/space-to-dash.xqy'

Now we can call the transform as we load our data:

$ ~/software/mlcp-8.0-4/bin/mlcp.sh import -username admin -password admin \
    -host localhost -port 8000 -input_file_path "/tmp/blog" \
    -transform_module /ext/mlcp/space-to-dash.xqy \
    -transform_namespace http://marklogic.com/transform/space-to-dash

The result is a URI that is accessible without encoding: “/tmp/blog/white-space-dir/sample.json”.
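
The rewrite can be checked outside MarkLogic before loading anything. The sketch below mimics the transform’s replacement with sed, assuming the intent is to collapse each run of “%20” sequences into a single dash; the function name uri_to_dashes is hypothetical.

```shell
# Hypothetical helper mirroring the transform's URI rewrite: collapse each
# run of "%20" sequences into a single dash.
uri_to_dashes() {
  printf '%s\n' "$1" | sed -E 's/(%20)+/-/g'
}

uri_to_dashes "/tmp/blog/white%20space%20dir/sample.json"
# prints /tmp/blog/white-space-dir/sample.json
```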

Use Encoded URIs in Requests

Application development is simpler if you avoid encoded URIs, but there may be cases where the URIs must keep their encoded whitespace. If using encoded URIs is a requirement, apply the necessary encoding when requesting a document:

fn:doc(xdmp:url-encode('/tmp/blog/white space dir/sample.json', fn:true()))

We can take the same approach when using the REST API, but an extra step is required. If we ask for “/v1/documents?uri=/tmp/blog/white%20space%20dir/sample.json”, normal URL decoding on the server turns the parameter back into “/tmp/blog/white space dir/sample.json” (resulting in a 404 response). However, if we encode the % signs themselves (as %25), the decoded value matches the stored URI and the request works:

http://localhost:8000/v1/documents?uri=/tmp/blog/white%2520space%2520dir/sample.json
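
The two rounds of encoding can be built up from the filesystem path in the shell. This is a minimal sketch that handles only spaces and percent signs; the function name rest_uri_param is hypothetical, and the host, port, and credentials in the usage comment are the ones assumed throughout this post.

```shell
# Hypothetical helper: build the uri parameter value for a document whose
# database URI contains literal "%20" sequences.
rest_uri_param() {
  # First expression: spaces become %20, matching the URI stored by MLCP.
  # Second expression: each % becomes %25, so that one round of URL decoding
  # yields the stored URI rather than the original spaces.
  printf '%s\n' "$1" | sed -e 's/ /%20/g' -e 's/%/%25/g'
}

rest_uri_param "/tmp/blog/white space dir/sample.json"
# prints /tmp/blog/white%2520space%2520dir/sample.json

# Usage against a running server:
# curl --anyauth --user admin:admin \
#     "http://localhost:8000/v1/documents?uri=$(rest_uri_param '/tmp/blog/white space dir/sample.json')"
```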

Use Search to Obtain the Document URIs

Encoding means that the URI change has effects throughout the application, which is not ideal. For cases where external systems need to address documents directly by predictable URIs, either adjusting those URIs (changing the filesystem before the load) or encoding in the application is necessary. For other applications, however, the solution is to rely on search functionality instead of using the original URIs. When you search in MarkLogic, you are given access to the URI of each matching document. If your application can rely on discovery instead, the problem goes away.

When searching with cts:search(), you can call xdmp:node-uri() on any result to get its document URI. With the Search API or REST API, the response includes the URI of each search result. This method guarantees the correct document URIs, regardless of the original filesystem path or filename.

Dave Cassel