Warning: text() is a code smell

January 10, 2012 Data & AI, MarkLogic

I often see the text() node test being used in much of the XQuery code I come across. It is, of course, perfectly legal and valid to use, and sometimes essential. But most often I see text() being used where the string() function should be used instead, making the code a sitting duck, waiting to be broken. Moreover, text() is rarely needed in simple queries. In my experience, it’s used inappropriately far more than it’s used appropriately, so I feel perfectly confident positing this: text() is acode smell. If you have a habit of using text() a lot, then read on.

The simplest way to explain what I mean is through an example. Consider the following query, which is intended to output one <li> element per <item> element in the input:

<ul>{
  for $item in doc("/test.xml")/list/item/text()
  return <li>{$item}</li>
}</ul>

And consider some sample input (at “/test.xml”):

xdmp:document-insert("/test.xml",
<list>
  <item>My first item</item>
  <item>My second item</item>
  <item>My third item</item>
</list>
)

Running the above query yields exactly the desired result:

<ul>
  <li>My first item</li>
  <li>My second item</li>
  <li>My third item</li>
</ul>

So what’s the problem? text() did just fine. Is it essential here, though? If we were to leave it out, using:

<ul>{
  for $item in doc("/test.xml")/list/item(:/text():)
  return <li>{$item}</li>
}</ul>

Then the <item> elements themselves would get copied to the result, which is not what we want:

<ul>
  <li>
    <item>My first item</item>
  </li>
  <li>
    <item>My second item</item>
  </li>
  <li>
    <item>My third item</item>
  </li>
</ul>

Good point. But now let me throw a wrench in your code. Let’s say test.xml now looks like this:

xdmp:document-insert("/test.xml",
<list>
  <item>My first<!--was the second--> item</item>
  <item>My second item</item>
  <item>My third item</item>
</list>
)

Here is the new output from the original query that uses text():

<ul>
  <li>My first</li>
  <li> item</li>
  <li>My second item</li>
  <li>My third item</li>
</ul>

Uh-oh. Now it’s doing what the code instructed (output one <li> per text node), but it’s not doing what we wanted. The presence of the comment (<!–was the second–>) caused the first <item> to contain not one, but two text nodes. No need to fret, we just need to fix the code. Let’s move the text() part to the expression inside the <li> element constructor:

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{$item/text()}</li>
}</ul>

Now we get the original desired output. Problem solved, right? …Not so fast. I’ve got another wrench:

xdmp:document-insert("/test.xml",
<list>
  <item>My first item</item>
  <item>My <em>second</em> item</item>
  <item>My third item</item>
</list>
)

Here’s what our newly fixed query outputs for the above document:

<ul>
  <li>My first item</li>
  <li>My  item</li>
  <li>My third item</li>
</ul>

Where did the word “second” go? Well, it’s not a text node child of <item>, so it didn’t get copied through. Only the text node children of <item> were copied through, just as your code instructed.

You may be thinking: give me a break! You changed the data, and your code broke — big deal. That doesn’t mean text() is a code smell. After all, I can fix it again like this (by using //text() instead of /text()):

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{$item//text()}</li>
}</ul>

To which I would have one question: do you always want to have to set yourself up for chasing down subtle bugs when seemingly innocuous changes are made to your data? There’s a more robust way. It’s called the string() function:

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{string($item)}</li>
}</ul>

The string() function converts its argument to a string. In the case of a node, it returns the string-value of the node. In the case of an element (or a document node), the string-value of the node is the concatenation of the string-values of all its descendant text nodes. Lo and behold, that exactly what we were looking for! Now it doesn’t matter if any inline markup constructs — sub-elements, comments, or processing instructions — are added. Our code will continue to work as intended.

If you call string() with no arguments, the context node is taken to be the implicit, default argument. In other words, string() is short for string(.). That means you can use it as a direct replacement for text() in many cases, and it will give you the desired result without being so easily breakable:

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{$item/string()}</li>
}</ul>

As I mentioned before, the text() node test (remember it’s not a function even though it looks like one) has its perfectly legitimate uses. Out of curiosity, I did a search on the code base for a production application I worked on that uses XQuery and XSLT code, and I only found two instances of text() in all of the view-related code. They were both cases where using text() was essential (and string() couldn’t be used instead). In contrast, there were many uses of string(). I mention this not in order to suggest that my code is absolutely exemplary (Ha!), but that I’m at least practicing what I’m preaching. If you’ve got a different view, feel free to share it in the comments section below!

Additional Resources

Read about how name() is a code smell as well. And even more smelly.

Evan Lenz