Good XML design and performance

by Evan Lenz Posted on July 14, 2011

MarkLogic has always tried to ensure that well-designed XML performs well “as is” in MarkLogic Server. For example, if your schema uses descriptive and unique element names, your application code will be not only clean and readable but fast as well. On the other hand, if your schema contains a lot of generic element names (such as “item”) used in multiple ways, your code (in XQuery or XSLT) will be harder to read, and you may also need to do some extra legwork to get the best performance.

For example, consider a schema that has a lot of elements named <group> (or <section> or <item> or some other generic name) but which play very different roles—in this case indicated by the value of an attribute:

<doc>
  <group type="widget">
    <item type="sprocket">...</item>
    ...
  </group>
  <group type="employee">
    <item type="executive">...</item>
    ...
  </group>
  <group type="place">
    <item type="city">...</item>
    ...
  </group>
</doc>

Since MarkLogic indexes elements by name, it will not automatically distinguish between your various <group> elements, because they all share the same name. That said, certain queries will still run at full speed, such as when you restrict your results to a particular attribute value with a simple XPath expression like this: //group[@type eq 'widget']. MarkLogic Server will use its Universal Index to avoid reading any documents that don’t have a <group> element whose “type” attribute is equal to “widget”. So we’re okay so far.
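To make the attribute-restricted query concrete, here is a minimal sketch that runs the same kind of expression in Python with the standard library's ElementTree. It only shows which elements the XPath selects. It says nothing about how MarkLogic Server resolves the query from its indexes, and the sample values (S1, E1, C1) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document using the generic <group>/<item> naming from the article.
# The text values S1, E1 and C1 are placeholders, not from the original.
GENERIC_DOC = """
<doc>
  <group type="widget"><item type="sprocket">S1</item></group>
  <group type="employee"><item type="executive">E1</item></group>
  <group type="place"><item type="city">C1</item></group>
</doc>
"""

root = ET.fromstring(GENERIC_DOC)

# ElementTree supports only a subset of XPath 1.0, so the predicate uses
# "=" where the XQuery expression in the text uses "eq".
widget_groups = root.findall(".//group[@type='widget']")

# Collect the text of every <item> inside the matching groups.
widget_items = [item.text for g in widget_groups for item in g.findall("item")]
print(len(widget_groups), widget_items)
```

Only the one <group> whose type is “widget” is selected, even though all three groups share the same element name.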

But there are still a few issues here. For one thing, your code will not be very readable. This expression:

//group[@type eq 'widget']/item[@type eq 'sprocket']

is pretty noisy compared to, for example:

//widgets/sprocket

which is what your code would look like if you used more descriptive element names.
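As a quick side-by-side, the sketch below evaluates both expressions in Python's ElementTree against two small, made-up documents that hold the same data under each naming style. The placeholder values (S1, G1, C1) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Same data, two naming styles. The values S1, G1 and C1 are placeholders.
generic = ET.fromstring(
    '<doc><group type="widget"><item type="sprocket">S1</item>'
    '<item type="gear">G1</item></group>'
    '<group type="place"><item type="city">C1</item></group></doc>'
)
descriptive = ET.fromstring(
    "<doc><widgets><sprocket>S1</sprocket><gear>G1</gear></widgets>"
    "<places><city>C1</city></places></doc>"
)

# Generic names: every step needs a predicate to say what it really means.
noisy = generic.findall(".//group[@type='widget']/item[@type='sprocket']")

# Descriptive names: the element names carry the meaning by themselves.
clean = descriptive.findall(".//widgets/sprocket")

generic_texts = [e.text for e in noisy]
descriptive_texts = [e.text for e in clean]
print(generic_texts, descriptive_texts)
```

Both queries select the same sprocket. The difference is purely in how much the path has to spell out.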

The other issue is that you may run into problems when you want to do something more advanced, such as word search within subsets of your documents. Specifically, restricting your search results to all group elements except widget groups will be challenging. (Fields can help you do the converse, but in that case you may have to enumerate every kind of group you want results for.)
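To pin down what “all group elements except widget groups” means, here is a rough in-memory sketch in Python. It is not how MarkLogic Server would run such a search, and it does not use fields or any MarkLogic API. The function name groups_containing and the sample text are hypothetical, and the word matching is deliberately naive (whitespace-separated tokens only).

```python
import xml.etree.ElementTree as ET

# Made-up sample content; every group happens to mention the word "steel".
doc = ET.fromstring("""
<doc>
  <group type="widget"><item type="sprocket">steel sprocket</item></group>
  <group type="employee"><item type="executive">steel magnate</item></group>
  <group type="place"><item type="city">Pittsburgh, the steel city</item></group>
</doc>
""")

def groups_containing(root, word, exclude_type):
    """Return the type of every <group> whose text contains `word`,
    skipping groups whose type attribute equals `exclude_type`."""
    hits = []
    for group in root.iter("group"):
        # Every <group> has to be inspected, because the element name
        # alone does not tell us which kind of group this is.
        if group.get("type") == exclude_type:
            continue
        tokens = "".join(group.itertext()).lower().split()
        if word.lower() in tokens:
            hits.append(group.get("type"))
    return hits

matches = groups_containing(doc, "steel", exclude_type="widget")
print(matches)
```

The exclusion has to be expressed as a test on an attribute value of each candidate element, which is the part that does not map neatly onto name-based indexing.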

A further issue with the above design is that, despite the potential benefit of being data-driven and extensible, it’s not possible to apply schema constraints that are unique to specific classes of <group> elements (at least in W3C XML Schema 1.0). You can’t, for example, restrict the content of a <group> element to <sprocket> and <gear> elements only when its type attribute is “widget”. If you want different content models, then you need to use different element names. Starting off with generic <group> elements may lead you down a slippery slope. You’ll find yourself using other generic names like “item”, and even then you won’t be able to effectively restrict the “type” values to only the applicable ones.
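The idea that a content model hangs off an element name can be illustrated with a toy checker in Python. This is only an analogy, not W3C XML Schema validation. The ALLOWED_CHILDREN table and the invalid_children function are hypothetical, and the allowed child lists are assumptions chosen to match the element names used in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical content models, keyed by element name. With descriptive
# names, each container can simply declare what it may contain.
ALLOWED_CHILDREN = {
    "widgets": {"sprocket", "gear"},
    "employees": {"executive"},
    "places": {"city"},
}

def invalid_children(root):
    """Return (parent, child) tag pairs that violate ALLOWED_CHILDREN."""
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no content model declared for this element name
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems

good = ET.fromstring("<doc><widgets><sprocket/><gear/></widgets></doc>")
bad = ET.fromstring("<doc><widgets><city/></widgets></doc>")

print(invalid_children(good))
print(invalid_children(bad))
```

With a single generic <group> name there would be only one key in such a table, so every kind of group would have to share one content model.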

Here’s what an arguably better (and more readable) schema design would look like:

<doc>
  <widgets>
    <sprocket>...</sprocket>
    ...
  </widgets>
  <employees>
    <executive>...</executive>
    ...
  </employees>
  <places>
    <city>...</city>
    ...
  </places>
</doc>
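If you already have documents in the generic style, moving to descriptive names is mostly mechanical. The following is a minimal Python sketch of such a conversion, under a few assumptions: the CONTAINER_NAMES mapping and the to_descriptive function are hypothetical, every group type is assumed to appear in the mapping, and each item's type value is assumed to be a legal element name.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from group/@type to a descriptive container name.
CONTAINER_NAMES = {"widget": "widgets", "employee": "employees", "place": "places"}

def to_descriptive(generic_root):
    """Build a new <doc> tree that uses descriptive element names."""
    new_root = ET.Element("doc")
    for group in generic_root.findall("group"):
        container = ET.SubElement(new_root, CONTAINER_NAMES[group.get("type")])
        for item in group.findall("item"):
            # The item's type attribute becomes the element name.
            child = ET.SubElement(container, item.get("type"))
            child.text = item.text
    return new_root

# Placeholder values S1, E1 and C1 stand in for real content.
generic = ET.fromstring(
    '<doc><group type="widget"><item type="sprocket">S1</item></group>'
    '<group type="employee"><item type="executive">E1</item></group>'
    '<group type="place"><item type="city">C1</item></group></doc>'
)
descriptive = to_descriptive(generic)
result = ET.tostring(descriptive, encoding="unicode")
print(result)
```

A real migration would also need to carry over other attributes and nested content, which this sketch ignores.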

To conclude, there are lots of good reasons to use descriptive, unique element names whenever possible, and doing so plays nicely with human readers, XQuery, XSLT, XML Schemas, and MarkLogic Server.

