Founders Online: A Lesson in Performance

April 17, 2014 Data & AI, MarkLogic

This post is a snapshot of the talk “Planning for Growth with and without Performance Metering,” given at MarkLogic World by David Sewell, Editorial and Technical Manager for University of Virginia Press, with support from Tim Finney, Lead Programmer, based in Perth, Australia.

MarkLogic is great for large enterprises running large applications, but it is also great for small shops that want to do great things. Founders Online launched in summer 2013, providing public access to almost 150,000 searchable documents from six of the Founding Fathers: George Washington, Benjamin Franklin, John Adams, Thomas Jefferson, Alexander Hamilton, and James Madison. The site, a joint venture between the University of Virginia Press and the National Archives, and powered by MarkLogic, is fast and scalable, delivering sub-second response times to thousands of concurrent users. Surprisingly, however, Founders Online was developed by a very small team on a relatively small budget.

Here are some quick facts about the project:

  • Small Team:  1.5 dedicated FTE to develop the site
  • Big Data:  150,000 searchable documents with an average size of 2MB
  • Fast Queries:  15,300 documents in 0.02 seconds
  • Serious Scale:  120 ms response time with five thousand concurrent users

So, how did the Founders Online team achieve such high performance?

According to David Sewell, three key elements helped Founders Online achieve these performance results:

1.  Leverage the XML Data Model

All of the text from the letters was transcribed and transformed to XML. Each letter was then stored as an individual document within MarkLogic, making up a collection of 150,000 documents. For querying the XML, the team avoided using XPath node traversal, which was too slow and created hard-coded links and expansions. An example of the simple code in production for search queries is below:

search:search(
	$q-full,
	c:map-search-options($map),
	$start,
	$length
)
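For readers unfamiliar with the Search API, here is a minimal, self-contained sketch of the same call. The query text, the inline options, and the paging values are placeholders assumed for illustration; the production code builds its options with the team's own c:map-search-options helper, which is not shown in the talk.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Placeholder values, assumed for illustration only. :)
let $q-full := "liberty AND constitution"
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <return-facets>false</return-facets>
  </options>
let $start := 1    (: index of the first result to return :)
let $length := 10  (: number of results per page :)
return search:search($q-full, $options, $start, $length)
```

The call returns a search:response element containing the matching results for the requested page.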

2.  Rethink the Code

The team had to move away from legacy code and strategies and embrace new approaches. To help, they relied heavily on MarkLogic’s Query Performance and Tuning Guide. The team also used the XQMVC framework, which is like many MVC frameworks for languages such as Java, Python, PHP, and Ruby, except that it is designed specifically for building complex applications in XQuery. Some of the other key things the team did included:

  • Using maps instead of session fields
  • Using run-time switches
  • Ignoring bottlenecks possibly deriving from search internals
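As a rough illustration of the first point, request state can be carried in an in-memory map and passed to functions, rather than being written to and read from session fields. The keys and values below are hypothetical and do not come from the Founders Online code.

```xquery
xquery version "1.0-ml";

(: Hypothetical request state held in a map instead of session fields. :)
let $map := map:map()
let $_ := map:put($map, "q", "liberty")
let $_ := map:put($map, "start", 1)
let $_ := map:put($map, "length", 20)
return (
  (: Any function that receives $map can read the values back by key. :)
  map:get($map, "q"),
  map:get($map, "start"),
  map:get($map, "length")
)
```

A map like this can be handed to a helper such as the c:map-search-options($map) call shown earlier, so a single argument carries everything the search needs.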

With the new architecture, they were able to query 15,300 documents in 0.02 seconds.

An example of the application code showing a lexicon function is below:

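(: The FGEA prefix must be bound to the application's namespace elsewhere in the module. :)
let $publ := 'JSMN'
(: Read id attribute values from the lexicon and keep those that occur more than once. :)
let $duplicates :=
	cts:element-attribute-values(
		xs:QName('FGEA:mapData'),
		fn:QName('', 'id'),
		(),
		(),
		cts:collection-query($publ)
	)[cts:frequency(.) gt 1]
return count($duplicates)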

3. Rely Heavily on Caching

The team moved from dynamic to static wherever possible, both in rendering and search results, by relying on caching. They did this by deploying Nginx as a front-end caching proxy; creating an HTML cache in MarkLogic to avoid the need for run-time XSLT rendering; and caching output from searches, facets, and result pages in the database for potential reuse. The documents in the search cache are stored as binaries in MarkLogic to avoid index overhead. Because nothing is indexed, a document call simply pulls the cached content in as XHTML, which is very efficient. An example of the code is below:

binary {
  xs:hexBinary(
    xs:base64Binary(
      xdmp:base64-encode(
        xdmp:quote($HTML-node)
      )
    )
  )
}
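To show how such a binary might be used, here is a minimal read-through cache sketch: return the stored markup if the cache document exists, otherwise render, store, and return it. The cache URI scheme and the placeholder markup are assumptions for illustration, and the real site renders with XSLT rather than a literal element.

```xquery
xquery version "1.0-ml";

(: Hypothetical cache URI; the actual URI scheme is not described in the talk. :)
let $cache-uri := "/cache/html/example-document.bin"
let $cached := fn:doc($cache-uri)/binary()
return
  if (fn:exists($cached)) then
    (: Cache hit: decode the stored bytes back into a string of markup. :)
    xdmp:binary-decode($cached, "UTF-8")
  else
    (: Cache miss: stand-in for the expensive rendering step. :)
    let $HTML-node := <div xmlns="http://www.w3.org/1999/xhtml">rendered letter</div>
    let $serialized := xdmp:quote($HTML-node)
    let $_ :=
      xdmp:document-insert(
        $cache-uri,
        binary { xs:hexBinary(xs:base64Binary(xdmp:base64-encode($serialized))) }
      )
    return $serialized
```

Because the cached copy is a binary document, MarkLogic does not index its contents, which is the index-overhead saving the team describes.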

Using this approach to caching, the site showed serious improvements in query speeds. A 90-page document that originally took 19 seconds of query time on the old platform could be delivered in as little as 1.86 ms. IBM’s Global Technology Services also tested the application and found that even with 5,000 concurrent users, average response time was still only 120 ms.

*Load testing by IBM Global Technology Services using SOASTA, Inc.

Using these tactics to optimize performance, the Founders Online team was able to build a successful app that will eventually go on to support 90 volumes of over 175,000 of the founders’ letters.

Matt Allen

Matt Allen is a VP of Product Marketing Manager responsible for marketing all the features and benefits of MarkLogic across all verticals. In this role, Matt interfaces with the product and engineering team and with sales and marketing to create content and events that educate and inspire adoption of the technology. Matt is based at MarkLogic headquarters in San Carlos, CA and in his free time he is an artist who specializes in large oil paintings.