This series of posts pertaining to the Office OpenXML formats provides an excellent introduction on how to create and manipulate the XML components that compose Word, PowerPoint, and Excel documents. To get started with OpenXML and MarkLogic, review the document components as well as general XML and code examples. However, please note that each version of Microsoft Office introduces new namespaces and new XML elements, and Office applications can change how they produce and consume XML as well. As a result, you may need to update the examples to work with versions of Office after 2010. Since we first published this series, MarkLogic has continued to introduce new tools and programming languages that are useful in working with these documents as well, but the fundamentals demonstrated here remain the same.
The MarkLogic Sample Authoring App for Word was designed as a jump start for anyone doing Microsoft Office development in MarkLogic, showcasing the use of the MarkLogic Toolkit for Word in MarkLogic version 8 with Office 2007 or 2010. The toolkit uses content controls — individual controls that you can add and customize in templates, forms, and documents — to let authors enrich Word documents, and associate and manage metadata. In addition, users can search and reuse existing controls and their metadata in new Word documents.
Getting Started
In order to use this application, install the add-in and supporting XQuery library from the toolkit. The toolkit provides a guide for installation and comes with its own separate sample application.
To get started using this sample application, first update three areas with the URL of the application. See the README.txt, as well as the Sample Authoring App Developer’s Guide, both included in the download for the application.
The following is an overview of the Sample Authoring App's functionality; the application can also be configured to meet additional requirements. The guide also includes a “files of interest” section, in case you just want to get in there and start hacking.
Enriching Content
In the Authoring screenshot below, there are two sub-tabs to choose from, Controls and Boilerplate. The Controls tab provides a palette of content controls, which we select from the navigation bar at the top. The screenshot below shows the Rich Text controls selected. The other content control icons represent picture, calendar, drop down, and combo box.
Enrich a document section by selecting text or sections of content in the Word document and then clicking the button to add a rich text control around the selection. Only rich text controls can embed other controls and rich sections of content.
Clicking a button without anything selected in the active document results in the control being added at the current cursor position:
The number of buttons available for each control type, and their labels, is determined by the controls configuration. For combo box and drop down controls, the selectable list items are also configurable (see the developer’s guide for more detail). You can rename the controls, and you can create complex controls that embed other controls, so that an author can click a single button to insert a whole form of controls for entry, if required.
The configuration is a simple XML file and it generates the HTML for the buttons on your palette as well as the associated JavaScript functions required for inserting the controls. The names you use for enrichment will be the values you use for search once you’ve saved your document to MarkLogic Server.
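To make the idea of a configuration-driven palette concrete, here is a minimal sketch in Python. It is not the app's actual configuration format or generator: the element and attribute names in the sample XML and the JavaScript function name insertControl are assumptions invented for illustration. See the developer’s guide for the real format.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical configuration: these element and attribute names are
# invented for this sketch and are not the app's real schema.
CONFIG = """
<controls>
  <control type="richtext" title="Policy" tag="policy"/>
  <control type="richtext" title="Recommendation" tag="recommendation"/>
  <control type="dropdown" title="Status" tag="status">
    <item>Draft</item>
    <item>Final</item>
  </control>
</controls>
"""


def render_buttons(config_xml):
    """Return one HTML button string per configured control."""
    root = ET.fromstring(config_xml)
    buttons = []
    for control in root.findall("control"):
        kind = control.get("type", "")
        title = control.get("title", "")
        tag = control.get("tag", "")
        # insertControl is a hypothetical JavaScript function name.
        onclick = "insertControl('%s', '%s')" % (kind, tag)
        attrs = 'class="%s" onclick="%s"' % (escape(kind), escape(onclick))
        # List items only exist for drop down and combo box controls.
        items = [item.text or "" for item in control.findall("item")]
        if items:
            attrs += ' data-items="%s"' % escape("|".join(items))
        buttons.append("<button %s>%s</button>" % (attrs, escape(title)))
    return buttons


for button in render_buttons(CONFIG):
    print(button)
```

The point of the sketch is the design choice: because the palette is generated from data, adding or renaming a control means editing the configuration, not the HTML or JavaScript.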
When you click a control within the active document, the Properties section beneath the control palette shows information about the control you are currently authoring within, which helps you navigate the controls. This information includes:
- an icon for the type of control that is currently active in the document
- the title of the control
- the tag value for the control, shown after the title
- if the control is embedded, a parent label that identifies the parent of the active control
The Properties section also allows you to lock and unlock controls and their content. Notice that the Recommendation control is found within a Policy control. Neither the Policy control nor its contents is currently locked.
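The properties listed above correspond to markup that Word stores for each content control (a w:sdt element in WordprocessingML). As a rough sketch, and not the app's own code, the following Python reads the title (w:alias), tag (w:tag), and lock state (w:lock) of each control and works out its parent control. The sample document fragment is made up to mirror the Policy and Recommendation example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Made-up fragment: a Recommendation control nested in a Policy control.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}"><w:body>
  <w:sdt>
    <w:sdtPr><w:alias w:val="Policy"/><w:tag w:val="policy"/></w:sdtPr>
    <w:sdtContent>
      <w:p><w:r><w:t>Policy text.</w:t></w:r></w:p>
      <w:sdt>
        <w:sdtPr>
          <w:alias w:val="Recommendation"/>
          <w:tag w:val="recommendation"/>
          <w:lock w:val="sdtContentLocked"/>
        </w:sdtPr>
        <w:sdtContent><w:p><w:r><w:t>Do this.</w:t></w:r></w:p></w:sdtContent>
      </w:sdt>
    </w:sdtContent>
  </w:sdt>
</w:body></w:document>"""


def _val(props, name):
    """Return the w:val attribute of child w:<name>, or None if absent."""
    if props is None:
        return None
    element = props.find(f"w:{name}", NS)
    return element.get(f"{{{W}}}val") if element is not None else None


def describe_controls(element, parent_title=None, found=None):
    """Walk the tree and collect title, tag, parent, and lock per control."""
    if found is None:
        found = []
    for child in element:
        if child.tag == f"{{{W}}}sdt":
            props = child.find("w:sdtPr", NS)
            title = _val(props, "alias")
            found.append({
                "title": title,
                "tag": _val(props, "tag"),
                "parent": parent_title,
                # No w:lock element means the control is unlocked.
                "lock": _val(props, "lock") or "unlocked",
            })
            describe_controls(child, title, found)
        else:
            describe_controls(child, parent_title, found)
    return found


for control in describe_controls(ET.fromstring(DOCUMENT_XML)):
    print(control)
```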
The Boilerplate tab provides configurable buttons that insert boilerplate Word documents saved in MarkLogic Server into the document being authored. The Word, picture, and chart icons represent the types of documents that we can insert. The documents are inserted as XML at the current cursor position in the active document.
Note: You can think of this as inserting a .docx into a .docx, but the inserted document is not an embedded binary object. The inserted XML becomes a natural part of the document being authored.
Working with Metadata
Each time we add a content control to a document, a custom XML metadata part is added to the .docx being authored. This custom metadata part is associated by ID with the control we’ve added. In the metadata pane, we can edit the metadata values for the associated controls.
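The association by ID can be pictured with a short Python sketch. It assumes, purely for illustration, that each custom XML part carries the control's w:id value in a dc:identifier element. The wrapper element, the part names, and that choice of field are hypothetical, not the toolkit's documented layout.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"
NS = {"w": W, "dc": DC}

# Made-up main document part containing one content control.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}"><w:body>
  <w:sdt>
    <w:sdtPr>
      <w:alias w:val="Policy"/><w:tag w:val="policy"/><w:id w:val="1001"/>
    </w:sdtPr>
    <w:sdtContent><w:p><w:r><w:t>Travel needs approval.</w:t></w:r></w:p></w:sdtContent>
  </w:sdt>
</w:body></w:document>"""

# Hypothetical custom XML parts, keyed by their name inside the package.
CUSTOM_PARTS = {
    "customXml/item1.xml": (
        f'<metadata xmlns:dc="{DC}">'
        "<dc:identifier>1001</dc:identifier>"
        "<dc:title>Travel policy</dc:title></metadata>"
    ),
    "customXml/item2.xml": (
        f'<metadata xmlns:dc="{DC}">'
        "<dc:identifier>2002</dc:identifier></metadata>"
    ),
}


def metadata_by_control_id(parts):
    """Index custom XML part names by the control ID they carry."""
    index = {}
    for name, xml in parts.items():
        identifier = ET.fromstring(xml).findtext("dc:identifier", namespaces=NS)
        if identifier:
            index[identifier] = name
    return index


def controls_with_metadata(document_xml, parts):
    """Pair each control's ID with its metadata part name (or None)."""
    index = metadata_by_control_id(parts)
    pairs = []
    for sdt in ET.fromstring(document_xml).iter(f"{{{W}}}sdt"):
        id_element = sdt.find("w:sdtPr/w:id", NS)
        control_id = id_element.get(f"{{{W}}}val") if id_element is not None else None
        pairs.append((control_id, index.get(control_id)))
    return pairs


print(controls_with_metadata(DOCUMENT_XML, CUSTOM_PARTS))
```

Whatever field actually carries the ID, the lookup works the same way: the control's ID is the key that ties a region of the document to its metadata part.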
The metadata pane provides a hierarchical tree view of the content controls within a document. When we click a label in the tree, the focus of the active document moves to that control, and the form for its associated metadata is displayed beneath the tree view. We can edit values directly on the form; they are saved automatically as we move between fields.
For larger documents, navigating controls by using the tree view may become tedious, so there’s also the option to click within a control in the active document. When we do this, the tree view focus moves to the associated control label, and the associated metadata form is displayed.
In this sample app, there are three types of metadata forms associated with the controls. We are also using the configurable Dublin Core metadata. We can use other XML for our metadata, and can even have a different custom XML metadata form for each control we define in the control palette. Please refer to the developer’s guide included with the Sample Authoring App.
If we delete a control from the document, its associated metadata part is also removed from the .docx.
Searching within MarkLogic
When you save a .docx to MarkLogic Server, it is automatically unzipped and made available for search and reuse. On this pane, you can search for text found within any content controls in documents saved in MarkLogic.
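A .docx file is a zip package of XML parts, which is what makes the unzip-on-save step possible. The Python sketch below shows the general idea using only the standard library. The sample package is a stand-in (not a valid Word document), and the "_docx_parts" URI naming is an assumption about how the extracted parts are addressed.

```python
import io
import zipfile


def build_sample_package():
    """Build a tiny in-memory zip that stands in for a .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("customXml/item1.xml", "<metadata/>")
    return buffer.getvalue()


def explode(docx_bytes, uri):
    """Map each part of the package to its own URI under the document's URI.

    The "<name>_<extension>_parts/" convention is assumed for this sketch.
    """
    stem, extension = uri.rsplit(".", 1)
    base = f"{stem}_{extension}_parts/"
    parts = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for name in package.namelist():
            parts[base + name] = package.read(name)
    return parts


for part_uri in sorted(explode(build_sample_package(), "/policies/travel.docx")):
    print(part_uri)
```

Once the parts are stored as individual XML documents, the text inside content controls and the custom metadata parts can be indexed and searched like any other XML.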
The results provide a count of search results as well as pagination.
For each control in a result that includes our search text, we see:
- the document title (if present in the extracted document properties part) or the URI of the document in MarkLogic
- the last-modified date of the source document and the user who last modified it
- an icon indicating the type of control in which our text was found
- the name of the control that includes our search text
- a snippet of text, with our search text in bold
- more text for the result, displayed when we roll over the snippet
- an insert button
- an open button
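The snippet with the search text in bold can be illustrated with a small Python sketch. This is not how the app or MarkLogic produces snippets. It is a simplified stand-in that finds the first match, keeps a window of surrounding characters (the width parameter is an arbitrary choice), and wraps the match in a bold tag.

```python
import re
from html import escape


def snippet(text, term, width=40):
    """Return an HTML snippet around the first match of term, or None."""
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    before = escape(text[start:match.start()])
    hit = escape(match.group(0))
    after = escape(text[match.end():end])
    # Mark truncation on either side with an ellipsis.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return f"{prefix}{before}<b>{hit}</b>{after}{suffix}"


print(snippet("Employees must file travel claims within 30 days.", "travel", width=10))
```

A larger width would give the longer text shown on rollover.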
Clicking insert will insert the content control into the active document at the current cursor position. The search result may have embedded controls as well. If any of the inserted controls had associated metadata parts in their source document in MarkLogic, those metadata parts are retained and added to the document being authored. We can view and edit any metadata values by examining them in the metadata tab.
By default, every control added to the document will have a custom metadata part associated with it, so if the source document didn’t include metadata for the inserted control, the application associates a default metadata part. This is part of the configuration and detailed in the developer’s guide.
Clicking open will open the source Word document from MarkLogic Server into Word on the client machine.
Finally, we can restrict searches by using the Filter tab. Click the drop down arrow to see a list of control labels to apply to the search; this list is also configurable. To apply the criteria, click the Filter button. We can then close the filter selection and the filter stays applied.
Compare Metadata Values across MarkLogic
Compare provides a search for metadata values found in custom XML parts within Word documents saved in MarkLogic Server. The drop down list is a configurable list of values. When you select a value from the drop down, MarkLogic searches for any Word documents that include that value in a custom XML part.
The results are a slimmed-down version of the search tab results, showing:
- the title or URI of the source document
- the last-modified date of the source document and the user who last modified it
- a compare button
Clicking compare opens the document in Word alongside the document being authored, using Word’s native merge functionality, so authors can compare the two documents.
Conclusion
The Sample Authoring App is intended to give authors a way to enrich content in Word, as well as to define and identify parts or sections of Word documents for search, reuse, auditing, and tracking.
Because authors can reuse sections of content across documents, and those sections are identified by rich metadata, authors can use MarkLogic to query where document components are being reused. Also, if a document component includes text that needs to be updated across documents, the metadata can be used to update that component in every Word document in MarkLogic that includes it.
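As a closing illustration of the reuse query and bulk update described above, here is a minimal Python sketch over an in-memory stand-in for the database. Everything about the store is hypothetical: the URIs, the sample documents, and the assumption that each control holds a single run of text. In MarkLogic itself this would be a query and update against the stored document parts.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ET.register_namespace("w", W)  # keep the w: prefix when serializing


def make_doc(tag, text):
    """Build a made-up document containing one tagged content control."""
    return (
        f'<w:document xmlns:w="{W}"><w:body><w:sdt>'
        f'<w:sdtPr><w:tag w:val="{tag}"/></w:sdtPr>'
        f"<w:sdtContent><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:sdtContent>"
        "</w:sdt></w:body></w:document>"
    )


# Hypothetical store: URI -> document XML.
STORE = {
    "/docs/a.xml": make_doc("disclaimer", "Valid until 2010."),
    "/docs/b.xml": make_doc("disclaimer", "Valid until 2010."),
    "/docs/c.xml": make_doc("summary", "Quarterly results."),
}


def controls_with_tag(root, tag):
    """Return every content control in root whose w:tag value equals tag."""
    matches = []
    for sdt in root.iter(f"{{{W}}}sdt"):
        tag_element = sdt.find("w:sdtPr/w:tag", NS)
        if tag_element is not None and tag_element.get(f"{{{W}}}val") == tag:
            matches.append(sdt)
    return matches


def find_reuse(store, tag):
    """List the URIs of documents that contain a control with this tag."""
    return sorted(
        uri for uri, xml in store.items()
        if len(controls_with_tag(ET.fromstring(xml), tag)) > 0
    )


def update_component(store, tag, new_text):
    """Replace the text of every matching control; return the updated URIs."""
    updated = []
    for uri in find_reuse(store, tag):
        root = ET.fromstring(store[uri])
        for sdt in controls_with_tag(root, tag):
            text_element = sdt.find(".//w:t", NS)  # assumes one text run
            if text_element is not None:
                text_element.text = new_text
        store[uri] = ET.tostring(root, encoding="unicode")
        updated.append(uri)
    return updated


print(find_reuse(STORE, "disclaimer"))
print(update_component(STORE, "disclaimer", "Valid until 2012."))
```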