Using XQuery to join two Excel spreadsheets? Why not?

March 21, 2008 Data & AI

This afternoon I was having some fun (not!) reviewing some third party license agreements, when someone emailed me a couple of Excel spreadsheets saying "Hey Minollo! I have these two spreadsheets; one has a list of all employees since 1990 in Progress Corporation and the cities where they lived when they were hired; the other one has a list of the current employees and in which Progress division they work; I need to create a list of all the current employees with details about the division they work in and the town the live (or lived) in. Is there anything that "XQuery thing" you always talk me about can do for me here?"

Well, the temptation to keep reviewing license agreements rather than playing with XQuery was strong... but I resisted! I had to help this guy! :)

I was happy to notice that the spreadsheets had been saved using the Office Open XML (OOXML) format; OOXML is basically a package of XML files describing the document (text, spreadsheet, slides) that it represents. Being XML-based, it's quite easy to inspect it using an XML query language; the package is a zip format, so individual documents can be retrieved using the standard "jar:" file scheme, for example.

So, coming back to the guy asking for help joining two spreadsheets; what can we do for him? If you look at the structure of the two spreadsheets he sent me (attached to this post), you'll see two XML documents that are relevant to us:

  • xl/worksheets/sheet1.xml This file contains the content of each non-empty cell in the spreadsheet. The document is split in elements and each element has a number of cells represented by elements. When the content of a element is a character string, like in this particular case, the element contains a element whose value is the index of the corresponding character string in a list of all the character strings used in the spreadsheet

 

  • xl/sharedStrings.xml This file contains all the character strings used as content of cells in the spreadsheet; the order is relevant, as the character strings are referenced using an index (like described above)

 

To retrieve the whole content of xl/worksheets/sheet1.xml from c:doc1.xlsx using a Java-based XQuery processor like DataDirect XQuery, you can just us the standard fn:doc() function:

[cc lang="xquery"]fn:doc(“jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml”)[/cc]

The index corresponding to the character string contained in the first cell of the first row in the spreadsheet can be retrieved doing:

[cc lang="xquery"]fn:doc(“jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml”)// ooxml:sheetData/ooxml:row[1]/ooxml:c[1]/ooxml:v[/cc]

...where ooxml is a namespace prefix associated to the URI http://schemas.openxmlformats.org/spreadsheetml/2006/main .

And if you have the index of the character string, the character string itself can be retrieved as:

[cc lang="xquery"]fn:doc(“jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml”)// ooxml:si[$theIndex+1]/ooxml:t/text()[/cc]

Easy, isn't it!?

Add some syntax, a couple of FLWOR expressions, a few HTML tags, and here you go! This XQuery just joins the two spreadsheets and creates an easy to read HTML report:

[cc lang="xquery"]declare namespace ooxml= “http://schemas.openxmlformats.org/spreadsheetml/2006/main”; let $doc1Strings := doc(“jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml”)//ooxml:si let $doc2Strings := doc(“jar:file:///c:/doc2.xlsx!/xl/sharedStrings.xml”)//ooxml:si return

{

NameDivisionCity

, for $row1 in doc(“jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml”)// ooxml:sheetData/ooxml:row, $row2 in doc(“jar:file:///c:/doc2.xlsx!/xl/worksheets/sheet1.xml”)// ooxml:sheetData/ooxml:row let $col1-1 := $doc1Strings[$row1/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text() let $col2-1 := $doc2Strings[$row2/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text() where $col1-1 = $col2-1 return let $col1-2 := $doc1Strings[$row1/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text() let $col2-2 := $doc2Strings[$row2/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text() return

{$col1-1}{$col2-2}{$col1-2}

}

[/cc]

It's that simple! Here are the two spreadsheets, if you want to try it yourself: doc1.xlsx, doc2.xlsx

Minollo