A few weeks ago I blogged about XQuery your office documents. We can query our Office Open XML and OpenDocument Format documents, because they are XML based. But what about older formats? For example, there are a zillion of Excel 2003 spreadsheets, and they will be around for another few years. Wouldn't it be great if we can query those, just as we can query OOXML and ODF documents through DataDirect XQuery?
URI Resolvers
JAXP defines the URIResolver interface. A URIResolver turns a URI into a 'virtual' XML document. The concept of URIResolver is supported by most Java-based XPath, XSLT and XQuery implementations and thus also by DataDirect XQuery.
URIResolvers allow you to the query any proprietary format through XQuery, as long as you go through the effort of some Java coding to transform the legacy format to XML.
There are also products using the URIResolver interface to make non-XML data standards available to the XML eco-system. The XML Converters, for example, allow to query many non-XML data formats like the EDI standards X12, EDIFACT, EANCOM, HL7, etc, as well as dBASE, CSV, JSON and many others.
Reading Excel 2003 files
We're going to use the concept of URIResolvers to query our Excel 2003 spreadsheets. But of course, we need a Java implementation to read these XLS files. Apache POI is such Java API to access Microsoft file formats, including Excel 2003 documents.
The advantages of Apache POI for Excel are,
- Cross platform, as there are no dependencies on native Windows DLLs
- The API is powerful enough to translate an XLS into XML
- Formula support, most of the time you want to query the 'data'. For a cell containing a formula (for example SUM(A1:A5)) you're not interested in the formula itself but its result.
The ExcelURIResolver
So we have written a URIResolver to read Excel 2003 files. The xqexcel.jar file is available here, and needs to be added to your CLASSPATH. You also need two Apache POI 3.0.1 jar files, poi-3.0.1-FINAL-20070705.jar and poi-scratchpad-3.0.1-FINAL-20070705.jar. The Apache POI distribution can be downloaded here.
Enabling the ExcelURIResolver in DataDirect XQuery is trivial. If you are using the XQJ API, you can simply register the ExcelURIResolver through your DDXQDataSource.
[cc lang="xquery"]... DDXQDataSource ddds = new DDXQDataSource(); ddds.setDocumentUriResolver( "com.ddtek.xquery.excel.ExcelURIResolver"); ...[/cc]
Using the DataDirect XQuery command-line utility, all you need to do is adding the -r option specifying the class name of the Excel URIResolver, com.ddtek.xquery.excel.ExcelURIResolver.
And you're all set to query Excel 2003 documents through the fn:doc() function. Use the excel: URI scheme, specifying the file name of the .XLS
[cc lang="java"]fn:doc('excel:C:/my office documents/sales2007.xls')[/cc]
The virtual Excel XML document
An Excel 2003 document is called a workbook and can contain several sheets, each sheet is a grid of cells. Our ExcelURIResolver makes the following information available through the virtual XML document.
- All sheets, each with the name of the sheet.
- Within a sheet, all used row. For each row, the row number as available in Excel is available.
- Within a row, each cell being used. Note that you can have a different number of cells within each row. So, its not like a relational table where each row has a fixed number of columns. For each of the cells, the name of the column is available, consistent with the scheme used by Excel.
As an example, consider a sample Excel file, ciscoexpo.xls, from Microsoft's web site. The file consists of one sheet called Sheet1 and it looks as follows.
The columns A and B (Year and Sales) contain plain data (numbers) and the columns C and D (Predication and Ratio) contain formula's. Cell C5 is for example =58.552664*EXP(0.569367*A5) and D5 is =C5/C4.
When we query the complete document,
[cc lang="java"]fn:doc('excel:C:/my office documents/ciscoexpo.xls')[/cc]
We get the following virtual XML document.
[cc lang="xquery"]
Year 1=1990
Year Sales Prediction Ratio
1 70 103.4712285029616
2 183 182.84898408571283 1.767148092578018
3 340 323.1212334568959 1.7671480925780185
...
10 12154 17389.060639019517 1.767148092578018
16 529558.3247555149
[/cc] Accessing cell B5 to B7 in XQuery world,
[cc lang="xquery"]let $xls := fn:doc('excel:C:/my office documents/ciscoexpo.xls') let $sheet := $xls/workbook/sheet[@name="Sheet1"] return $sheet/row/cell[@name=("B5","B6","B7")][/cc]
ConclusionWe have shown how to open data locked up for years in your Excel spreadsheets. We can now query this virtual XML document like we can with any other XML document, opening a wide range of use cases.
- Transform Excel 2003 documents into any XML standard format
- Join your .xls data with your relation database to generate complex XML documents
- Create EDI messages using the XML Converters with data stored in Excel 2003 spreadsheets
- Extract information out of Excel 2003 documents and upload it into your database
- Publish data out of your Excel spreadsheets in PDF format using XSL-FO
- etc
But also, the concept of URI Resolver is powerful, and allows you basically to query any proprietary data through XQuery.