A few months ago, my friend Minollo blogged about XQuery against RDBMS: let the engine optimize your SQL. And last week again, a question on some newsgroup attracted our attention.
Imagine you have an XML document with customers, there might be one or a few hundred, and want to retrieve all orders for these customers out of your Oracle database. One could write something as follows, using a general comparison in the where clause.
[cc lang="xquery"]for $order in collection("ORDERS")/ORDERS where $order/O_CUSTKEY = doc("customers.xml")/customers/customer return
{$order/O_CUSTKEY/data(.)} {$order/O_ORDERDATE/data(.)} {$order/O_TOTALPRICE/data(.)} [/cc]
As explained in the white paper Implementing XQuery efficiently for relational databases, DataDirect XQuery always handles general comparisons efficiently and translates them into SQL - taking advantage of database indexes and only retrieving the relevant data out of the database.
Most other XQuery implementations don't translate such general comparisons into SQL, because the second argument of the general comparison is a sequence of customers. In other words, the where clause is not evaluated by the database engine, with all the negative performance and scalability consequences that implies. Basically, all records in the ORDERS table are retrieved out of the database, and filtering is accomplished in memory.
As advised on the newsgroup, one can rewrite the query as follows.
[cc lang="xquery"]for $customer in doc("customers.xml")/customers/customer for $order in collection("ORDERS")/ORDERS where $order/O_CUSTKEY = $customer return
{$order/O_CUSTKEY/data(.)} {$order/O_ORDERDATE/data(.)} {$order/O_TOTALPRICE/data(.)} [/cc]
Both queries yield the same result. Well, to be technically accurate, there are some differences.
- the order of the returned orders might be different. With the second query, the orders are grouped per customer
- in order to make both queries equivalent, with the second alternative, you need to get distinct values out of the customers.xml
But ok, let's assume for a moment both queries yield the same result.
One interesting thing about this second query, compared to the first, is that most XQuery implementations are capable of translating this where clause in SQL as both arguments of the comparison are single values. And you end up with SQL as follows, executed through a nested loop algorithm.
[cc lang="xquery"]select O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE from ORDERS where O_CUSTKEY = ?[/cc]
Further, better implementations will try to minimize the number of roundtrips to the database, and combine multiple requests in a single SQL statement. Smells like a topic for another post, anyway...
Looks great, our initial concerns are answered, the where clause is evaluated by the SQL engine. We're perfectly happy, right?
Unfortunately, there is a significant drawback with the second query! Suppose, for example, we want to order our results by ORDERDATE and TOTALPRICE.
[cc lang="xquery"]for $customer in doc("customers.xml")/customers/customer for $order in collection("ORDERS")/ORDERS where $order/O_CUSTKEY = $customer order by $order/O_ORDERDATE, $order/O_TOTALPRICE return
{$order/O_CUSTKEY/data(.)} {$order/O_ORDERDATE/data(.)} {$order/O_TOTALPRICE/data(.)} [/cc]
Although the where clause is translated into SQL, the order by clause is not. It is simply not possible! Remember that multiple SQL statements will be executed in a nested loop, all these results are buffered in memory, after which we need to order the buffered results. This is clearly a significant drawback of the proposed alternative. Handing order by clauses out of your database engine, implies a serious performance penalty and excessive memory consumption, even with medium sized query results.
Let's go back to our initial query and add the order by clause.
[cc lang="xquery"]for $order in collection("ORDERS")/ORDERS where $order/O_CUSTKEY = doc("customers.xml")/customers/customer order by $order/O_ORDERDATE, $order/O_TOTALPRICE return
{$order/O_CUSTKEY/data(.)} {$order/O_ORDERDATE/data(.)} {$order/O_TOTALPRICE/data(.)} [/cc]
If you have an XQuery implementation that is capable of translating such where clauses into SQL, the topic this post started with, it can as a consequence go further and also translate the order by clause into SQL! With DataDirect XQuery, the execution plan for this last query is as follows. Note that both the where and order by clause are pushed into SQL and processed by your Oracle database.
As you see, choosing the right XQuery implementation can make a huge difference. If you want to find out more about how DataDirect XQuery generates SQL and leverages the strengths of your current database investment, check out this white paper.