Skip to content

2010 09 06 handling the defaultifempty result operator in a re linq based provider

Fabian Schmied edited this page Sep 6, 2010 · 1 revision

Published on September 6th, 2010 at 21:40

Handling the DefaultIfEmpty result operator in a re-linq-based provider

My recent musings about the DefaultIfEmpty LINQ query operator and how it is represented by re-linq had a cause: On the NHibernate mailing list, there is a discussion going on about (among other things) how to implement DefaultIfEmpty in NHibernate’s LINQ provider. Since that LINQ provider is based on re-linq, I thought I’d write a few things about this.

Let’s take a look at an example query similar to the one I used in my last post. I’ve simplified it a bit because I don’t want to talk about group joins in this post. Group joins are difficult and may be a topic for a future post on their own; here, I want to talk about DefaultIfEmpty. So, here’s the simplified query:

var query = from c in Cooks
            from k in Kitchens.DefaultIfEmpty()
            select new { c, k };

We have a query selecting the Cartesian product of the elements in the Cooks and Kitchens sequences, and we use DefaultIfEmpty so that we get tuples for each cook in the first sequence even if there are no kitchens in the second one.

This is the pretty-printed query model of that query:

from Cook c in TestQueryable<Cook>()  
from Kitchen k in {TestQueryable<Kitchen>() => DefaultIfEmpty()}  
select new <>f__AnonymousType2f`2(c = [c], k = [k])

As in the last post, we can see that the DefaultIfEmpty query operator causes a sub-query to be created, which selects its items from a query source (TestQueryable<Kitchen>()) before applying the operator.

The easiest way to handle the DefaultIfEmpty result operator in this query model is to keep the sub-query structure intact while analyzing the query. To illustrate what I mean, I’m going to show you step-by-step how I’d translate the second from clause to SQL if I were writing a SQL-generating LINQ provider (which, in fact, I am):

Okay, so here I am, I’ve implemented a QueryModelVisitor class which visits the query model’s clauses, one after the other, and generates SQL while doing so. I’ve already visited the main from clause, but no other clause yet, so my generated SQL looks as follows:

SELECT ???  
FROM [CookTable] [t0]

Now, I’m in the VisitAdditionalFromClause method of my visitor, inspecting the additional from clause. An additional from clause usually means a Cartesian product, which is called “CROSS JOIN” in SQL:

SELECT ???  
FROM [CookTable] [t0]  
CROSS JOIN

I apply my ExpressionTreeVisitor to the clause’s FromExpression, and the visitor finds a SubQueryExpression. I keep the sub-query structure for now, and recursively generate SQL for the nested query by invoking a new instance of my QueryModelVisitor on the sub-query’s query model. I visit the sub-query’s main from clause and its select clause:

(SELECT [t1].*  
FROM [KitchenTable] [t1])

Then, my visitor’s VisitResultOperator method is invoked, I detect the DefaultIfEmpty result operator. What now? I need a way to ensure that the query I just generated will always return at least one item; if there is none, I want NULLs – this is the semantics of DefaultIfEmpty.

In SQL, the way to achieve this is to perform a left outer join – but with what? I’m in my sub-query, I don’t have a left side I can join with? Shouldn’t I have descended into the sub-query so quickly? Maybe I should have flattened it somehow?

Yes, I could have done this – in this specific scenario. However, the important thing to notice about DefaultIfEmpty is that it can occur in the most absurd of places – for example on the source of the outermost main from clause. In such cases, there is no simple way to flatten the subquery. If I want a generic solution, I’ll have to find a way that always achieves the DefaultIfEmpty semantics in SQL. And, luckily, there is one:

(SELECT [q2].*  
FROM (SELECT NULL AS [Dummy]) [Dummy]  
LEFT OUTER JOIN (SELECT [t1].* FROM [KitchenTable] [t1]) [q2] ON 1 = 1  
)

Can you see what I’ve done? I’ve put the sub-query emitted so far into yet another sub-query – and used this as the right side of a left outer join – with a dummy left side. Such are the tricks of those who translate LINQ queries to SQL…

Okay, I’ve now finished translating the sub-query, I can insert the sub-query into the outer query and finish translating the outer query:

SELECT [t0].*, [q3].*  
FROM [CookTable] [t0]  
CROSS JOIN (SELECT [q2].*  
FROM (SELECT NULL AS [Dummy]) [Dummy]  
LEFT OUTER JOIN (SELECT [t1].* FROM [KitchenTable] [t1]) [q2] ON 1 = 1  
) [q3]

This does not look pretty, no it doesn’t. But you know what? It works. And, most importantly, this way of translating DefaultIfEmpty to SQL always works, no matter where a DefaultIfEmpty stands in the query. The reason for why it works is that I was able to keep the sub-query structure re-linq presented to me, and find an implementation for DefaultIfEmpty that works locally to that sub-query. This a convenient effect of the sub-query + result operators representation re-linq chooses for complex queries.  If you can find a way to keep the sub-queries and honor the result operators locally, you can literally translate every LINQ query to your target query system.

But what about optimization?

There is definitely potential for optimization here: Whenever I’m inserting sub-queries into their outer queries, I can apply a sub-query optimizer. The optimizer checks whether the sub-query contains a TOP, COUNT, GROUP BY, or similar construct, and, if it doesn’t, realizes that the sub-query’s content can be inlined. It first does so at the innermost sub-query, the one within the join:

(SELECT [t1].*  
FROM (SELECT NULL AS [Dummy]) [Dummy]  
LEFT OUTER JOIN [KitchenTable] [t1] ON 1 = 1  
)

Then, it does the same at the next level:

SELECT [t0].*, [t1].*  
FROM [CookTable] [t0]  
CROSS JOIN (SELECT NULL AS [Dummy]) [Dummy]  
LEFT OUTER JOIN [KitchenTable] [t1] ON 1 = 1

If I also implement the dummy select in such a way that it only appears if the LEFT OUTER JOIN would otherwise stand as the first table in the generated SQL, I’ve got a good final SQL query from my translation:

SELECT [t0].*, [t1].*  
FROM [CookTable] [t0]  
LEFT OUTER JOIN [KitchenTable] [t1] ON 1 = 1

Note that the CROSS JOIN has just become a LEFT OUTER JOIN. Since I’m doing all this while still in the VisitAdditionalFromClause method, this shouldn’t be too hard. There is a bit of work to do to get the SELECT clause to reference the right sub-query (q2, not q3), but that’s totally doable as well.

While working on the re-linq SQL backend (which supports DefaultIfEmpty in exactly the same way outlined above, but at the time of this writing still lacks the sub-query optimizer (which is planned)),  I’ve learned one thing: when dealing with complex LINQ queries, it’s really important to solve transformations as locally as possible. I always try to perform my transformations just in the scope of the current query model; consolidation of the sub-query with the outer query can be done when the visitors return to that outer query.

Any other way lies madness. In my opinion.

- Fabian

Clone this wiki locally