Indexing LINQ
Microsofts LINQ presents a wonderful sql-like syntax for querying arbitrary collections of objects. Under the hood, however, that sql-like syntax is transformed into what is called a monoid comprehension, which, scary though it may sound, is basically one or more nested loops.
This nested loopiness means that if you do anything more complex than filtering or mapping on a collection, for example a join, you will rapidly end up in a world of performance pain and your query starts performing O(n^m), where m is the number of nested loops.
Consider the following LINQ query:
var q = from c in customers join o in orders on c.Key equals o.Key select new {c.Name, o.OrderNumber};
This code will get translated into something like this:
foreach (var c in customers)
foreach (var o in orders)
if (c.Key == o.Key)
yield return new { c.Name, o.OrderNumber };
For small collections, this approach is fine, but when they start to get larger, the number of times through the loop is going to get out of control.
Relational databases sometimes have to perform the same algorithm, but they also provide the possibility of indexing the collections, which in the case of the example above, would eliminate the necessity of scanning through all the orders checking each to see if its key matches a customer – instead, the index would be used to rapidly identify the small number of matching orders.
You can bet your bottom dollar that Microsoft has something in the works to address this issue – LINQ to objects is basically useless without indexes. In the mean time, however, there is i4o, a project to do exactly that:


May 27th, 2007 at 1:22 pm
They can probably deal with this in a straightforward way by passing the restriction (with values substituted from the outer relation) to the inner relation for each record in the outer relation. The ‘back-end’ can possibly provide a more efficient iterator; otherwise you’d fall back to searching the cartesian-product.
You wouldn’t get most-efficient ordering that way (if the inner relation is smaller than the outer relation, it’d be better to quantify on the inner one), but you’d at least optimally have O(n log m) time, rather than O(n m).
And it would still look like a nested loop, which reasonably explains what it’s really doing in a join.
May 31st, 2007 at 11:40 am
Actually the ‘join’ syntax was designed explicitly not to convert into nested loops. You can do that if you want by using multiple ‘from’ clauses. However, the ‘join’ syntax actually maps to the Join() query operator that builds an index over the second sequence and uses the first sequence to probe into the index. Which is a whole lot faster than nested loops in most cases.
May 31st, 2007 at 6:50 pm
Hi Matt,
Are the indexes are rebuilt each time the query is executed?