Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid cartesian product without query splitting #22435

Open
Timovzl opened this issue Sep 7, 2020 · 16 comments
Open

Avoid cartesian product without query splitting #22435

Timovzl opened this issue Sep 7, 2020 · 16 comments

Comments

@Timovzl
Copy link

Timovzl commented Sep 7, 2020

Query splitting was introduced, but I would personally avoid its drawbacks like the plague.

Luckily, we can usually avoid cartesian products through careful aggregate design. Still, sometimes we truly do want to join multiple 1:N relations. Of course, we would rather not risk a huge data set caused by an incidental high number of child entities.

This problem can be solved without resorting to query splitting.

As an example, say we are selecting one Parent and joining both its Sons and its Daughters.

Basically, since the joined siblings are independent, we have no reason to want them multiplied. This can be accomplished by explicitly instructing the database to keep them separate:

  • Join a set of constants. Since we intend to join two 1:N relations, we will join two constants: LEFT JOIN (SELECT 1 AS Id UNION ALL SELECT 2) AS Splitter.
  • To the Son's join condition, add: AND Splitter.Id = 1.
  • To the Daughter's join condition, add: AND Splitter.Id = 2.

This gives us:

SELECT *
FROM Parents p
LEFT JOIN (SELECT 1 AS Id UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
WHERE p.Id = 1

While transferring a bit more data (mostly in the form of duplication of the Parent), the duplication stays linear and well under control.

When we combine careful aggregate design with avoiding the cartesian product, we have all the tools we need to load reasonable object graphs without introducing the significant drawbacks of split queries.

@roji
Copy link
Member

roji commented Sep 9, 2020

@Timovzl thanks for this proposal, we'll study this and respond soon.

@roji
Copy link
Member

roji commented Sep 12, 2020

@Timovzl this is definitely an interesting proposal and has the potential to improve single queries. However, note that this technique works only for multiple collection navigations being loaded at the same level; nested navigations cannot be loaded in this way, and so even if your proposal is implemented, split queries remain relevant. There are also some scenarios where your technique will cause increased data to be transferred (due to duplication of the parent rows), but I agree the benefits likely outweigh that.

In any case, the 5.0 version is now complete barring bug fixes. We'll look into this more in-depth for 6.0.

@roji
Copy link
Member

roji commented Sep 12, 2020

Note when investigating - do actual cross-database benchmarks to measure and compare perf.

@Timovzl
Copy link
Author

Timovzl commented Sep 13, 2020

@roji I did not mean to imply that split queries will become irrelevant. Just that is really nice to have an alternative. :)

However [...] nested navigations cannot be loaded in this way

I do believe this approach can work for multiple levels of nesting as well. It can be applied as follows:

  • In a query, define a parent as an instance of addressing a table where one or more others are being joined onto it.
  • Define a child as an instance of addressing a table where it is being joined onto another. (Note that intermediate levels act as both parent and child.)
  • Define a 1:1 child as a child where we are certain that it can produce no more than 1 row per parent row.
  • Define a 1:N child as a child that is not a 1:1 child.
  • When applying this feature, for any parent with two or more 1:N children, apply a splitter quasi-table as demonstrated in the original post.

Some points to note: We'll need different names for each splitter quasi-table, e.g. Splitter1, Splitter2. Each one will consist of a number of rows equal to the number of children being joined.

Not only should this approach work for trees of any breadth and depth, but it should be no worse than performing the same query without it! It only applies where a cartesian product is risked, and it avoids it. I'm sure you will do tests to confirm this. I have had good results with it in practice.

As with many of EF's features, I think it would be neat if this one could be applied either per-query or on the registration of a DbContext. I can imagine wanting to use this wherever it may apply (as defined above), or simply having a single query where it comes in handy.

@roji
Copy link
Member

roji commented Sep 13, 2020

However [...] nested navigations cannot be loaded in this way

I do believe this approach can work for multiple levels of nesting as well

Sure. I meant to say that this technique works only for multiple collection navigations being loaded at the same level, but cannot help cases where a single navigation is loaded at each level (unless I'm missing something). For example, following your example above, if instead of a Parent with two collection navigations Sons and Daughters we have Customers, with navigation Orders, which itself has another navigation LineItems, the cartesian explosion effect still occurs when loading all customers with all their orders and line items. Unless I'm mistaken, that would still be a case where only split queries can help.

As with many of EF's features, I think it would be neat if this one could be applied either per-query or on the registration of a DbContext. I can imagine wanting to use this wherever it may apply (as defined above), or simply having a single query where it comes in handy.

This is really too early to say, but it the choice between split and single query already works this way. If we end up implementing the above mode, I don't see why it would be any different.

@Timovzl
Copy link
Author

Timovzl commented Sep 14, 2020

For example, following your example above, if instead of a Parent with two collection navigations Sons and Daughters we have Customers, with navigation Orders, which itself has another navigation LineItems, the cartesian explosion effect still occurs when loading all customers with all their orders and line items. Unless I'm mistaken, that would still be a case where only split queries can help.

You're right in that it doesn't help in this situation, whereas split queries help (at least in theory). But this scenario shouldn't warrant either. Let me illustrate.

If I'm not mistaken, the cartesian explosion is another name for the cartesian product. It refers to the multiplication effect that occurs when loading multiple types of 1:N children onto the same parent.

For example, if parent A has 100 sons and 100 daughters, there would be no single correct way to combine the sons and daughters, and so, instead, every combination is returned. Instead of 100 + 100 = 200 rows, we get 10.000! That is an explosion indeed. If loading 200 rows is acceptable, then loading 10.000 is very likely unacceptable. Note also how even a single pathological entry may cause command timeouts in the application.

In your example, by contrast, we might have 1 Customer with 5 Orders, each with 10 LineItems. The best number of entities we can hope to transfer is 1 + 1*5 + 1*5*10 = 56. The complexity is O(N), where N is the product of the counts, i.e. 1 * 5 * 10.

When we do this as a single query, we know that we get a row for each of the LineItems. Each row contains 3 entities, with the Orders and Customers potentially repeated. As such, we get 1 * 5 * 10 = 50 rows, each containing 3 entities, for a total of 150 entities transferred. The complexity is still O(N). The constant factory has simply gone up from 1.12 to 3.

I believe the takeaway here is that an increase from linear to quadratic (e.g. 200 to 10.000) is unacceptable and warrants a solution, such as split queries or the solution we are talking about here. After all, the overhead of performing multiple queries is safer and generally better than the overhead of going from 200 to 10.000 rows. However, in the latter example, there is no real explosion going on. The constant increases a bit, and we transfer a bit more data, but the overhead is generally less than that of performing multiple queries.

In conclusion, you're right that the proposed solution cannot help with nesting where no multiplication occurs, whereas split queries can, in theory. However, this situation generally needs no help.

Note that the proposed solution can help when multiplication occurs on multiple levels (like this). :)

@roji
Copy link
Member

roji commented Sep 14, 2020

I agree with your analysis that the problem is usually considerably more severe when the collections are on the same level, then when they are nested. However, it's important to understand that it ultimately depends on your specific scenario andBut data.

In the example with 1 Customer/5 Orders/10 LineItems, the columns for the single customer are transmitted 50 times to the client. Now, if the customer happens to have a very heavy field - such as a photo - that data gets duplicated 50 times (just for one customer). With the right data, this could easily be worse than even the 100*100 explosion of the first scenario. So any scenario where data gets duplicated is potentially very dangerous, and split queries eliminate that altogether.

But I'm not disagreeing with anything you've written above. We'd have to thoroughly benchmark your proposal across databases to make sure that there's no surprising perf issues in some scenarios (though I don't expect that would happen), and then evaluate the complexity of producing the query and materializing the results back. Beyond that I think this could be a great improvement.

@smitpatel
Copy link
Member

If I'm not mistaken, the cartesian explosion is another name for the cartesian product. It refers to the multiplication effect that occurs when loading multiple types of 1:N children onto the same parent.

It is same for same level or nested. 100 collection items on same level is 10,000 rows then 100 collection item for nested is also 10,000 rows. Repetition of parents excluding the very last entity is same.

@Timovzl
Copy link
Author

Timovzl commented Sep 14, 2020

@smitpatel We must compare the best possible case to the cartesian product case, to see how much overhead we are causing. 100 sons * 100 daughts = 10.000 children returned, with only 200 distinct children in existence. Returned = (Optimum / 2) ^ 2. Quadratic complexity does not scale.

Having 100 sons who each contain 100 grandsons means there are actually 10.000 descendants in existence! Sure, we're repeating the parent 10.000 times, so there is some growth: the repetition of the parent causes us to return an extra 10.000 entities, and the repetition of the sons another extra 10.000. Returned = 3 * Optimum. That is within a small constant factor of the theoretical optimum. Linear complexity scales.

@Timovzl
Copy link
Author

Timovzl commented Sep 14, 2020

With the right data, this could easily be worse than even the 100*100 explosion of the first scenario. So any scenario where data gets duplicated is potentially very dangerous, and split queries eliminate that altogether.

@roji Fair point!

I appreciate the level of thought that is being given to this proposal.

@smitpatel
Copy link
Member

We must compare the best possible case to the cartesian product case

That is my point and has been said by @roji at multiple point. This works really well for some scenarios but not all and we need to evaluate pros and cons of both sides.

@Timovzl
Copy link
Author

Timovzl commented Sep 15, 2020

An idea hit me that might let us avoid even the multiplication on single navigations per level. I haven't experimented with it yet, but the theory seems solid.

Basically, we would like to repeat as little as possible of any parent in the hierarchy. We could do that by sticking to its primary key, the minimum we need to identify it. We do, however, want that parent's data. Using the same splitter semi-table technique, we could self-join the parent once, separate from its children, in the same way we separate siblings.

Let's use Splitter.Id = 0 for the self-join:

SELECT p.Id, pSelf.*, s.*, d.*
FROM Parents p
LEFT JOIN (SELECT 0 AS Id UNION ALL SELECT 1 UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN Parents pSelf ON pSelf.Id = p.Id AND Splitter.Id = 0
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
WHERE p.Id = 1

A simple example results set might look like this:

p.Id,	pSelf.Id,	pSelf.Name,	s.Id,	s.ParentId,	s.Name,	d.Id,	d.ParentId,	d.Name
1,	1,		Parent,		NULL,	NULL,		NULL,	NULL,	NULL,		NULL
1,	NULL,		NULL,		1,	1,		Son1,	NULL,	NULL,		NULL
1,	NULL,		NULL,		2,	1,		Son2,	NULL,	NULL,		NULL
1,	NULL,		NULL,		NULL,	NULL,		NULL,	1,	1,		Daughter1
1,	NULL,		NULL,		NULL,	NULL,		NULL,	2,	1,		Daughter2

Provided that the database sends NULLs efficiently (such as indicated by a bit field, with no further data for that field), this has the potential to negate the effect of the multiplication almost entirely.

Of course, we'd need to confirm that the self-join is performed efficiently.

@roji
Copy link
Member

roji commented Sep 15, 2020

I had this in mind too when I read your original proposal :)

Transferring nulls is generally pretty efficient, but of course we'd need to confirm. Definitely an interesting direction to explore!

@Emill
Copy link

Emill commented Mar 1, 2021

An approach similar to this is already used in EF6. It works like this.

The case is where there is one parent table and at least two child tables.

Instead of doing

SELECT
parent.id AS parent_id,
parent.col1 AS parent_col1,
son.id AS son_id,
son.col1 AS son_col1,
daughter.id AS daughter_id,
daughter.col1 AS daughter_col1
FROM parent
LEFT JOIN son ON parent.id = son.parent_id
LEFT JOIN daughter ON parent.id = daughter.parent_id
ORDER BY parent.id, son.id, daughter.id

it does the following

SELECT parent_id, parent_col1, son_id, son_col1, daughter_id, daughter_col1, c1
FROM (
    (SELECT CASE WHEN (son.id IS NULL) THEN (NULL) ELSE (1) END AS c1,
    parent.id AS parent_id,
    parent.col1 AS parent_col1,
    son.id AS son_id,
    son.col1 AS son_col1,
    NULL AS daughter_id,
    NULL AS daughter_col1
    FROM parent LEFT JOIN son ON parent.id = son.parent_id)
  UNION ALL
    (SELECT 2 AS c1
    parent.id AS parent_id,
    parent.col1 AS parent_col1,
    NULL AS son_id,
    NULL AS son_col1,
    daughter.id AS daughter_id,
    daughter.col1 AS daughter_col1,
    FROM parent INNER JOIN daughter ON parent.id = daughter.parent_id)
) AS t
ORDER BY parent_id, c1

Don't know why it fetches all the parent columns the second time though, could easily be replaced by nulls (except for the identifier).

@Emill
Copy link

Emill commented Mar 10, 2021

Provided that the database sends NULLs efficiently (such as indicated by a bit field, with no further data for that field), this has the potential to negate the effect of the multiplication almost entirely.

Of course, we'd need to confirm that the self-join is performed efficiently.

Both the EF6 strategy and the self-join strategy have the issue that the "parent" table/subquery must be evaluated twice. If it's a complicated one, you might get bad performance.

For db engines supporting APPLY / JOIN LATERAL, we can instead do the following:

SELECT p.Id, Splitter.Id, p2.*, s.*, d.*
FROM Parents p
INNER JOIN (SELECT 0 AS Id UNION ALL SELECT 1 UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN LATERAL (SELECT p.*) p2 ON Splitter.Id = 0
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
ORDER BY p.Id, Splitter.Id, s.Id, d.Id

In this case no data will be duplicated whatsoever (except the parent identifier).

The same approach could be used in case of just one join, if we want to avoid repeated data over the network.

@Emill
Copy link

Emill commented Mar 14, 2021

Another way of avoiding duplicating the outer table for every inner row would be to add a ROW_NUMBER() OVER (PARTITION BY outer.id). We can then use CASE to replace columns with NULLs if the row number is > 1. Then add this row number != 1 as an order by item to have it emitted first.

In any case, it's a bit sad that there is no good SQL method of ordering the results so that row with common column values are grouped together. To make our shapers work correctly, we don't need a full sorted result by outer.id, but just that all rows having the same order.id are emitted consecutively. With #24360, we can eliminate some unnecessary orderings, but with any solution explained above where we also remove the duplicates from the outer table, we must add something more to order by to get the order the shapers expect, which might hurt performance at the db side, if the result set is large.

If only different db engines could identify this pattern:

SELECT * FROM
(SELECT *, ROW_NUMBER() OVER () AS rn FROM (...) AS t1) t1 -- attach an arbitrary distinct number to each row in t1
LEFT JOIN
(SELECT * FROM ...) t2
ORDER BY t1.rn

then they could completely eliminate any sorting from the query plan and just perform the left join as is - i.e. emit all t2 items consecutively for each t1 item. At least PostgreSQL performs this query exactly as written, which is sad.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants