Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove last ORDER BY for collection joins #24360

Merged
6 commits merged into from
Aug 12, 2021
Merged

Remove last ORDER BY for collection joins #24360

6 commits merged into from
Aug 12, 2021

Conversation

roji
Copy link
Member

@roji roji commented Mar 9, 2021

Here's a draft attempt, probably wrong in various ways 🤣 I've tested and modified the baselines in ComplexNavigationsQuerySqlServerTest only for now, if it's a good start, I can continue.

The approach is simple - just have a pending ordering which gets applied only when another ordering is about to be applied (we already like pending things so much).

After implementing the above, I realized something - can't we drop all orderings on any collection joins which doesn't have a deeper join referring back to it? For example, in the following:

SELECT * FROM a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.a_id = a.id

Since b and c are both at the same level (dependents of a), don't we only need to order on a.id in this query?

Fixes #19828

@roji roji requested a review from smitpatel March 9, 2021 21:05
@roji roji marked this pull request as draft March 9, 2021 21:05
@roji
Copy link
Member Author

roji commented Mar 9, 2021

Hmm, composite keys :)

@roji
Copy link
Member Author

roji commented Mar 11, 2021

can't we drop all orderings on any collection joins which doesn't have a deeper join referring back to it?

This seems like it could be done for queries with identity resolution, but queries without it would break as the shaper depends on the ordering for detecting entity boundaries in the resultset. Separating out to #24377.

@roji
Copy link
Member Author

roji commented Mar 11, 2021

Re composite keys... Assuming Blogs with Posts, where Post has a composite key, loading with this PR as is gives:

SELECT [b].[Id], [b].[Name], [p].[Id1], [p].[Id2], [p].[BlogId]
FROM [Blogs] AS [b]
LEFT JOIN [Post] AS [p] ON [b].[Id] = [p].[BlogId]
ORDER BY [b].[Id], [p].[Id1] -- no [p].[Id2]

So only one of the composite key's columns has been chopped off. Getting rid of the other one is trickier: the inner select expressions _identifier is a flattened list, which doesn't group together multiple columns of the same composite key. So this seems to require more serious refactoring.

However, unless I'm missing something, there shouldn't be an issue with leaving this unneeded ordering - it's only a further missed optimization for the composite key scenario, which we can always handle later if we think it's valuable.

@roji roji force-pushed the ThatLastOrderBy branch 3 times, most recently from db61e87 to a783d59 Compare March 11, 2021 14:40
@roji roji marked this pull request as ready for review March 11, 2021 15:45
@roji
Copy link
Member Author

roji commented Mar 11, 2021

All tests are passing - including on PostgreSQL (which is notorious for ordering things differently). Note that ordering of related entities is now no longer deterministic, so entity asserters may need to apply ordering (see the example in OwnedQueryTestBase.cs, without which the PG test failed).

Any further ideas?? This looks correct/safe to me, but maybe @smitpatel @maumar have some other ideas.

@@ -1645,7 +1662,7 @@ static Expression RemoveConvert(Expression expression)
{
var updatedColumn = identifier.Column.MakeNullable();
_childIdentifiers.Add((updatedColumn, identifier.Comparer));
AppendOrdering(new OrderingExpression(updatedColumn, ascending: true));
AppendOrdering(new OrderingExpression(updatedColumn, ascending: true), isPending: true);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to create an AppendPendingOrdering(List<OrderingExpression> orderings) maybe instead, to solve the issue with composite keys? Then when another ordering is added (pending or non-pending), the whole previous pending list is applied.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misunderstanding the code, when SelectExpression.ApplyCollectionJoin is called, innerSelectExpression has already has a single list of all identifier columns included in it; if it's a subquery over 5 different tables, than it contains a flattened list which doesn't make any distinction between which column corresponds to which table (and so which columns belong together as a composite key). Of course, this could be fixed with further refactoring - but optimizing further for the composite key case seems a bit less important.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a tree of collection joins, and an inner expression has at least one join, then the ordering requirements for that inner expression will already have been applied a few lines above through innerOrderingExpressions, if I'm not mistaken. At least this happens in for example Multi_level_include_one_to_many_optional_and_one_to_many_optional_produces_valid_sql. So I guess my approach will work...

Copy link

@Emill Emill Mar 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this case:

db.Set<Level1>().Include(l1 => l1.Level2s).ThenInclude(l2 => l2.Level3s)
                .Include(l1 => l1.AnotherLevel2s);

In this case the order by items must be identifiers of Level1, Level2, Level3.

I tried to find a test case in the test suite for the above Include pattern, but couldn't find one. Maybe consider add such a test case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's missing, it would definitely be good to add it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Emill am looking at this again for 6.0. Re our conversation above, I do seem to be seeing a problem, see test ComplexNavigationsCollectionsQuerySqlServerTest.IncludeCollection8 as an example. The PR causes the following changes in the SQL:

-ORDER BY [l].[Id], [t].[Id], [t].[Id0], [t].[Id1]");
+ORDER BY [l].[Id]");

This is because all three identifier orderings of the inner query end up as pending orderings in the outer query, and never actually get added. innerOrderingExpressions is empty for this case - there are no orderings in the inner query, just identifiers that need to be ordered by.

Assuming I haven't missed anything trivial and given that the 6.0 feature deadline is approaching fast, I think it would be OK to merge #24360 for non-composite key orderings and revisit composite ones later - tell me what you think.

The code I'm working on is in https://github.com/roji/efcore/tree/ThatLastOrderBy (based off of your #24401)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following offline conversation with @Emill, am closing this for now - we'll remove only the last ordering for 6.0, and pursue going further for composite keys in the future.

@Emill
Copy link

Emill commented Mar 14, 2021

Just an FYI, I did a simple performance test in PostgreSQL using explain analyze. The outer table has half a million rows and the inner table has two rows per outer element. Using two order by items, the query takes ~1100 ms. After removing the last order by item, the query takes ~700 ms.

Even though PostgreSQL adds an incremental sort (the outer elements are traversed through an index scan) when using two order by items, and the inner sort will in this case sort only 2 elements per outer element, it still adds quite a significant overhead.

It would be interesting to know if MySQL also uses incremental sort or if it uses a full sort. In any case, I hope MySQL will be able to optimize away the sort altogether and just perform an index scan on the outer table to fulfill the order by, when using only one order by item.

@roji
Copy link
Member Author

roji commented Mar 14, 2021

@Emill thanks for the perf measurement - it's good to get further confirmation that this helps.

@Emill
Copy link

Emill commented Mar 14, 2021

I implemented the approach I described at https://github.com/Emill/efcore/tree/ThatLastOrderBy, and it passes all the tests (except for that the sql queries don't match the expected, of course).

@roji
Copy link
Member Author

roji commented Mar 15, 2021

@Emill I've looked at the code again, and I think you're right. Basically, this code copies all the orderings from the inner select expression to the outer, so this code adding pending orderings ends up only copying the inner pending orderings to the outer, as pending orderings. Unless I'm mistaken, this means that all identifiers coming from a single table are preserved as a single pending group as they bubble up the outer queries.

Would you like to submit a PR against my PR branch, including the SQL baselines (assuming the SQL changes make sense of course 🤣)? If not, let me know and I can work on it.

@roji
Copy link
Member Author

roji commented Mar 15, 2021

Ah, I think what I wrote was already written by you in #24360 (comment) :)

@Emill
Copy link

Emill commented Mar 15, 2021

Ok made a PR at #24401. Please go ahead and fix the SQL in the tests.

The test case I talked about above shows an important fact that we must add pending orderings for the first join in the outer expression. Since the inner orderings that are already applied miss the last column, this last column will be finally added when the second outer join is processed, when it applies the pending ordering from previous join.

@smitpatel
Copy link
Member

Also consider cases where the collection itself has user specified orderings so we don't mess up orderings there.

@Emill
Copy link

Emill commented Mar 15, 2021

Also consider cases where the collection itself has user specified orderings so we don't mess up orderings there.

Good point. I saw some tests where this was the case, i.e. where an inner expression was ordered by some text column. It seems this is covered since the orderings added for the collection shapers to work are always added last, after the user-defined orderings, for every level.

So something like db.Level1.Select(l1 => new { a = l1.Level2s.OrderBy(l2 => l2.SomeColumn), b = l1.Level2Others.OrderBy(l2 => l2.SomeColumn2) }) will be ordered by l1.Id, a.SomeColumn, a.Id, b.SomeColumn2.

@roji
Copy link
Member Author

roji commented Mar 15, 2021

FYI we'll be looking at increasing test coverage for composite key scenarios before merging this, so it might take a bit.

@ajcvickers ajcvickers marked this pull request as draft April 23, 2021 17:53
@roji roji marked this pull request as ready for review August 10, 2021 23:18
@roji roji removed the blocked label Aug 10, 2021
@roji
Copy link
Member Author

roji commented Aug 10, 2021

As per #24360 (comment), rebased and retested this (SqlServer/Sqlite/Npgsql). @smitpatel @maumar your move.

@roji
Copy link
Member Author

roji commented Aug 10, 2021

Split out the further optimization for composite keys to #25481.

{
Check.NotNull(orderingExpression, nameof(orderingExpression));

if (orderingExpression.Equals(_pendingOrdering))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to de-dupe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should apply right away. The purpose of pending ordering to remember last ordering we would actually end up putting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should apply right away.

Also was not sure. But yeah, it seems pretty sure that trying to append an ordering with pending=true when the same ordering is already pending should just do nothing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to de-dupe

Looking at this again, I'm not sure this is needed...

  • If X is in _pendingOrdering, it already passed dedup to get in there in the last round.
  • If anything else was appended since (pending or not), X would be appended to the actual orderings at that point.
  • If X comes in exactly when X is pending (nothing in between):
    • If the 2nd X comes in with pending=true, we need to append
    • If the 2nd X also comes in pending, I'm not sure. However, that seems edge-casey, and appending when we don't have to isn't a bug, just a missed optimization...

Am I missing something? Do you have a specific scenario in mind?

@roji roji requested a review from smitpatel August 11, 2021 10:30
@smitpatel
Copy link
Member

If pending ordering is first class in SelectExpression and we add ordering for inner level collections also, why there is no changes to SQL in inner level. All SQL changes are very last ordering at the end of SQL.

Wouldn't say I like this implementation. Given we have a code path to apply collections as join, we should try to put avoid appending ordering part of that implementation.

This reverts commit 5a37a7c.
@roji
Copy link
Member Author

roji commented Aug 11, 2021

If pending ordering is first class in SelectExpression and we add ordering for inner level collections also, why there is no changes to SQL in inner level. All SQL changes are very last ordering at the end of SQL.

Remember we only chop the last ordering which comes out of identifiers in an inner query. I did a quick survey and the ORDER BYs that I'm seeing in inner queries are almost always explicit user orderings coming before skip/take/whatever (and other unrelated stuff) - we don't want to touch those. Is there any scenario where we would have orderings in a subquery which come from a identifiers in a collection join?

Wouldn't say I like this implementation. Given we have a code path to apply collections as join, we should try to put avoid appending ordering part of that implementation.

Am open to change! Can you point me to where you think this should live?

@roji
Copy link
Member Author

roji commented Aug 12, 2021

Wouldn't say I like this implementation. Given we have a code path to apply collections as join, we should try to put avoid appending ordering part of that implementation.

@smitpatel made the pending ordering logic self-contained in ApplyProjection as discussed, take a look and tell me if you want any other changes. (BTW it might be possible to make other orderings added by ApplyProjection pending - I gave it a quick shot but got some weird test failures, so deferring).

Re the test, we can wait and see what @maumar comes up with!

@roji
Copy link
Member Author

roji commented Aug 12, 2021

OK @smitpatel, I think I have a better understanding now - hopefully this looks good.

@roji roji requested a review from smitpatel August 12, 2021 22:02
@roji roji requested a review from smitpatel August 12, 2021 22:21
@ghost
Copy link

ghost commented Aug 12, 2021

Hello @roji!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost
Copy link

ghost commented Aug 12, 2021

Apologies, while this PR appears ready to be merged, I've been configured to only merge when all checks have explicitly passed. The following integrations have not reported any progress on their checks and are blocking auto-merge:

  1. Azure Pipelines

These integrations are possibly never going to report a check, and unblocking auto-merge likely requires a human being to update my configuration to exempt these integrations from requiring a passing check.

Give feedback on this
From the bot dev team

We've tried to tune the bot such that it posts a comment like this only when auto-merge is blocked for exceptional, non-intuitive reasons. When the bot's auto-merge capability is properly configured, auto-merge should operate as you would intuitively expect and you should not see any spurious comments.

Please reach out to us at [email protected] to provide feedback if you believe you're seeing this comment appear spuriously. Please note that we usually are unable to update your bot configuration on your team's behalf, but we're happy to help you identify your bot admin.

@ghost ghost merged commit 5bcdd2e into main Aug 12, 2021
@ghost ghost deleted the ThatLastOrderBy branch August 12, 2021 23:08
@roji
Copy link
Member Author

roji commented Aug 12, 2021

Thanks for the guidance @smitpatel

@roji
Copy link
Member Author

roji commented Aug 13, 2021

Remind me to never use auto-merge again (because of the squash).

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove unnecessary *last column* in ORDER BY when joining for collection
4 participants