Skip to content

Remove Limit/TopN/Sort/DistinctLimit node if it's source is a scalar#441

Merged
kokosing merged 6 commits intotrinodb:masterfrom
Praveen2112:scalar_query_simplification
May 17, 2019
Merged

Remove Limit/TopN/Sort/DistinctLimit node if it's source is a scalar#441
kokosing merged 6 commits intotrinodb:masterfrom
Praveen2112:scalar_query_simplification

Conversation

@Praveen2112
Copy link
Member

No description provided.

@cla-bot cla-bot bot added the cla-signed label Mar 11, 2019
@Praveen2112
Copy link
Member Author

@martint , @findepi , @kokosing Can you please review this ?

@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch from f32461d to f11830c Compare March 11, 2019 13:49
@Praveen2112
Copy link
Member Author

@findepi Have updated it. Please let me know if there are any more changes.

Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also DistinctLimitNode

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update commit message, here in other commits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comments that do not add value to the code, the name PruneTopNOverScalar says the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... when the subplan is guaranteed to produce fewer rows than the limit"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call this "RemoveRedundantLimit"

@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch 2 times, most recently from 53d2872 to 38f4a2a Compare March 12, 2019 12:46
@Praveen2112 Praveen2112 changed the title Remove Limit/TopN/Sort node if it's source is a scalar Remove Limit/TopN/Sort/DistinctLimit node if it's source is a scalar Mar 12, 2019
@Praveen2112
Copy link
Member Author

@findepi , @martint , @kokosing Have updated the code. Please let me know if there is any changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is wrong with limitCount == 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both RemoveRedundantLimit and EvaluateZeroLimit rules are in the same optimizer , we don't want LimitNode with 0 count to be removed if its source is scalar as it might not fire EvaluateZeroLimit rule

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could merge these two rules together, they are very simple and they both remove redundant limit.
Another option is add third function which replaces any plan that isAtMost(node, context.getLookup(), 0) with Values. Then order would not bother.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is add third function which replaces any plan that isAtMost(node, context.getLookup(), 0) with Values. Then order would not bother.

I like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kokosing , @martint Have merged both of the two rules

Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. Also, can you add some tests similar to those in presto-main/src/test/java/io/prestosql/sql/query ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TopN cannot be removed blindly. The ordering matters. You can replace it with a SortNode in this case.

It's only safe to remove if the row count is guaranteed to be 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to RemoveSingleRowSort. "scalar" is no a proper classification for a subquery -- it's a feature of the context in which it's used. I.e., "in a place that expects a scalar value". (The "isScalar" method in QueryCardinalityUtil is misnamed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to RemoveSingleRowDistinctLimit

@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch 2 times, most recently from 5b1c546 to 3ec0abb Compare March 13, 2019 09:43
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could merge these two rules together, they are very simple and they both remove redundant limit.
Another option is add third function which replaces any plan that isAtMost(node, context.getLookup(), 0) with Values. Then order would not bother.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach would be to use something similar to io.prestosql.sql.planner.TestLogicalPlanner#assertPlanContainsNoApplyOrAnyJoin where you would check that there is no Limit or TopN in the plan. Plan assertion is a bit simpler that way. Reasoning of anyNot that is wrapping anyTree might be not trivial, and I am not sure if the pattern is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please upper case SQL keywords, here in other tests below as separate commit before this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, you have just extended support for correlated subqueries a bit ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you extend this rule to support regular MarkDistinctNode as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DisitinctLimitNode was not pruned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the subplan here is not a scalar

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I overlooked GROUP BY.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are losing output symbols here. See hashSymbol in DistinctLimitNode. Also notice that DistinctLimitNode::getOutputSymbolsreturn distinctSymbols which might be different than node.getSource().getOutputSymbols().

I wonder why test didn't find that already, so please make sure that there is test coverage for that. Can you please run your test from TestLogicalPlanner with coverage or debugging to see if you rule was triggered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but the hashSymbol in DistinctLimitNode will be added only if we set optimize_hash_generation as true and IIRC it will be added in HashGenerationOptimizer which will be invoked after this optimizer. So we can safely assume that hashSymbol will be empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then please verify in rule that hashSymbol is empty. Also verify that DistinctLimitNode::getOutputSymbols are same as node.getSource().getOutputSymbols().

@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch 2 times, most recently from 6bd6771 to fd9e2a9 Compare March 14, 2019 03:16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check there is no MarkDistinctNode. You could also extract a method from this and reuse in assertion below, like:

assertFalse(planContainsDistinctNode("SELECT distinct(c) FROM (SELECT count(*) as c FROM orders) LIMIT 10");
assertTrue(planContainsDistinctNode("SELECT distinct(c) FROM (SELECT count(*) as c FROM orders GROUP BY orderkey) LIMIT 10"));

Please do the same for TopN and Sort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I overlooked GROUP BY.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unexpected node for the above query -> format("Unexpected sort node for query: '%s'", query). To the same for all below and above.

@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch from fd9e2a9 to 3105607 Compare March 16, 2019 14:27
@martint martint self-requested a review April 22, 2019 22:07
@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch 2 times, most recently from 582e861 to 354c49f Compare April 29, 2019 14:10
Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Praveen2112, I think there are still some comments that need to be addressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "sor"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import for OPTIMIZED

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should belong to previous commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test SELECT * FROM (VALUES 1,2,3,4,5,6) LIMIT 10?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// cannot enforce LIMIT on correlated subquery

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't you use isAtMost here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but inlining this here would save us one iterative optimizer loop. Also I think you could handle the case with limit = 0 here.

However, as you pointed it could be a matter of taste. Up to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also handle the case where cardinality is 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, DistinctLimit with limit higher than cardinality of its source node can be rewritten to DistinctNode

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will merge that feature to RemoveSingleRowDistinctLimit

@Praveen2112 Praveen2112 force-pushed the scalar_query_simplification branch 2 times, most recently from febdab8 to e4d4d1b Compare May 5, 2019 05:26
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This could be extracted as separate commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message

Prune unnecessary TopNNode

Replace TopN node
1. With a Sort node when the subplan is guaranteed to produce fewer rows than N
2. With it's source node when the subplan produces single row
3. With a Values node when N is 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: else is redundant

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testRedundantTopNRemoval?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please extract each test case as separate test method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please extract each test case as separate test method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please extract each test case as separate test method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why replacing Distinct with Aggregation is better? Shouldn't you use regular DistinctNode here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using MarkDistinct node requires an additional FilterNode and ProjectNode so used the AggregatioNode with no aggregation functions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's ok. MarkDistinct serves a different purpose. @kokosing, there's no explicit DistinctNode -- it's planned as an GROUP BY with no aggregation functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please extract each test case as separate test method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have added each test method for the optimizer we implemented. So should we write each pattern of queries as separate method ?

@martint martint self-requested a review May 6, 2019 23:08
fgwang7w added a commit to fgwang7w/presto that referenced this pull request Dec 14, 2020
…n is know to single row or less rows than requested

Cherry-pick of trinodb/trino#441

Co-authored-by: praveenkrishna <praveenkrishna@tutanota.com>
fgwang7w added a commit to fgwang7w/presto that referenced this pull request Dec 28, 2020
In addition to the Cherry-pick for removing redundant Limit/TopN/Sort/DistinctLimit,
there are a few more rules added to replace any input that is zero-TopN/DistinctLimit/Limit

Cherry-pick of trinodb/trino#441

Co-authored-by: praveenkrishna <praveenkrishna@tutanota.com>
rschlussel pushed a commit to prestodb/presto that referenced this pull request Dec 28, 2020
In addition to the Cherry-pick for removing redundant Limit/TopN/Sort/DistinctLimit,
there are a few more rules added to replace any input that is zero-TopN/DistinctLimit/Limit

Cherry-pick of trinodb/trino#441

Co-authored-by: praveenkrishna <praveenkrishna@tutanota.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants