
Cardinality of filter Optimization #13721

Closed
kaikalur wants to merge 2 commits into prestodb:master from kaikalur:cardinality_of_filter

@kaikalur kaikalur commented Nov 19, 2019


== RELEASE NOTES ==

General Changes
* We now optimize the following patterns:
  * CARDINALITY(FILTER(a, x -> f(x)))   ===>  REDUCE(a, cast(0 as bigint), (s, x) -> case when f(x) then s + 1 else s end, s -> s)
  * CARDINALITY(FILTER(a, x -> f(x))) > 0   ===> REDUCE(a, false, (s, x) -> case when s then s else f(x) end, s -> s)
  * CARDINALITY(FILTER(a, x -> f(x))) = 0   ===> REDUCE(a, false, (s, x) -> case when s then s else f(x) end, s -> NOT s)
  * CARDINALITY(FILTER(a, x -> f(x))) <comparison> n   ===> REDUCE(a, cast(0 as bigint), (s, x) -> case when s <comparison> n then s when f(x) then s + 1 else s end, s -> s <comparison> n)

This is controlled by the session property simplify_array_operations, which defaults to true.
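
To make the four rewrites easier to check by hand, here is a minimal Python model of them. It is only a sketch: `presto_reduce` and the other function names are hypothetical stand-ins, SQL NULL handling is ignored, and rule 4 is exercised only with `>` and `>=`, where the comparison stays true once it becomes true. It makes no claim about other comparison operators or about the actual Java implementation in this PR.

```python
from functools import reduce
import operator


def cardinality_of_filter(a, f):
    """Baseline: materialize the filtered array, then count its elements."""
    return len([x for x in a if f(x)])


def presto_reduce(a, initial, input_fn, output_fn):
    """Hypothetical model of REDUCE(array, initialState, inputFunction, outputFunction)."""
    return output_fn(reduce(input_fn, a, initial))


def count_matches(a, f):
    # Rule 1: CARDINALITY(FILTER(a, f)) without building the intermediate array.
    return presto_reduce(a, 0, lambda s, x: s + 1 if f(x) else s, lambda s: s)


def any_matches(a, f):
    # Rule 2: CARDINALITY(FILTER(a, f)) > 0; f is skipped once the state is true.
    return presto_reduce(a, False, lambda s, x: s if s else f(x), lambda s: s)


def no_matches(a, f):
    # Rule 3: CARDINALITY(FILTER(a, f)) = 0; same state as rule 2, negated at the end.
    return presto_reduce(a, False, lambda s, x: s if s else f(x), lambda s: not s)


def count_compared(a, f, compare, n):
    # Rule 4: stop counting once `compare(s, n)` holds, then apply the comparison.
    def step(s, x):
        if compare(s, n):
            return s
        return s + 1 if f(x) else s

    return presto_reduce(a, 0, step, lambda s: compare(s, n))
```

For example, with `a = [1, 5, 8, 12, 3]` and `f = lambda x: x > 4`, `count_matches(a, f)` agrees with `cardinality_of_filter(a, f)`, and `count_compared(a, f, operator.gt, 2)` agrees with `cardinality_of_filter(a, f) > 2`.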


linux-foundation-easycla bot commented Nov 19, 2019

CLA Check
One or more committers are not authorized under a signed CLA as indicated below. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.


wenleix commented Nov 20, 2019

Nice idea for an expression optimization. @shixuan-fan, @hellium01, could you help with a first round of review? This helps us enlarge the reviewer pool and gets you familiar with more parts of the code.

Speaking of commit conventions: in the Presto code base we always rebase and never merge.

@shixuan-fan shixuan-fan left a comment

This is a clever optimization :)

Contributor

Maybe we could use getOnlyElement(functionCall.getArguments()) assuming cardinality should only have one argument.

I'm not entirely sure about this because if cardinality argument count ever changes, we will fail the query, and the current code would simply ignore this optimization. Let's see what others think.

Contributor

Ditto to this. cardinality should always have a single argument; if not, we should throw.

Contributor Author

I don't like throwing from the optimizer. This is just an opportunistic optimization, so we should be defensive and match only the exact pattern we want.
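
The defensive style argued for here can be sketched in a few lines of Python. Everything in this block is an assumption made for illustration: the `Call` node, the `rewrite_cardinality_of_filter` function, and the `count_matches` target name are hypothetical, and matching on the function name is used only for brevity (a later comment in this thread asks for a function-resolution-based check instead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical expression node: a function name plus its arguments."""
    name: str
    args: tuple


def rewrite_cardinality_of_filter(expr):
    """Rewrite cardinality(filter(a, f)) and return anything else unchanged.

    The rule never raises: if the shape is not exactly the one expected,
    the optimization is skipped and the original expression is returned.
    """
    if not isinstance(expr, Call) or expr.name != "cardinality" or len(expr.args) != 1:
        return expr
    inner = expr.args[0]
    if not isinstance(inner, Call) or inner.name != "filter" or len(inner.args) != 2:
        return expr
    array, predicate = inner.args
    # "count_matches" is a placeholder for the REDUCE-based replacement.
    return Call("count_matches", (array, predicate))
```

An expression such as `Call("cardinality", ("a", "b"))`, which has an unexpected argument count, simply passes through untouched rather than failing the query.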

Contributor

Theoretically you don't need to check that here: function resolution would have failed if the number of arguments were not 1. I agree with @hellium01's earlier point that this should be added to RowExpressionRewriteRuleSet. When you do that, please use the proper API (add isCardinalityFunction and isFilterFunction to StandardFunctionResolution) rather than relying on the name (CallExpression's name is used only for display purposes).

Contributor

Let's just name this whenClauses.

Contributor

This is inputFunction I believe?

Contributor

Let's also restructure this method into the following layout to make review easier, and rename the variables following the comments in simplifyCardinalityOfFilterComparedToZero:

- create cast (not required if it could be inlined)
- create case statement
- create input function
- create output function
- create reduce function

@shixuan-fan
Contributor

Let's rename the commit title and PR title to "Optimize cardinality of filter expression" and put benchmark results in commit message. Also, is there a benchmark class (like BenchmarkArrayAggregation) that you are using?

@hellium01 hellium01 left a comment

This is a really smart optimization :). However, we probably need to port it to RowExpressionRewriteRuleSet, as ExpressionRewriteRuleSet is being deprecated...

Contributor

Can you rebase and squash the unrelated commits?

Contributor

Also, please follow: https://github.com/prestodb/presto/wiki/Presto-Development-Guidelines
If you are using IntelliJ (which I highly recommend), definitely import the style file from here: https://github.com/airlift/codestyle.

Contributor Author

I thought I had the airlift code style. I will double-check.

@hellium01 hellium01 Nov 21, 2019

SIMPLIFY_ARRAY_OPERATIONS is probably a little too general; OPTIMIZE_FILTER_CARDINALITY would probably be more specific for users?

Contributor

Alternative question to ask: Do you want this flag to guard all future optimization rules that would be added to arrays?

Contributor Author

> Alternative question to ask: Do you want this flag to guard all future optimization rules that would be added to arrays?

Yeah, I was thinking about that. It's a tricky balance between having too many flags and controlling what is enabled. Since most of these optimizations are opportunistic and hopefully don't need to be turned off frequently, I thought one was enough. But I can change it if there is a strong reason.

Contributor

It would probably be clearer to users if the session property description stated that it optimizes cardinality(filter(a, x -> f(x))) expressions.

Contributor Author

Most users won't even understand what that means. This flag is only for any unforeseen bugs.

Contributor

You can use VariableAllocator to make sure there is no conflict.

Contributor

We don't have to worry about this if we are operating on RowExpression :)

@hellium01
Contributor

From another perspective, users may write cardinality(filter(...)) because we lack functions such as:

any(a, predicate)
count_on(a, predicate)
...

so another solution might be to create such functions?
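
A rough Python sketch of what the two proposed functions would compute. The names `any` and `count_on` come from the comment above; the signatures and semantics shown here are assumptions, `any_of` is renamed only to avoid shadowing Python's builtin, and SQL NULL handling is ignored.

```python
def any_of(a, predicate):
    """Assumed meaning of any(a, predicate): true if some element matches.

    Python's builtin any() stops at the first match, so no intermediate
    filtered array is built and later elements are never inspected.
    """
    return any(predicate(x) for x in a)


def count_on(a, predicate):
    """Assumed meaning of count_on(a, predicate): number of matching elements."""
    return sum(1 for x in a if predicate(x))
```

Under these assumptions, `any_of(a, p)` would stand in for `cardinality(filter(a, p)) > 0` and `count_on(a, p)` for `cardinality(filter(a, p))`.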

@kaikalur
Contributor Author

Yes that will be a next step but we have several major pipelines using this construct so this will have immediate benefit.

@kaikalur
Contributor Author

> Let's rename the commit title and PR title to "Optimize cardinality of filter expression" and put benchmark results in commit message. Also, is there a benchmark class (like BenchmarkArrayAggregation) that you are using?

This is a tricky one to measure because it's not an operator. We could add a benchmark test that loops through this operation a thousand times, but I'm not sure about it.


wenleix commented Nov 21, 2019

@kaikalur: The first optimization rule is definitely good for avoiding unnecessary array construction. I am wondering whether the second and third optimization rules are necessary, i.e.

  * CARDINALITY(FILTER(a, x -> f(x))) > 0   ===> REDUCE(a, false, (s,x)->case when s then s else f(x) end, s -> s)
  * CARDINALITY(FILTER(a, x -> f(x))) = 0   ===> REDUCE(a, false, (s,x)->case when s then s else f(x) end, s -> NOT s)

In my opinion we should just backport any_match from PrestoSQL: https://prestosql.io/docs/current/functions/array.html#any_match .

And I am not sure I understand how the fourth optimization rule will be helpful, i.e.

  * CARDINALITY(FILTER(a, x -> f(x))) <comparison> n   ===> REDUCE(a, cast(0 as bigint), (s,x)->case when s <comparison> n then s when f(x) then s + 1 else s end, s -> s <comparison> n)

With the first optimization rule, we already avoid unnecessary array construction, and even with this rule we still need to go through the whole array. My intuition is that the overhead of the reduce function itself will dominate, so for now we probably don't need to optimize for this "+1" operation? :)

@kaikalur
Contributor Author

> In my opinion we should just backport any_match from PrestoSQL: https://prestosql.io/docs/current/functions/array.html#any_match .

Technically, anymatch works only for > 0, for = 0 it doesn't short circuit to return false on the first match.

@kaikalur
Contributor Author

> And my intuition is the overhead of reduce function itself will dominate and for now we probably don't need to optimize for this "+1" operation? :)

It will stop on the first time the comparison succeeds so if you have a large array, it will benefit
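
One way to see what is and is not saved is to count calls in a small Python model of the fourth rule, specialized to `>`. This is a sketch under assumptions: `count_predicate_calls` is a hypothetical helper, the predicate (even numbers) is arbitrary, and Python's `functools.reduce` stands in for Presto's REDUCE. In this model the step function still runs once per element; what is skipped after the comparison first succeeds is the evaluation of the predicate.

```python
from functools import reduce


def count_predicate_calls(a, n):
    """Evaluate `cardinality(filter(a, is_even)) > n` via the rule-4 shape.

    Returns (result, predicate_calls, step_calls) so the amount of work
    done by each part can be compared.
    """
    calls = {"predicate": 0, "step": 0}

    def is_even(x):
        calls["predicate"] += 1
        return x % 2 == 0

    def step(s, x):
        calls["step"] += 1
        if s > n:          # comparison already succeeded: keep the state as-is
            return s
        return s + 1 if is_even(x) else s

    result = reduce(step, a, 0) > n
    return result, calls["predicate"], calls["step"]
```

For `count_predicate_calls(range(1000), 2)` the predicate is evaluated only for the first five elements, while the step function is still invoked for all 1000.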


wenleix commented Nov 21, 2019

all_match(), any_match(), and none_match() are being backported in #13734


wenleix commented Nov 21, 2019

@kaikalur :

> Technically, anymatch works only for > 0, for = 0 it doesn't short circuit to return false on the first match.

We have any_match, all_match and none_match 😄 . I believe it should cover this case ~

@kaikalur
Contributor Author

> @kaikalur :
>
> > Technically, anymatch works only for > 0, for = 0 it doesn't short circuit to return false on the first match.
>
> We have any_match, all_match and none_match 😄 . I believe it should cover this case ~

Too much code, yuck! I would rather have a better short-circuiting reducer. But maybe I will backport them for now.

@kaikalur
Contributor Author

> > @kaikalur :
> >
> > > Technically, anymatch works only for > 0, for = 0 it doesn't short circuit to return false on the first match.
> >
> > We have any_match, all_match and none_match 😄 . I believe it should cover this case ~
>
> Too much code, yuck! I would rather have a better short-circuiting reducer. But maybe I will backport them for now.

And also NoneMatch just calls AnyMatch! I'm not backporting that. It's not a performant implementation.

@rongrong
Contributor

We should also add a SQL function for this one!

@rongrong rongrong left a comment

Looking at the implementation I think it would be much simpler if we add a SQL function and just rewrite it with the SQL function. Maybe you want to switch to work on supporting SQL function as builtin functions instead? 🤣

@kaikalur
Contributor Author

Regarding doing it for RowExpressions: I would say no. The idea is to move these rewrites up the stack to just after parse/resolve, and eventually they will be string/template-based: parse-resolve-(unparse(rewrite)-parse-resolve)*, so people get to see the query structure and can do more interesting things. I'm doing this one for now because it's a widely used pattern.

kaikalur force-pushed the cardinality_of_filter branch from 964421a to e403600 on December 3, 2019 19:54
kaikalur closed this on Dec 4, 2019
kaikalur force-pushed the cardinality_of_filter branch from e403600 to 6b3c211 on December 4, 2019 18:00
kaikalur reopened this on Dec 6, 2019
kaikalur force-pushed the cardinality_of_filter branch 3 times, most recently from e48d1f3 to b8f7656 on December 6, 2019 00:56
kaikalur force-pushed the cardinality_of_filter branch 7 times, most recently from bbbb1d9 to e5528c0 on December 6, 2019 01:28
kaikalur force-pushed the cardinality_of_filter branch 2 times, most recently from c067cf9 to d4e6d42 on December 13, 2019 22:25
kaikalur force-pushed the cardinality_of_filter branch 3 times, most recently from 11de632 to 1a7a3c1 on December 13, 2019 23:28

6 participants