Extend payload join optimization for cases without map columns; fix b…#19904
Conversation
b93b8ec to
42a152a
Compare
49a7fb1 to
341c8fb
Compare
69d1dbb to
7b93ac7
Compare
rschlussel
left a comment
There was a problem hiding this comment.
Still reviewing the second commit. Can you add more context to the commit message about what the problem is/why it's a problem?
Also, can you make sure all of your commit messages follow the guidelines here: https://cbea.ms/git-commit/
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
7b79b50 to
72950f2
Compare
cd731ee to
97f7c45
Compare
rschlussel
left a comment
There was a problem hiding this comment.
I find the flow with all the context setting a bit confusing (even before this change), so I have a hard time being confident about the change.
It would be good to have more tests for cases that we don't expect the optimizer to kick in (e.g. left join with nested right or inner joins). Also, looks like the tests for correctness just validate the row count, but not the values or even column types. Would be good to have tests do a regular assertQuery correctness test for all the test queries to fully validate the results.
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
f508f23 to
29b2c1c
Compare
7b33404 to
1e86fea
Compare
The logic is basically as follows:
Regarding testing, I'd prefer to create a verifier suite (like this one that I used to debug the remaining issues: https://www.internalfb.com/intern/presto/verifier/results/?suite=test_payload_0808) as it gets harder and harder to enumerate all combinations of joins and unary operators |
bc1fffe to
e9a49ea
Compare
|
Note: will squash the commits after review and will update the commit description |
rschlussel
left a comment
There was a problem hiding this comment.
some of the intermediate nodes may require columns from the payload table that are not join keys (and the Aggregate hid them): if that is the case I abort the optimization and try from scratch with a new context. I refactored this logic to use a utility function usedColumns that returns the necessary columns for each node type, and use that in validateUsedColumnsForUnaryNode
I wonder if it's possible to have the initial rewrite take into account the required columns to begin with. It seems not ideal that the rewrite can produce a "wrong" plan that we need to undo afterwards.
Regarding testing, I'd prefer to create a verifier suite (like this one that I used to debug the remaining issues: https://www.internalfb.com/intern/presto/verifier/results/?suite=test_payload_0808) as it gets harder and harder to enumerate all combinations of joins and unary operators
The verifier suite is great, but it's only available internally at Meta, and also doesn't run automatically with every PR as someone develops. Of course it's impossible to be 100% comprehensive, but we do also need sufficient correctness tests in the codebase itself (of both the plan and the results).
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/sql/planner/optimizations/PayloadJoinOptimizer.java
Outdated
Show resolved
Hide resolved
during top-down traversal when we explore a node we might not have encountered all the join keys yet. So we won't know if a column used in an operator is a join key from below, or is not a join key and will be hidden by the rewrite. |
But we'll know when we explore the child joins that A is a needed output column for a later node, so we should be able to choose not to rewrite at that point. |
That's a great point - I ran into this in another optimization where I wanted to know the required columns from the top node (and that one was a rule-based optimization so things were even harder). I think it's worth having built-in support for passing required properties (such as required columns) from parent operators and making it a first-class citizen in the optimizer. |
c3583ee to
2bc6828
Compare
Added more tests to cover for the various operators where we abort the transformation. Also added assertQuery tests to check for correctness of the result |
2bc6828 to
813e033
Compare
813e033 to
a565008
Compare
a565008 to
6c8e6b3
Compare
6c8e6b3 to
6e4674a
Compare
There was a problem hiding this comment.
The comment says for binary and n-ary, but the code would do the same for unary operators as well.
There was a problem hiding this comment.
recently fixed that, forgot to update/remove the comment
There was a problem hiding this comment.
It seems this is only called when plan is FilterNode. Why do we need to handle other cases?
There was a problem hiding this comment.
I narrowed down the supported cases to only filter/project/scan + join, will clean up the implementation
There was a problem hiding this comment.
Can calling rewriter multiple times from a node lead to high time complexity?
There was a problem hiding this comment.
ideally, we want to do this upfront and not have to discard plans after they have been computed - I'd prefer to do this in a separate PR as this one was meant to fix a large number of the original bugs with the optimization (and the optimization is disabled by default so no latency impact on production workloads at the moment)
ba00fd4 to
b74eac7
Compare
0953a2a to
475ee94
Compare
1698753 to
e837b7c
Compare
…fixes Several fixes and improvements related to payload join optimization: - enable optimization even if the fact table doesn't have any map or array columns - abort rewrite if there are operators other than select/project/filter/join as it may introduce wrong results - abort rewrite if collected join keys contain a column from the RHS of the current join SELECT l.* FROM lineitem l left join orders o on (l.orderkey = o.orderkey) left join part p on (l.partkey=p.partkey) left join (select 1 as m) mm on p.partkey=mm.m The join key from the last join condition (p.partkey) cannot be pushed to the left-most table in the join (lineitem). This change takes care of only propagating those join keys that come from the LHS of the join. - other fixes in propagating optimization context across operators - additional tests
e837b7c to
b963b41
Compare
A collection of bug fixes for the payload join optimization, including the following: