feat(optimizer): Enhance PayloadJoinOptimizer with null-check skipping, chain flattening, and LOJ reordering#27404
Conversation
Reviewer's GuideOptimizes payload join rejoin predicates by detecting non-null guarantees on join keys in scan-filter-project trees and simplifies the generated join predicates accordingly, plus adds regression coverage for the new optimization behavior. Sequence diagram for payload join optimization with non-null join keyssequenceDiagram
participant Planner
participant PayloadJoinOptimizer
participant Rewriter
participant JoinContext
Planner->>PayloadJoinOptimizer: optimize(plan)
PayloadJoinOptimizer->>Rewriter: rewriteScanFilterProject(planNode, context)
Rewriter->>Rewriter: extractNonNullVariablesFromScanFilterProject(planNode, joinKeys)
Rewriter->>JoinContext: addNonNullKeys(nonNullVars)
Planner->>PayloadJoinOptimizer: optimize join
PayloadJoinOptimizer->>Rewriter: transformJoin(keysNode, context)
Rewriter->>JoinContext: getNonNullKeys()
Rewriter->>Rewriter: build joinPredicateBuilder
Rewriter->>Rewriter: if key in nonNullKeys then equalityPredicate(newVar, var)
Rewriter->>Rewriter: else IS_NULL projection and COALESCE comparisons
Rewriter-->>Planner: new JoinNode with joinCriteria
Class diagram for updated PayloadJoinOptimizer and JoinContextclassDiagram
class PayloadJoinOptimizer {
}
class Rewriter {
+rewriteScanFilterProject(planNode, context) PlanNode
-transformJoin(keysNode, context) PlanNode
-supportedJoinKeyTypes(joinKeys) boolean
-extractNonNullVariablesFromScanFilterProject(node, joinKeys) Set~VariableReferenceExpression~
-extractNonNullVariablesRecursive(node, joinKeys, functionResolution, result) void
}
class JoinContext {
-Set~VariableReferenceExpression~ joinKeys
-Map~VariableReferenceExpression, VariableReferenceExpression~ joinKeyMap
-Map~VariableReferenceExpression, RowExpression~ projectionsToPush
-Set~VariableReferenceExpression~ nonNullKeys
-int numJoins
-PlanNode payloadNode
+reset() void
+getNumJoins() int
+needsPayloadRejoin() boolean
+getNonNullKeys() Set~VariableReferenceExpression~
+addNonNullKeys(keys) void
}
PayloadJoinOptimizer o-- Rewriter
PayloadJoinOptimizer o-- JoinContext
Rewriter --> JoinContext
Flow diagram for extracting non-null join key variables from scan-filter-projectflowchart TD
A[Start extractNonNullVariablesFromScanFilterProject] --> B[Initialize empty nonNullVars set]
B --> C[Call extractNonNullVariablesRecursive with node, joinKeys, functionResolution, result]
C --> D{Node is FilterNode?}
D -->|Yes| E[Get predicate from FilterNode]
E --> F[Extract conjuncts]
F --> G[For each conjunct]
G --> H{Conjunct is CallExpression and NOT function?}
H -->|Yes| I{Argument is IS_NULL SpecialFormExpression?}
I -->|Yes| J{IS_NULL argument is VariableReferenceExpression?}
J -->|Yes| K{Variable in joinKeys?}
K -->|Yes| L[Add variable to result]
K -->|No| M[Ignore]
J -->|No| M
I -->|No| M
H -->|No| M
M --> N[Done with conjuncts]
N --> O[Recurse on FilterNode source]
D -->|No, node is ProjectNode| P[Recurse on ProjectNode source]
D -->|No, node is TableScanNode| Q[Stop recursion]
O --> R[Return]
P --> R
Q --> R
R --> S[Build and return nonNullVars set]
S --> T[End]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The
extractNonNullVariablesRecursivelogic only detects a very specificNOT(IS_NULL(var))shape; consider handling equivalent patterns (e.g., additional casts/aliases, redundant boolean wrappers) or at least documenting that only this exact form is recognized so future changes don’t break the optimization silently. - The new test asserts optimization by comparing entire
EXPLAINoutput strings, which can be brittle as other planner rules evolve; it may be more robust to assert for the presence/absence of specific join predicates or patterns instead of full-plan string inequality.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `extractNonNullVariablesRecursive` logic only detects a very specific `NOT(IS_NULL(var))` shape; consider handling equivalent patterns (e.g., additional casts/aliases, redundant boolean wrappers) or at least documenting that only this exact form is recognized so future changes don’t break the optimization silently.
- The new test asserts optimization by comparing entire `EXPLAIN` output strings, which can be brittle as other planner rules evolve; it may be more robust to assert for the presence/absence of specific join predicates or patterns instead of full-plan string inequality.
## Individual Comments
### Comment 1
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestDistributedQueries.java" line_range="1473-1470" />
<code_context>
+ @Test
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen the plan assertion to verify that null-handling logic is actually removed, not just that the plan changed
The current test only verifies that the optimized and non-optimized EXPLAIN plans differ, which doesn’t prove that the null-handling (`IS NULL`/`COALESCE`) was actually removed. To better validate the intended behavior, assert on the explain text: e.g., confirm the non-optimized plan includes `IS NULL` (or `COALESCE`) in the join predicate while the optimized plan does not. Simple string checks like `assertTrue(explainNoOpt.contains("IS NULL"));` and `assertFalse(explainWithOpt.contains("IS NULL"));` would more directly ensure we’re testing null-check removal rather than an unrelated plan change.
Suggested implementation:
```java
assertNotEquals(explainWithOpt, explainNoOpt);
// Strengthen the assertion to verify that null-handling logic is actually removed
assertTrue(
explainNoOpt.contains("IS NULL") || explainNoOpt.contains("COALESCE"),
"Expected non-optimized plan to contain null-handling in the join predicate");
assertFalse(
explainWithOpt.contains("IS NULL") || explainWithOpt.contains("COALESCE"),
"Expected optimized plan to have null-handling removed from the join predicate");
```
I assumed the existing test already computes two EXPLAIN plans as `String explainNoOpt` and `String explainWithOpt`, and already had a line `assertNotEquals(explainWithOpt, explainNoOpt);`. If the variable names differ, or if the inequality assertion is written differently, adjust the SEARCH section accordingly to match the existing code.
If the EXPLAIN text does not literally include `"IS NULL"` or `"COALESCE"` in your specific query, update those substrings to whatever expression is actually used for null-handling in the join predicate so the strengthened assertions match the real plan text.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| @@ -1470,6 +1470,43 @@ public void testPayloadJoinCorrectness() | |||
| } | |||
There was a problem hiding this comment.
suggestion (testing): Strengthen the plan assertion to verify that null-handling logic is actually removed, not just that the plan changed
The current test only verifies that the optimized and non-optimized EXPLAIN plans differ, which doesn’t prove that the null-handling (IS NULL/COALESCE) was actually removed. To better validate the intended behavior, assert on the explain text: e.g., confirm the non-optimized plan includes IS NULL (or COALESCE) in the join predicate while the optimized plan does not. Simple string checks like assertTrue(explainNoOpt.contains("IS NULL")); and assertFalse(explainWithOpt.contains("IS NULL")); would more directly ensure we’re testing null-check removal rather than an unrelated plan change.
Suggested implementation:
assertNotEquals(explainWithOpt, explainNoOpt);
// Strengthen the assertion to verify that null-handling logic is actually removed
assertTrue(
explainNoOpt.contains("IS NULL") || explainNoOpt.contains("COALESCE"),
"Expected non-optimized plan to contain null-handling in the join predicate");
assertFalse(
explainWithOpt.contains("IS NULL") || explainWithOpt.contains("COALESCE"),
"Expected optimized plan to have null-handling removed from the join predicate");I assumed the existing test already computes two EXPLAIN plans as String explainNoOpt and String explainWithOpt, and already had a line assertNotEquals(explainWithOpt, explainNoOpt);. If the variable names differ, or if the inequality assertion is written differently, adjust the SEARCH section accordingly to match the existing code.
If the EXPLAIN text does not literally include "IS NULL" or "COALESCE" in your specific query, update those substrings to whatever expression is actually used for null-handling in the join predicate so the strengthened assertions match the real plan text.
ca99a0f to
9a5fbb3
Compare
cc5071f to
bc767b0
Compare
|
@feilong-liu Could you take a look at this PR when you get a chance? It enhances the PayloadJoinOptimizer with null-check skipping for non-null join keys, chain flattening through intervening projections/cross joins, and LOJ reordering. Thanks! |
785e1d2 to
95d6ff6
Compare
…g, chain flattening, and LOJ reordering
95d6ff6 to
347142e
Compare
|
@feilong-liu All CI checks are green — could you review and merge this when you get a chance? Thanks! |
…g, chain flattening, and LOJ reordering (prestodb#27404) ## Summary Improves the PayloadJoinOptimizer in three areas: **1. Skip null checks for non-null join keys** - When a join key has an upstream `WHERE key IS NOT NULL` predicate, the payload rejoin now uses a direct equality predicate instead of generating `IS_NULL` projections and `COALESCE` comparisons - Detects `NOT(IS_NULL(var))` patterns in FilterNode predicates within the scan-filter-project tree - Reduces plan complexity and runtime overhead for queries with non-null join key guarantees **2. Flatten LOJ chains through intervening nodes (pre-pass)** - Removes identity projections (e.g., from subqueries) that break LOJ chains, exposing the full chain to the optimizer - Hoists cross joins from within LOJ chains to above them, so the LOJ chain remains contiguous - Correctly handles cases where cross join or projection columns are used as join keys in subsequent LOJs **3. Reorder LOJ chains to maximize optimization** - Classifies LOJs as "base-keyed" (keys from base table) or "dependent" (keys from another LOJ's output) - Reorders so base-keyed LOJs come first in the chain, maximizing the number of joins the payload optimization can cover - Only triggers when chain has 3+ joins and reordering would move at least 2 base-keyed joins together ## Release Notes No user-facing changes. Internal optimization improvements to the PayloadJoinOptimizer. ## Test plan - [x] Added e2e integration tests in `AbstractTestDistributedQueries` comparing results with optimization enabled vs disabled - [x] Tests cover null-check skipping: both keys non-null, single key non-null, IS NOT NULL combined with other predicates - [x] Tests cover chain flattening: identity projections from subqueries, cross joins between LOJs, non-identity projections computing join keys, cross join columns used as join keys - [x] Tests cover LOJ reordering: dependent LOJ blocking base-keyed LOJ, multiple base-keyed LOJs separated by dependent ones - [x] Existing payload join tests (`testPayloadJoinApplicability`, `testPayloadJoinCorrectness`) continue to pass - [x] Full `TestTpchDistributedQueries` and `TestLocalQueries` pass - [x] Verified compilation and checkstyle pass on `presto-main-base` and `presto-tests`
Summary
Improves the PayloadJoinOptimizer in three areas:
1. Skip null checks for non-null join keys
WHERE key IS NOT NULLpredicate, the payload rejoin now uses a direct equality predicate instead of generatingIS_NULLprojections andCOALESCEcomparisonsNOT(IS_NULL(var))patterns in FilterNode predicates within the scan-filter-project tree2. Flatten LOJ chains through intervening nodes (pre-pass)
3. Reorder LOJ chains to maximize optimization
Release Notes
No user-facing changes. Internal optimization improvements to the PayloadJoinOptimizer.
Test plan
AbstractTestDistributedQueriescomparing results with optimization enabled vs disabledtestPayloadJoinApplicability,testPayloadJoinCorrectness) continue to passTestTpchDistributedQueriesandTestLocalQueriespasspresto-main-baseandpresto-tests