feat(optimizer): Push projections through cross joins#27366
feat(optimizer): Push projections through cross joins#27366feilong-liu merged 1 commit intoprestodb:masterfrom
Conversation
Reviewer's GuideIntroduces a new iterative optimizer rule PushProjectionThroughCrossJoin, gated by a session property, to push deterministic single-side projections below cross joins for performance, with corresponding planner rule tests, query integration tests, configuration wiring, and a Hive benchmark. Sequence diagram for applying PushProjectionThroughCrossJoin during query planningsequenceDiagram
participant Client
participant Planner
participant IterativeOptimizer
participant PushProjectionThroughCrossJoin as Rule_PushProjectionThroughCrossJoin
participant Session
participant SystemSessionProperties
Client->>Planner: Submit query with session
Planner->>IterativeOptimizer: optimize(plan, session)
loop optimization_iterations
IterativeOptimizer->>Rule_PushProjectionThroughCrossJoin: getPattern()
IterativeOptimizer->>IterativeOptimizer: find Project over CrossJoin
alt pattern_matches
IterativeOptimizer->>Rule_PushProjectionThroughCrossJoin: isEnabled(session)
Rule_PushProjectionThroughCrossJoin->>SystemSessionProperties: isPushProjectionThroughCrossJoin(session)
SystemSessionProperties-->>Rule_PushProjectionThroughCrossJoin: enabled_flag
alt enabled_flag == true
IterativeOptimizer->>Rule_PushProjectionThroughCrossJoin: apply(project_over_cross_join, captures, context)
Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: classify assignments (left_only/right_only/mixed)
Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: skip non_deterministic and identity
Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: create left Project under CrossJoin
Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: create right Project under CrossJoin
Rule_PushProjectionThroughCrossJoin-->>IterativeOptimizer: Result.ofPlanNode(new_plan)
IterativeOptimizer->>IterativeOptimizer: replace subplan with new_plan
else enabled_flag == false
Rule_PushProjectionThroughCrossJoin-->>IterativeOptimizer: Result.empty()
end
else no_match
IterativeOptimizer->>IterativeOptimizer: skip rule
end
end
IterativeOptimizer-->>Planner: optimized_plan
Planner-->>Client: execute optimized_plan
Class diagram for the PushProjectionThroughCrossJoin optimizer ruleclassDiagram
class Rule {
<<interface>>
+getPattern() Pattern
+isEnabled(session Session) boolean
+apply(project ProjectNode, captures Captures, context Context) Result
}
class PushProjectionThroughCrossJoin {
-Capture CHILD
-Pattern PATTERN
-DeterminismEvaluator determinismEvaluator
+PushProjectionThroughCrossJoin(functionAndTypeManager FunctionAndTypeManager)
+getPattern() Pattern
+isEnabled(session Session) boolean
+apply(project ProjectNode, captures Captures, context Context) Result
-computeVariablesNeededFromSide(topAssignments Assignments, sideVariables Set) Set
-createChildProjectIfNeeded(context Context, child PlanNode, assignments Assignments) PlanNode
}
class ProjectNode {
+getAssignments() Assignments
+getSource() PlanNode
+getLocality() Locality
}
class JoinNode {
+isCrossJoin() boolean
+getLeft() PlanNode
+getRight() PlanNode
+getCriteria() List
+getFilter() RowExpression
+getLeftHashVariable() VariableReferenceExpression
+getRightHashVariable() VariableReferenceExpression
+getDistributionType() Object
+getDynamicFilters() Object
+getOutputVariables() List~VariableReferenceExpression~
}
class PlanNode {
+getOutputVariables() List~VariableReferenceExpression~
+getSourceLocation() Object
}
class Assignments {
+entrySet() Set
+getExpressions() List~RowExpression~
+builder() AssignmentsBuilder
}
class AssignmentsBuilder {
+put(outputVar VariableReferenceExpression, expression RowExpression) AssignmentsBuilder
+build() Assignments
}
class DeterminismEvaluator {
+isDeterministic(expression RowExpression) boolean
}
class RowExpressionDeterminismEvaluator {
+RowExpressionDeterminismEvaluator(functionAndTypeManager FunctionAndTypeManager)
}
class VariableReferenceExpression {
}
class RowExpression {
}
class Context {
+getIdAllocator() IdAllocator
}
class IdAllocator {
+getNextId() Object
}
class Session {
}
class SystemSessionProperties {
+isPushProjectionThroughCrossJoin(session Session) boolean
}
class FunctionAndTypeManager {
}
class Pattern {
}
class Capture {
}
class Captures {
+get(capture Capture) JoinNode
}
class Result {
+empty() Result
+ofPlanNode(planNode PlanNode) Result
}
Rule <|.. PushProjectionThroughCrossJoin
DeterminismEvaluator <|-- RowExpressionDeterminismEvaluator
PushProjectionThroughCrossJoin --> DeterminismEvaluator
PushProjectionThroughCrossJoin --> Pattern
PushProjectionThroughCrossJoin --> Capture
PushProjectionThroughCrossJoin --> ProjectNode
PushProjectionThroughCrossJoin --> JoinNode
PushProjectionThroughCrossJoin --> Assignments
PushProjectionThroughCrossJoin --> PlanNode
PushProjectionThroughCrossJoin --> Context
PushProjectionThroughCrossJoin --> Captures
PushProjectionThroughCrossJoin --> Result
PushProjectionThroughCrossJoin --> Session
PushProjectionThroughCrossJoin --> FunctionAndTypeManager
SystemSessionProperties --> Session
RowExpressionDeterminismEvaluator --> FunctionAndTypeManager
Assignments --> AssignmentsBuilder
Context --> IdAllocator
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
PushProjectionThroughCrossJoin.apply, each projection expression is passed throughextractUniqueonce during classification and again incomputeVariablesNeededFromSide; consider caching the referenced variable sets per expression to avoid redundant scans of the expression tree in larger plans. - Currently constant expressions are always kept above the join; if you want to maximize pushdown, you could treat constants as pushable to either side (e.g., left by convention) since they don't depend on join variables and are deterministic by construction.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `PushProjectionThroughCrossJoin.apply`, each projection expression is passed through `extractUnique` once during classification and again in `computeVariablesNeededFromSide`; consider caching the referenced variable sets per expression to avoid redundant scans of the expression tree in larger plans.
- Currently constant expressions are always kept above the join; if you want to maximize pushdown, you could treat constants as pushable to either side (e.g., left by convention) since they don't depend on join variables and are deterministic by construction.
## Individual Comments
### Comment 1
<location path="presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java" line_range="225-230" />
<code_context>
+ return needed.build();
+ }
+
+ private static PlanNode createChildProjectIfNeeded(
+ Context context,
+ PlanNode child,
+ Assignments assignments)
+ {
+ // If assignments are all identity for the child's existing outputs, skip
+ if (assignments.entrySet().stream().allMatch(
+ e -> e.getValue() instanceof VariableReferenceExpression
+ && e.getValue().equals(e.getKey()))) {
</code_context>
<issue_to_address>
**suggestion (performance):** Identity-only detection should also check that the assignment keys match the child outputs to enable column pruning.
Currently the early-return only verifies that each assignment is an identity, but not that the assignment keys cover all of the child’s output variables. If `assignments` is a strict subset of the child outputs (i.e., we’re pruning unused symbols), this will incorrectly skip creating a projection and miss the chance to narrow the output. Consider requiring `assignments.size() == child.getOutputVariables().size()` in addition to the identity check, or explicitly verifying that the assignment key set equals the child’s output variables, so we still build a project when pruning columns.
```suggestion
// If assignments are all identity and cover exactly the child's existing outputs, skip
if (assignments.size() == child.getOutputVariables().size()
&& assignments.entrySet().stream().allMatch(
e -> e.getValue() instanceof VariableReferenceExpression
&& e.getValue().equals(e.getKey())
&& child.getOutputVariables().contains(e.getKey()))) {
return child;
}
```
</issue_to_address>
### Comment 2
<location path="presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java" line_range="158-167" />
<code_context>
+
+ // Build left child: add identity for all left variables needed by the top project
+ // plus any left-only computed expressions
+ Set<VariableReferenceExpression> leftVarsNeededAbove = computeVariablesNeededFromSide(
+ topProjections.build(), leftVariables);
+ for (VariableReferenceExpression var : leftVarsNeededAbove) {
+ leftProjections.put(var, var);
+ }
+ PlanNode newLeft = createChildProjectIfNeeded(
+ context, crossJoin.getLeft(), leftProjections.build());
+
+ // Build right child: add identity for all right variables needed by the top project
+ // plus any right-only computed expressions
+ Set<VariableReferenceExpression> rightVarsNeededAbove = computeVariablesNeededFromSide(
+ topProjections.build(), rightVariables);
+ for (VariableReferenceExpression var : rightVarsNeededAbove) {
</code_context>
<issue_to_address>
**nitpick (performance):** Avoid rebuilding the same top assignments map twice when computing needed variables.
`topProjections.build()` allocates an immutable map each time. Build it once, store it in a local, and reuse it for both `computeVariablesNeededFromSide` calls to avoid the extra allocation and iteration.
</issue_to_address>
### Comment 3
<location path="presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestPushProjectionThroughCrossJoin.java" line_range="130" />
<code_context>
+ }
+
+ @Test
+ public void testDoesNotFireOnMixedProjections()
+ {
+ // All projections reference both sides — nothing to push
</code_context>
<issue_to_address>
**suggestion (testing):** Add an explicit test for constant-only projections to lock in the "do not push constants" behavior.
The rule treats expressions that reference both sides or are constants as non-pushable. We already cover the mixed `(a + b)` case in `testDoesNotFireOnMixedProjections`, but we’re missing a constant-only case (e.g., `BIGINT '42'`). Please add a test like `testDoesNotFireOnConstantProjection` that builds a Project with only a constant expression and asserts `.doesNotFire()` to capture this behavior.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
...main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java
Outdated
Show resolved
Hide resolved
...main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| @Test | ||
| public void testDoesNotFireOnMixedProjections() |
There was a problem hiding this comment.
suggestion (testing): Add an explicit test for constant-only projections to lock in the "do not push constants" behavior.
The rule treats expressions that reference both sides or are constants as non-pushable. We already cover the mixed (a + b) case in testDoesNotFireOnMixedProjections, but we’re missing a constant-only case (e.g., BIGINT '42'). Please add a test like testDoesNotFireOnConstantProjection that builds a Project with only a constant expression and asserts .doesNotFire() to capture this behavior.
feb3190 to
347c984
Compare
|
@feilong-liu Could you take a look at this PR when you get a chance? It adds a new optimizer rule to push projections through cross joins. Thanks! |
steveburnett
left a comment
There was a problem hiding this comment.
Please include documentation in this PR for the new session property push_projection_through_cross_join.
As described in Designing Your Code in CONTRIBUTING.md:
"All new language features, new functions, session and config properties, and major features have documentation added"
347c984 to
eba1024
Compare
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Pull branch, local doc build, looks good. Thank you!
e76ce1e to
6b584aa
Compare
Summary
PushProjectionThroughCrossJointhat pushes single-side projections below cross joinsrandom()) are correctly excluded from pushdown to preserve semanticspush_projection_through_cross_join(disabled by default)Example transformation
Benchmark results (lineitem CROSS JOIN nation with regex projections)
Test plan
TestPushProjectionThroughCrossJoincovering:AbstractTestQueries.testPushProjectionThroughCrossJoinexecuting real queries with optimization on vs offTestLocalQueriestests passBenchmarkPushProjectionThroughCrossJoinusingHiveDistributedBenchmarkRunner== NO RELEASE NOTE ==
Summary by Sourcery
Add a new optimizer rule that conditionally pushes deterministic, single-side projections below cross joins, controlled by a session property, and validate it with planner tests, query tests, and a benchmark.
New Features:
Tests: