feat(optimizer): Push projections through cross joins#27366

Merged: feilong-liu merged 1 commit into prestodb:master from kaikalur:push-projection-through-cross-join on Mar 27, 2026

@kaikalur kaikalur commented Mar 18, 2026

Summary

  • Adds a new iterative optimizer rule PushProjectionThroughCrossJoin that pushes single-side projections below cross joins
  • Expressions referencing only one side of the join are computed on the smaller input table instead of the much larger cross-joined result
  • Non-deterministic expressions (e.g. random()) are correctly excluded from pushdown to preserve semantics
  • Gated by session property push_projection_through_cross_join (disabled by default)
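
To make the cost argument in the second bullet concrete, here is a rough sketch that counts only expression evaluations. The class name `CrossJoinEvalCount` and the row counts are hypothetical and are not taken from the PR or its benchmark.

```java
// Hypothetical illustration: counts how often single-side expressions are
// evaluated before and after they are pushed below a cross join.
final class CrossJoinEvalCount
{
    private CrossJoinEvalCount() {}

    // Above the join, every single-side expression runs once per joined row.
    static long evaluationsAboveJoin(long leftRows, long rightRows, int leftExprs, int rightExprs)
    {
        return leftRows * rightRows * (leftExprs + rightExprs);
    }

    // Below the join, each expression runs once per row of its own input.
    static long evaluationsBelowJoin(long leftRows, long rightRows, int leftExprs, int rightExprs)
    {
        return leftRows * leftExprs + rightRows * rightExprs;
    }

    public static void main(String[] args)
    {
        long leftRows = 1_000_000L; // assumed row counts, for illustration only
        long rightRows = 25L;
        System.out.println(evaluationsAboveJoin(leftRows, rightRows, 1, 1));
        System.out.println(evaluationsBelowJoin(leftRows, rightRows, 1, 1));
    }
}
```

With these assumed sizes the counts are 50,000,000 versus 1,000,025 evaluations. Expression evaluation is only one part of query cost, so an end-to-end speedup can be far smaller than this ratio.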

Example transformation

-- Before:
Project(a_expr = f(a), b_expr = g(b), mixed = h(a, b))
  CrossJoin
    Left(a)
    Right(b)

-- After:
Project(a_expr, b_expr, mixed = h(a, b))
  CrossJoin
    Project(a_expr = f(a), a)
      Left(a)
    Project(b_expr = g(b), b)
      Right(b)
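
The classification behind this transformation can be sketched in a simplified model. This is not the PR's code: `ProjectionClassifier`, `Expr`, and `Target` are hypothetical names, an expression is reduced to the set of variables it references plus a determinism flag, and identity handling is omitted. Constants are kept above the join, matching the behavior described in the review comments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of how assignments are split by join side.
final class ProjectionClassifier
{
    enum Target { LEFT, RIGHT, ABOVE }

    // Stand-in for a projection expression: referenced variables plus a determinism flag.
    static final class Expr
    {
        final Set<String> variables;
        final boolean deterministic;

        Expr(boolean deterministic, String... variables)
        {
            this.deterministic = deterministic;
            this.variables = new HashSet<>(Arrays.asList(variables));
        }
    }

    static Target classify(Expr expr, Set<String> leftVars, Set<String> rightVars)
    {
        // Non-deterministic expressions and constants (no variables) stay above the join.
        if (!expr.deterministic || expr.variables.isEmpty()) {
            return Target.ABOVE;
        }
        if (leftVars.containsAll(expr.variables)) {
            return Target.LEFT;
        }
        if (rightVars.containsAll(expr.variables)) {
            return Target.RIGHT;
        }
        // The expression references both sides, so it cannot be pushed.
        return Target.ABOVE;
    }

    static Map<String, Target> classifyAll(Map<String, Expr> assignments, Set<String> leftVars, Set<String> rightVars)
    {
        Map<String, Target> result = new LinkedHashMap<>();
        for (Map.Entry<String, Expr> entry : assignments.entrySet()) {
            result.put(entry.getKey(), classify(entry.getValue(), leftVars, rightVars));
        }
        return result;
    }
}
```

In the example plan, `f(a)` would map to `LEFT`, `g(b)` to `RIGHT`, and `h(a, b)` to `ABOVE`. An expression such as `random()` stays above because pushing it down would compute one value per input row and repeat it across joined rows.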

Benchmark results (lineitem CROSS JOIN nation with regex projections)

| Scenario | Time | Speedup |
| --- | --- | --- |
| Baseline (off) | 6142 ms | 1.00x |
| Optimized (on) | 4440 ms | 1.38x |

Test plan

  • 11 unit tests in TestPushProjectionThroughCrossJoin covering:
    • Left-only, right-only, both-sides projection pushdown
    • Mixed pushable and non-pushable assignments
    • Does not fire on: mixed-only projections, identity-only, disabled session property, non-cross joins (equi-join criteria), joins with filter, non-deterministic expressions
    • Pushes deterministic expressions while keeping non-deterministic expressions above the join in the same plan
  • E2E integration tests in AbstractTestQueries.testPushProjectionThroughCrossJoin executing real queries with optimization on vs off
  • All 522 TestLocalQueries tests pass
  • Benchmark in BenchmarkPushProjectionThroughCrossJoin using HiveDistributedBenchmarkRunner

== NO RELEASE NOTE ==

Summary by Sourcery

Add a new optimizer rule that conditionally pushes deterministic, single-side projections below cross joins, controlled by a session property, and validate it with planner tests, query tests, and a benchmark.

New Features:

  • Introduce the PushProjectionThroughCrossJoin iterative optimizer rule to evaluate eligible projections on join inputs instead of cross-join results.
  • Add a session-configurable system property to enable or disable pushing projections through cross joins.

Tests:

  • Add unit tests for the PushProjectionThroughCrossJoin rule covering pushable, non-pushable, and mixed projection scenarios and session gating.
  • Extend query integration tests to verify identical results with the optimization enabled and disabled on real CROSS JOIN queries.
  • Introduce a Hive benchmark that measures the performance impact of pushing projections through cross joins.

@kaikalur kaikalur requested review from a team, elharo, feilong-liu and jaystarshot as code owners March 18, 2026 14:10

sourcery-ai bot commented Mar 18, 2026

Reviewer's Guide

Introduces a new iterative optimizer rule PushProjectionThroughCrossJoin, gated by a session property, to push deterministic single-side projections below cross joins for performance, with corresponding planner rule tests, query integration tests, configuration wiring, and a Hive benchmark.

Sequence diagram for applying PushProjectionThroughCrossJoin during query planning

sequenceDiagram
    participant Client
    participant Planner
    participant IterativeOptimizer
    participant Rule_PushProjectionThroughCrossJoin as PushProjectionThroughCrossJoin
    participant Session
    participant SystemSessionProperties

    Client->>Planner: Submit query with session
    Planner->>IterativeOptimizer: optimize(plan, session)

    loop optimization_iterations
        IterativeOptimizer->>Rule_PushProjectionThroughCrossJoin: getPattern()
        IterativeOptimizer->>IterativeOptimizer: find Project over CrossJoin
        alt pattern_matches
            IterativeOptimizer->>Rule_PushProjectionThroughCrossJoin: isEnabled(session)
            Rule_PushProjectionThroughCrossJoin->>SystemSessionProperties: isPushProjectionThroughCrossJoin(session)
            SystemSessionProperties-->>Rule_PushProjectionThroughCrossJoin: enabled_flag
            alt enabled_flag == true
                IterativeOptimizer->>Rule_PushProjectionThroughCrossJoin: apply(project_over_cross_join, captures, context)
                Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: classify assignments (left_only/right_only/mixed)
                Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: skip non_deterministic and identity
                Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: create left Project under CrossJoin
                Rule_PushProjectionThroughCrossJoin->>Rule_PushProjectionThroughCrossJoin: create right Project under CrossJoin
                Rule_PushProjectionThroughCrossJoin-->>IterativeOptimizer: Result.ofPlanNode(new_plan)
                IterativeOptimizer->>IterativeOptimizer: replace subplan with new_plan
            else enabled_flag == false
                Rule_PushProjectionThroughCrossJoin-->>IterativeOptimizer: Result.empty()
            end
        else no_match
            IterativeOptimizer->>IterativeOptimizer: skip rule
        end
    end

    IterativeOptimizer-->>Planner: optimized_plan
    Planner-->>Client: execute optimized_plan
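
The control flow in the diagram can be approximated with a toy loop. This is only a sketch under strong assumptions: plans are plain strings, the session is a map, and `ToyIterativeOptimizer`, `Rule`, and `gatedRewrite` are hypothetical names that do not correspond to Presto's actual optimizer classes.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical toy model of a rule-driven optimizer loop; not Presto's implementation.
final class ToyIterativeOptimizer
{
    interface Rule
    {
        boolean matches(String plan);

        boolean isEnabled(Map<String, String> session);

        Optional<String> apply(String plan);
    }

    // Toy rule gated by a session property; rewrites one exact plan string into another.
    // The two plan strings must differ, otherwise optimize() would never terminate.
    static Rule gatedRewrite(String property, String from, String to)
    {
        return new Rule() {
            @Override
            public boolean matches(String plan)
            {
                return plan.equals(from);
            }

            @Override
            public boolean isEnabled(Map<String, String> session)
            {
                return "true".equals(session.get(property));
            }

            @Override
            public Optional<String> apply(String plan)
            {
                return Optional.of(to);
            }
        };
    }

    // Applies rules repeatedly until no rule changes the plan.
    static String optimize(String plan, Map<String, String> session, List<Rule> rules)
    {
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Rule rule : rules) {
                if (!rule.matches(plan) || !rule.isEnabled(session)) {
                    continue; // pattern mismatch or rule disabled: skip it
                }
                Optional<String> rewritten = rule.apply(plan);
                if (rewritten.isPresent()) {
                    plan = rewritten.get();
                    progress = true;
                }
            }
        }
        return plan;
    }
}
```

With the property unset, the plan comes back unchanged, which mirrors the rule being disabled by default.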

Class diagram for the PushProjectionThroughCrossJoin optimizer rule

classDiagram
    class Rule {
      <<interface>>
      +getPattern() Pattern
      +isEnabled(session Session) boolean
      +apply(project ProjectNode, captures Captures, context Context) Result
    }

    class PushProjectionThroughCrossJoin {
      -Capture CHILD
      -Pattern PATTERN
      -DeterminismEvaluator determinismEvaluator
      +PushProjectionThroughCrossJoin(functionAndTypeManager FunctionAndTypeManager)
      +getPattern() Pattern
      +isEnabled(session Session) boolean
      +apply(project ProjectNode, captures Captures, context Context) Result
      -computeVariablesNeededFromSide(topAssignments Assignments, sideVariables Set) Set
      -createChildProjectIfNeeded(context Context, child PlanNode, assignments Assignments) PlanNode
    }

    class ProjectNode {
      +getAssignments() Assignments
      +getSource() PlanNode
      +getLocality() Locality
    }

    class JoinNode {
      +isCrossJoin() boolean
      +getLeft() PlanNode
      +getRight() PlanNode
      +getCriteria() List
      +getFilter() RowExpression
      +getLeftHashVariable() VariableReferenceExpression
      +getRightHashVariable() VariableReferenceExpression
      +getDistributionType() Object
      +getDynamicFilters() Object
      +getOutputVariables() List~VariableReferenceExpression~
    }

    class PlanNode {
      +getOutputVariables() List~VariableReferenceExpression~
      +getSourceLocation() Object
    }

    class Assignments {
      +entrySet() Set
      +getExpressions() List~RowExpression~
      +builder() AssignmentsBuilder
    }

    class AssignmentsBuilder {
      +put(outputVar VariableReferenceExpression, expression RowExpression) AssignmentsBuilder
      +build() Assignments
    }

    class DeterminismEvaluator {
      +isDeterministic(expression RowExpression) boolean
    }

    class RowExpressionDeterminismEvaluator {
      +RowExpressionDeterminismEvaluator(functionAndTypeManager FunctionAndTypeManager)
    }

    class VariableReferenceExpression {
    }

    class RowExpression {
    }

    class Context {
      +getIdAllocator() IdAllocator
    }

    class IdAllocator {
      +getNextId() Object
    }

    class Session {
    }

    class SystemSessionProperties {
      +isPushProjectionThroughCrossJoin(session Session) boolean
    }

    class FunctionAndTypeManager {
    }

    class Pattern {
    }

    class Capture {
    }

    class Captures {
      +get(capture Capture) JoinNode
    }

    class Result {
      +empty() Result
      +ofPlanNode(planNode PlanNode) Result
    }

    Rule <|.. PushProjectionThroughCrossJoin
    DeterminismEvaluator <|-- RowExpressionDeterminismEvaluator

    PushProjectionThroughCrossJoin --> DeterminismEvaluator
    PushProjectionThroughCrossJoin --> Pattern
    PushProjectionThroughCrossJoin --> Capture
    PushProjectionThroughCrossJoin --> ProjectNode
    PushProjectionThroughCrossJoin --> JoinNode
    PushProjectionThroughCrossJoin --> Assignments
    PushProjectionThroughCrossJoin --> PlanNode
    PushProjectionThroughCrossJoin --> Context
    PushProjectionThroughCrossJoin --> Captures
    PushProjectionThroughCrossJoin --> Result
    PushProjectionThroughCrossJoin --> Session
    PushProjectionThroughCrossJoin --> FunctionAndTypeManager

    SystemSessionProperties --> Session
    RowExpressionDeterminismEvaluator --> FunctionAndTypeManager
    Assignments --> AssignmentsBuilder
    Context --> IdAllocator

File-Level Changes

Change: Add PushProjectionThroughCrossJoin iterative optimizer rule that pushes deterministic single-side projections below cross joins while preserving semantics.
Details:
  • Define a Project-over-CrossJoin pattern and enable the rule only when the push_projection_through_cross_join session property is true.
  • Use a DeterminismEvaluator to skip non-deterministic expressions from pushdown and classify assignments into left-only, right-only, or mixed/identity.
  • Build new child ProjectNodes with pushed-down assignments and identities, reconstruct a JoinNode with updated children, and keep a top ProjectNode that reuses pushed symbols where needed.
  • Avoid creating redundant child projects by skipping them when all assignments are simple identities.
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java

Change: Wire the new optimization up to configuration and planner pipeline and expose it as a session property.
Details:
  • Add a pushProjectionThroughCrossJoin boolean to FeaturesConfig with config key optimizer.push-projection-through-cross-join.
  • Introduce PUSH_PROJECTION_THROUGH_CROSS_JOIN system property name and booleanProperty with default from FeaturesConfig, plus an accessor isPushProjectionThroughCrossJoin(Session).
  • Register PushProjectionThroughCrossJoin in the PlanOptimizers iterative optimizer alongside other projection-pushdown rules.
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
  • presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java

Change: Add focused unit tests for the new rule and integration tests to validate behavior on real CROSS JOIN queries.
Details:
  • Create TestPushProjectionThroughCrossJoin with cases for left-only, right-only, both-sides pushdown, mixed pushable/non-pushable projections, disabled session property, non-cross joins, joins with filter, and handling of deterministic vs non-deterministic expressions.
  • In AbstractTestQueries, add testPushProjectionThroughCrossJoin that runs a suite of real CROSS JOIN queries with the optimization enabled vs disabled and asserts same results.
Files:
  • presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestPushProjectionThroughCrossJoin.java
  • presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java

Change: Introduce a Hive benchmark to measure performance impact of pushing projections through cross joins.
Details:
  • Add BenchmarkPushProjectionThroughCrossJoin that runs a CROSS JOIN between lineitem and nation with several regex/string projections, with scenarios for optimization on and off.
  • Use HiveDistributedBenchmarkRunner to run both scenarios with verification and allow CLI execution via main().
Files:
  • presto-hive/src/test/java/com/facebook/presto/hive/benchmark/BenchmarkPushProjectionThroughCrossJoin.java

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 3 issues, and left some high level feedback:

  • In PushProjectionThroughCrossJoin.apply, each projection expression is passed through extractUnique once during classification and again in computeVariablesNeededFromSide; consider caching the referenced variable sets per expression to avoid redundant scans of the expression tree in larger plans.
  • Currently constant expressions are always kept above the join; if you want to maximize pushdown, you could treat constants as pushable to either side (e.g., left by convention) since they don't depend on join variables and are deterministic by construction.
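
The caching idea in the first point could look roughly like the following. It is a sketch, not the PR's code: `CachedVariableExtractor` is a hypothetical wrapper, the wrapped function stands in for whatever extracts referenced variables, and the expression type is assumed to have suitable `equals` and `hashCode`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical memoizing wrapper around a variable-extraction function.
final class CachedVariableExtractor<E>
{
    private final Function<E, Set<String>> extractor;
    private final Map<E, Set<String>> cache = new HashMap<>();
    private int scans;

    CachedVariableExtractor(Function<E, Set<String>> extractor)
    {
        this.extractor = extractor;
    }

    // Returns the referenced variables, scanning each distinct expression at most once.
    Set<String> variablesOf(E expression)
    {
        Set<String> cached = cache.get(expression);
        if (cached == null) {
            scans++;
            cached = extractor.apply(expression);
            cache.put(expression, cached);
        }
        return cached;
    }

    // Number of times the underlying extractor actually ran.
    int scanCount()
    {
        return scans;
    }
}
```

Both the classification pass and the later "variables needed from each side" pass could then call `variablesOf` on the same instance, so each expression tree is walked once.
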
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java" line_range="225-230" />
<code_context>
+        return needed.build();
+    }
+
+    private static PlanNode createChildProjectIfNeeded(
+            Context context,
+            PlanNode child,
+            Assignments assignments)
+    {
+        // If assignments are all identity for the child's existing outputs, skip
+        if (assignments.entrySet().stream().allMatch(
+                e -> e.getValue() instanceof VariableReferenceExpression
+                        && e.getValue().equals(e.getKey()))) {
</code_context>
<issue_to_address>
**suggestion (performance):** Identity-only detection should also check that the assignment keys match the child outputs to enable column pruning.

Currently the early-return only verifies that each assignment is an identity, but not that the assignment keys cover all of the child’s output variables. If `assignments` is a strict subset of the child outputs (i.e., we’re pruning unused symbols), this will incorrectly skip creating a projection and miss the chance to narrow the output. Consider requiring `assignments.size() == child.getOutputVariables().size()` in addition to the identity check, or explicitly verifying that the assignment key set equals the child’s output variables, so we still build a project when pruning columns.

```suggestion
        // If assignments are all identity and cover exactly the child's existing outputs, skip
        if (assignments.size() == child.getOutputVariables().size()
                && assignments.entrySet().stream().allMatch(
                        e -> e.getValue() instanceof VariableReferenceExpression
                                && e.getValue().equals(e.getKey())
                                && child.getOutputVariables().contains(e.getKey()))) {
            return child;
        }
```
</issue_to_address>
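
A self-contained sketch of the stricter check suggested above, using plain strings instead of Presto's expression types. `IdentityProjectCheck` and `isRedundant` are hypothetical names, and whether the merged code adopted this suggestion is not shown in this conversation.

```java
import java.util.List;
import java.util.Map;

// Hypothetical simplification: assignments map an output name to the text of its expression.
final class IdentityProjectCheck
{
    private IdentityProjectCheck() {}

    // A child project is redundant only if every assignment is an identity
    // and the assignments cover exactly the child's existing outputs.
    static boolean isRedundant(Map<String, String> assignments, List<String> childOutputs)
    {
        return assignments.size() == childOutputs.size()
                && assignments.entrySet().stream().allMatch(
                        e -> e.getValue().equals(e.getKey())
                                && childOutputs.contains(e.getKey()));
    }
}
```

For a child with outputs `a` and `b`, the assignments `{a -> a}` are all identities but are not redundant under this check, because the project would still prune `b`.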

### Comment 2
<location path="presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/PushProjectionThroughCrossJoin.java" line_range="158-167" />
<code_context>
+
+        // Build left child: add identity for all left variables needed by the top project
+        // plus any left-only computed expressions
+        Set<VariableReferenceExpression> leftVarsNeededAbove = computeVariablesNeededFromSide(
+                topProjections.build(), leftVariables);
+        for (VariableReferenceExpression var : leftVarsNeededAbove) {
+            leftProjections.put(var, var);
+        }
+        PlanNode newLeft = createChildProjectIfNeeded(
+                context, crossJoin.getLeft(), leftProjections.build());
+
+        // Build right child: add identity for all right variables needed by the top project
+        // plus any right-only computed expressions
+        Set<VariableReferenceExpression> rightVarsNeededAbove = computeVariablesNeededFromSide(
+                topProjections.build(), rightVariables);
+        for (VariableReferenceExpression var : rightVarsNeededAbove) {
</code_context>
<issue_to_address>
**nitpick (performance):** Avoid rebuilding the same top assignments map twice when computing needed variables.

`topProjections.build()` allocates an immutable map each time. Build it once, store it in a local, and reuse it for both `computeVariablesNeededFromSide` calls to avoid the extra allocation and iteration.
</issue_to_address>
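
As a rough illustration of this nitpick, the sketch below computes the variables needed from each side from a single, already-built view of the top assignments. It is a simplified model: `NeededVariables` and `neededFromSide` are hypothetical, and each top-level expression is represented only by the set of variables it references.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simplification of "which variables must each join side still output".
final class NeededVariables
{
    private NeededVariables() {}

    // topAssignments: output name -> variables referenced by that top-level expression.
    static Set<String> neededFromSide(Map<String, Set<String>> topAssignments, Set<String> sideVariables)
    {
        Set<String> needed = new LinkedHashSet<>();
        for (Set<String> referenced : topAssignments.values()) {
            for (String variable : referenced) {
                if (sideVariables.contains(variable)) {
                    needed.add(variable);
                }
            }
        }
        return needed;
    }
}
```

A caller would build the top assignments once, keep them in a local variable, and pass that same object to `neededFromSide` for the left side and again for the right side.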

### Comment 3
<location path="presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestPushProjectionThroughCrossJoin.java" line_range="130" />
<code_context>
+    }
+
+    @Test
+    public void testDoesNotFireOnMixedProjections()
+    {
+        // All projections reference both sides — nothing to push
</code_context>
<issue_to_address>
**suggestion (testing):** Add an explicit test for constant-only projections to lock in the "do not push constants" behavior.

The rule treats expressions that reference both sides or are constants as non-pushable. We already cover the mixed `(a + b)` case in `testDoesNotFireOnMixedProjections`, but we’re missing a constant-only case (e.g., `BIGINT '42'`). Please add a test like `testDoesNotFireOnConstantProjection` that builds a Project with only a constant expression and asserts `.doesNotFire()` to capture this behavior.
</issue_to_address>

@kaikalur kaikalur force-pushed the push-projection-through-cross-join branch 3 times, most recently from feb3190 to 347c984 Compare March 20, 2026 20:35
@kaikalur (Contributor Author)

@feilong-liu Could you take a look at this PR when you get a chance? It adds a new optimizer rule to push projections through cross joins. Thanks!

@steveburnett steveburnett left a comment

Please include documentation in this PR for the new session property push_projection_through_cross_join.

As described in Designing Your Code in CONTRIBUTING.md:

"All new language features, new functions, session and config properties, and major features have documentation added"

@kaikalur kaikalur force-pushed the push-projection-through-cross-join branch from 347c984 to eba1024 Compare March 25, 2026 00:36
@steveburnett steveburnett left a comment

LGTM! (docs)

Pull branch, local doc build, looks good. Thank you!

@kaikalur kaikalur force-pushed the push-projection-through-cross-join branch from e76ce1e to 6b584aa Compare March 25, 2026 17:54
@feilong-liu feilong-liu merged commit 94e4f47 into prestodb:master Mar 27, 2026
86 of 88 checks passed