fix: Guard JoinPrefilter against non-deterministic expressions (#27312)#27312
Merged
singcha merged 1 commit intoprestodb:masterfrom Mar 19, 2026
Merged
fix: Guard JoinPrefilter against non-deterministic expressions (#27312)#27312singcha merged 1 commit intoprestodb:masterfrom
singcha merged 1 commit intoprestodb:masterfrom
Conversation
Contributor
Reviewer's GuideAdds a determinism-aware guard to the JoinPrefilter optimizer so it only clones scan/filter/project left subtrees when all predicates and projections are deterministic, and adds a regression test to verify behavior with TABLESAMPLE BERNOULLI vs deterministic joins. Sequence diagram for applying JoinPrefilter with determinism guardsequenceDiagram
participant Optimizer as JoinPrefilter
participant LeftPlan as PlanNode_left
participant PlannerUtils
participant Determinism as RowExpressionDeterminismEvaluator
Optimizer->>LeftPlan: rewriteWith(this)
activate LeftPlan
LeftPlan-->>Optimizer: rewrittenLeft
deactivate LeftPlan
Optimizer->>PlannerUtils: isScanFilterProject(rewrittenLeft)
PlannerUtils-->>Optimizer: boolean isScanFilterProject
Optimizer->>PlannerUtils: isDeterministicScanFilterProject(rewrittenLeft, functionAndTypeManager)
activate PlannerUtils
PlannerUtils->>Determinism: RowExpressionDeterminismEvaluator(functionAndTypeManager)
activate Determinism
Determinism-->>PlannerUtils: instance
loop scan_filter_project subtree
PlannerUtils->>Determinism: isDeterministic(predicate_or_projection_expression)
Determinism-->>PlannerUtils: boolean isDeterministic
end
PlannerUtils-->>Optimizer: boolean isDeterministicScanFilterProject
deactivate Determinism
deactivate PlannerUtils
alt LEFT or INNER join and scan_filter_project and deterministic and has criteria
Optimizer->>Optimizer: build bloom filter and SemiJoin
else otherwise
Optimizer->>Optimizer: skip JoinPrefilter transformation
end
Class diagram for determinism guard in JoinPrefilter optimizerclassDiagram
class PlanNode
class TableScanNode
class FilterNode
class ProjectNode
class JoinNode
PlanNode <|-- TableScanNode
PlanNode <|-- FilterNode
PlanNode <|-- ProjectNode
PlanNode <|-- JoinNode
class PlannerUtils {
+static boolean isScanFilterProject(PlanNode node)
+static boolean isDeterministicScanFilterProject(PlanNode node, FunctionAndTypeManager functionAndTypeManager)
-static boolean isDeterministicPlanSubtree(PlanNode node, DeterminismEvaluator determinismEvaluator)
}
class JoinPrefilter {
-FunctionAndTypeManager functionAndTypeManager
+PlanNode visitJoin(JoinNode node, RewriteContext context)
}
class DeterminismEvaluator {
+boolean isDeterministic(RowExpression expression)
}
class RowExpressionDeterminismEvaluator {
+RowExpressionDeterminismEvaluator(FunctionAndTypeManager functionAndTypeManager)
+boolean isDeterministic(RowExpression expression)
}
class FunctionAndTypeManager
class RowExpression
class RewriteContext
class EquiJoinClause {
+VariableReferenceExpression getLeft()
+VariableReferenceExpression getRight()
}
class VariableReferenceExpression {
+Type getType()
}
class Type
RowExpressionDeterminismEvaluator ..|> DeterminismEvaluator
PlannerUtils ..> DeterminismEvaluator
PlannerUtils ..> RowExpressionDeterminismEvaluator
JoinPrefilter ..> PlannerUtils
JoinPrefilter ..> FunctionAndTypeManager
JoinPrefilter ..> JoinNode
JoinNode ..> EquiJoinClause
EquiJoinClause ..> VariableReferenceExpression
VariableReferenceExpression ..> Type
FilterNode ..> RowExpression
ProjectNode ..> RowExpression
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
isDeterministicScanFilterProjectassumes the input is a scan-filter-project tree and returnsfalsefor any other node type; consider enforcing this precondition explicitly (e.g., viacheckArgument(isScanFilterProject(node))) or handling unexpected node types more clearly to avoid silent behavior changes if the method is reused.- In
JoinPrefilter,RowExpressionDeterminismEvaluatoris effectively re-created for every call viaisDeterministicScanFilterProject; consider constructing and caching a single evaluator inJoinPrefilter(or passing one in) to avoid repeated allocations on hot planning paths.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- `isDeterministicScanFilterProject` assumes the input is a scan-filter-project tree and returns `false` for any other node type; consider enforcing this precondition explicitly (e.g., via `checkArgument(isScanFilterProject(node))`) or handling unexpected node types more clearly to avoid silent behavior changes if the method is reused.
- In `JoinPrefilter`, `RowExpressionDeterminismEvaluator` is effectively re-created for every call via `isDeterministicScanFilterProject`; consider constructing and caching a single evaluator in `JoinPrefilter` (or passing one in) to avoid repeated allocations on hot planning paths.
## Individual Comments
### Comment 1
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java" line_range="7871-7880" />
<code_context>
+ String testQuery = "SELECT count(*) from orders TABLESAMPLE BERNOULLI (50) join lineitem using(orderkey)";
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a complementary test where the TABLESAMPLE BERNOULLI is applied on the right side to ensure JoinPrefilter still applies as expected.
Since the regression only involves non-determinism on the left side and the guard only inspects the left scan-filter-project subtree, please also add a test with `TABLESAMPLE BERNOULLI` on the right side (e.g., `orders join lineitem TABLESAMPLE BERNOULLI(50) using(orderkey)`) and assert that a SemiJoin is still present. This will document the intended asymmetry and protect against future changes that might incorrectly apply the check to both sides and disable the optimization.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java
Outdated
Show resolved
Hide resolved
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…ions (prestodb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
bdf513c to
3a14f56
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…ions (prestodb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
3a14f56 to
9c289d1
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…odb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
9c289d1 to
b90bf98
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…odb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Differential Revision: D95575024
b90bf98 to
62a6fac
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…odb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Differential Revision: D95575024
62a6fac to
5e86fea
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…odb#27312) Summary: Pull Request resolved: prestodb#27312 The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Differential Revision: D95575024
5e86fea to
d2d8ab2
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…odb#27312) Summary: Pull Request resolved: prestodb#27312 The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Differential Revision: D95575024
d2d8ab2 to
3621512
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 12, 2026
…odb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
3621512 to
7b7bc3b
Compare
kaikalur
approved these changes
Mar 13, 2026
feilong-liu
approved these changes
Mar 13, 2026
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 13, 2026
…odb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
7b7bc3b to
8a6003f
Compare
adheer-araokar
added a commit
to adheer-araokar/presto
that referenced
this pull request
Mar 13, 2026
…odb#27312) Summary: Pull Request resolved: prestodb#27312 The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
8a6003f to
9d4fdce
Compare
…odb#27312) Summary: The JoinPrefilter optimizer clones the left side of a join to build a bloom filter for pre-filtering the right side. When the left side contains non-deterministic expressions (e.g., `rand() < 0.1` from `TABLESAMPLE BERNOULLI`), the two clones produce different random samples, causing the join to require rows to be in BOTH samples — effectively squaring the sampling rate (10% becomes 1%). This diff adds a determinism guard to the JoinPrefilter optimizer: 1. **PlannerUtils.java**: Adds `isDeterministicScanFilterProject()` which recursively checks that all filter predicates and project assignments in a scan-filter-project subtree are deterministic, using `RowExpressionDeterminismEvaluator`. 2. **JoinPrefilter.java**: Adds the determinism check to the `visitJoin()` condition, so the optimizer only clones the left subtree when it is safe to do so. 3. **AbstractTestQueries.java**: Adds `testJoinPrefilterSkippedForNonDeterministicExpressions` which verifies that: - With TABLESAMPLE BERNOULLI (non-deterministic), JoinPrefilter does NOT produce a SemiJoin - With deterministic joins, JoinPrefilter still produces a SemiJoin as expected Reviewed By: kaikalur Differential Revision: D95575024
9d4fdce to
f8b53a1
Compare
This was referenced Mar 31, 2026
15 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
The JoinPrefilter optimizer clones the left side of a join to build a bloom filter
for pre-filtering the right side. When the left side contains non-deterministic
expressions (e.g.,
rand() < 0.1fromTABLESAMPLE BERNOULLI), the two clonesproduce different random samples, causing the join to require rows to be in BOTH
samples — effectively squaring the sampling rate (10% becomes 1%).
This diff adds a determinism guard to the JoinPrefilter optimizer:
PlannerUtils.java: Adds
isDeterministicScanFilterProject()which recursivelychecks that all filter predicates and project assignments in a scan-filter-project
subtree are deterministic, using
RowExpressionDeterminismEvaluator.JoinPrefilter.java: Adds the determinism check to the
visitJoin()condition,so the optimizer only clones the left subtree when it is safe to do so.
AbstractTestQueries.java: Adds
testJoinPrefilterSkippedForNonDeterministicExpressionswhich verifies that:
Reviewed By: kaikalur
Differential Revision: D95575024