fix(planner): Fix filter stats estimation corner cases by aaneja · Pull Request #26812 · prestodb/presto

aaneja · 2025-12-16T14:14:35Z

Description

Fix for #26685
Fix for #26808

Motivation and Context

Impact

Test Plan

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... 
* ... 

Hive Connector Changes
* ... 
* ...

If release note is NOT required, use:

== NO RELEASE NOTE ==

sourcery-ai · 2025-12-16T14:14:42Z

Reviewer's Guide

Refines filter statistics estimation for IN, NOT IN, and <> predicates, especially for low/unknown NDV cases, and adds regression tests to ensure filter factors remain within sane bounds and avoid zero/NaN estimates.

Class diagram for updated filter statistics calculators

classDiagram
    class ComparisonStatsCalculator {
        +PlanNodeStatsEstimate estimateExpressionNotEqualToLiteral(expression, literal, expressionStatistics, inputStatistics, expressionVariable)
    }

    class FilterStatsCalculator {
        +static double CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT
        +static double UNKNOWN_FILTER_COEFFICIENT
        +PlanNodeStatsEstimate visitInPredicate(node, context)
        +PlanNodeStatsEstimate estimateIn(value, values, input, session)
    }

    class PlanNodeStatsEstimate {
        +double outputRowCount
        +PlanNodeStatsEstimate buildFrom(inputStatistics)
        +PlanNodeStatsEstimate$Builder setOutputRowCount(rowCount)
        +PlanNodeStatsEstimate$Builder addVariableStatistics(variable, variableStatsEstimate)
    }

    class VariableStatsEstimate {
        +double nullsFraction
        +double distinctValuesCount
        +VariableStatsEstimate$Builder buildFrom(variableStatsEstimate)
        +VariableStatsEstimate$Builder setNullsFraction(nullsFraction)
        +VariableStatsEstimate$Builder setDistinctValuesCount(distinctValuesCount)
    }

    ComparisonStatsCalculator --> PlanNodeStatsEstimate : builds
    ComparisonStatsCalculator --> VariableStatsEstimate : updates
    FilterStatsCalculator --> PlanNodeStatsEstimate : builds
    FilterStatsCalculator --> VariableStatsEstimate : reads

    class StatisticRange {
        +double low
        +boolean lowInclusive
        +double high
        +boolean highInclusive
        +double distinctValuesCount
    }

    ComparisonStatsCalculator --> StatisticRange : uses

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Adjust NOT EQUAL filter estimation to handle NDV=1 specially and avoid over‑aggressive NDV reduction. | In ComparisonStatsCalculator, when estimating `<> literal` on an expression with NDV=1, use UNKNOWN_FILTER_COEFFICIENT instead of range-based filter factor. For expressions with NDV>1, keep using 1 - calculateFilterFactor() as before. When updating variable stats after `<>` filtering, keep NDV at 1 if original NDV was 1; otherwise decrease NDV by 1 but not below 0, and always set null fraction to 0. | `presto-main-base/src/main/java/com/facebook/presto/cost/ComparisonStatsCalculator.java` |
| Cap IN predicate selectivity and ensure NOT IN estimates never collapse to zero. | Introduce CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT (0.8) in FilterStatsCalculator to upper-bound row counts produced by IN predicate estimation relative to non-null input cardinality. Use this coefficient when computing outputRowCount for both AST-based InPredicate and RowExpression-based estimateIn methods, taking min(inEstimate, nonNullCount * coefficient). | `presto-main-base/src/main/java/com/facebook/presto/cost/FilterStatsCalculator.java` |
| Add regression tests for IN/NOT IN and NDV=1 corner cases in filter stats. | Add ndv1Expressions and inList data providers to cover multiple predicate shapes and IN list sizes. Add tests verifying IN predicate with unknown NDV but known null fraction yields a non-trivial filter (output between 0 and full non-null count, with specific expectations for 1-element and multi-element IN lists). Add tests asserting NOT IN predicates never estimate zero output rows across various IN list sizes. Add tests validating `<>` predicates on variables with NDV=1 use UNKNOWN_FILTER_COEFFICIENT for row count and preserve NDV and null fraction. Adjust an existing IN predicate test to expect 720 instead of 900 rows, documenting that range estimate is capped by the IN coefficient. | `presto-main-base/src/test/java/com/facebook/presto/cost/AbstractTestFilterStatsCalculator.java` |
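
The IN cap described in the changes above, `min(inEstimate, nonNullCount * coefficient)`, can be sketched in isolation. This is a minimal illustration and not the PR's code: the class and method names are hypothetical, the coefficient value 0.8 is taken from the guide, and treating NOT IN as the non-null rows left over after IN is an assumption made only to show why a capped IN keeps NOT IN above zero. The input of 1000 rows with 10% nulls is made up; it happens to turn a 900-row estimate into 720.

```java
// Hypothetical, self-contained sketch; not the Presto source.
final class InPredicateCapSketch {
    // Value 0.8 comes from the reviewer's guide; the constant name here is illustrative.
    static final double IN_PREDICATE_UPPER_BOUND_COEFFICIENT = 0.8;

    /** Caps an IN row-count estimate at a fixed fraction of the non-null input rows. */
    static double cappedInRowCount(double inEstimate, double inputRowCount, double nullsFraction) {
        double nonNullCount = inputRowCount * (1 - nullsFraction);
        return Math.min(inEstimate, nonNullCount * IN_PREDICATE_UPPER_BOUND_COEFFICIENT);
    }

    /** Assumption: NOT IN is estimated as the non-null rows that the capped IN leaves over. */
    static double notInRowCount(double inEstimate, double inputRowCount, double nullsFraction) {
        double nonNullCount = inputRowCount * (1 - nullsFraction);
        return nonNullCount - cappedInRowCount(inEstimate, inputRowCount, nullsFraction);
    }

    public static void main(String[] args) {
        // Made-up input: 1000 rows, 10% nulls, uncapped IN estimate covering all 900 non-null rows.
        System.out.println(cappedInRowCount(900, 1000, 0.1));
        System.out.println(notInRowCount(900, 1000, 0.1));
    }
}
```

Under that assumption, any IN estimate at or above the cap leaves at least 20% of the non-null rows for NOT IN, so the complement cannot collapse to zero when the input has non-null rows.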

Possibly linked issues

#N/A: The PR introduces caps and tests so that IN never estimates the full range and NOT IN never estimates zero rows, directly addressing the issue’s scenarios.

presto-main-base/src/test/java/com/facebook/presto/cost/AbstractTestFilterStatsCalculator.java

sourcery-ai

Hey - I've found 1 issue, and left some high-level feedback:

- The new constant name `CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT` looks like a typo of `CEIL`; consider renaming it (and adjusting the comment wording) to avoid confusion about its intent.
- The test method `tesInPredicateWithoutNDV` appears to be missing a `t` in `test`; renaming it would keep naming consistent and clearer.
- In `estimateExpressionNotEqualToLiteral`, the NDV == 1 special case is handled in two places (filterFactor and `newNDV`); consider centralizing this logic or adding a brief comment tying the two together so the coupling is explicit for future maintainers.
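
One way the centralization suggested in the last point might look is sketched here. It is not the PR's implementation: the class, the result holder, and the `rangeFilterFactor` parameter (standing in for the result of `calculateFilterFactor`) are hypothetical, and the value 0.9 for `UNKNOWN_FILTER_COEFFICIENT` is an assumed placeholder. The rules themselves (NDV stays at 1 when it was 1, otherwise drops by 1 but not below 0) follow the reviewer's guide.

```java
// Hypothetical, self-contained sketch; not the Presto source.
final class NotEqualNdvSketch {
    // Assumed placeholder value; the real constant is defined in FilterStatsCalculator.
    static final double UNKNOWN_FILTER_COEFFICIENT = 0.9;

    /** Holds both outputs that depend on the NDV == 1 decision. */
    static final class NotEqualEstimate {
        final double filterFactor; // applied to the non-null input rows
        final double newNdv;       // distinct values remaining after the filter

        NotEqualEstimate(double filterFactor, double newNdv) {
            this.filterFactor = filterFactor;
            this.newNdv = newNdv;
        }
    }

    /** rangeFilterFactor stands in for calculateFilterFactor(expressionStatistics, filterRange). */
    static NotEqualEstimate estimate(double expressionNdv, double rangeFilterFactor) {
        // Single decision point, so the filter factor and the NDV update cannot drift apart.
        boolean singleDistinctValue = Double.compare(expressionNdv, 1D) == 0;
        if (singleDistinctValue) {
            return new NotEqualEstimate(UNKNOWN_FILTER_COEFFICIENT, 1);
        }
        return new NotEqualEstimate(1 - rangeFilterFactor, Math.max(expressionNdv - 1, 0));
    }
}
```

Returning both values from one branch makes the coupling explicit without needing a comment to tie two separate checks together.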

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The new constant name `CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT` looks like a typo of `CEIL`; consider renaming it (and adjusting the comment wording) to avoid confusion about its intent.
- The test method `tesInPredicateWithoutNDV` appears to be missing a `t` in `test`; renaming it would keep naming consistent and clearer.
- In `estimateExpressionNotEqualToLiteral`, the NDV == 1 special case is handled in two places (filterFactor and `newNDV`); consider centralizing this logic or adding a brief comment tying the two together so the coupling is explicit for future maintainers.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/cost/ComparisonStatsCalculator.java:110-119` </location>
<code_context>
+        double expressionNDV = expressionStatistics.getDistinctValuesCount();
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Use a consistent comparison approach for `expressionNDV` to avoid subtle floating-point discrepancies.

Here you mix `Double.compare(expressionNDV, 1D) == 0` with `expressionNDV == 1`. For doubles coming from stats, this can diverge in edge cases (`-0.0`, rounding, NaN). Please use a single, consistent pattern (e.g., `Double.compare(expressionNDV, 1D) == 0` or an epsilon-based check) everywhere `expressionNDV` is compared to 1.

Suggested implementation:

```java
        double filterFactor;
        double expressionNDV = expressionStatistics.getDistinctValuesCount();
        if (Double.compare(expressionNDV, 1D) == 0) {
            // It's hard to make a meaningful estimate when we have only one distinct value
            filterFactor = UNKNOWN_FILTER_COEFFICIENT;
        }
        else {
            filterFactor = 1 - calculateFilterFactor(expressionStatistics, filterRange);
        }

        PlanNodeStatsEstimate.Builder estimate = PlanNodeStatsEstimate.buildFrom(inputStatistics);
        estimate.setOutputRowCount(filterFactor * (1 - expressionStatistics.getNullsFraction()) * inputStatistics.getOutputRowCount());

```

```java
        if (Double.compare(expressionNDV, 1D) == 0) {

```

If there are other comparisons in this file (or related cost calculators) using `expressionNDV == 1`, `expressionNDV == 1.0`, or similar, they should also be updated to `Double.compare(expressionNDV, 1D) == 0` for consistency with this change.
</issue_to_address>
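
For reference, the two comparison forms mentioned in this comment can be checked side by side. The sketch below is illustrative only (class and method names are hypothetical) and relies solely on the documented behavior of `==` and `Double.compare` in Java: the two differ for NaN compared with NaN and for signed zeros, while a comparison against the constant 1 gives the same answer either way.

```java
// Hypothetical, self-contained sketch; not the Presto source.
final class DoubleCompareSketch {
    static boolean equalsOneByOperator(double ndv) {
        return ndv == 1;
    }

    static boolean equalsOneByCompare(double ndv) {
        return Double.compare(ndv, 1D) == 0;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        double negativeZero = -0.0;

        // Compared against the constant 1, both forms agree, including for NaN.
        System.out.println(equalsOneByOperator(1.0) == equalsOneByCompare(1.0)); // true
        System.out.println(equalsOneByOperator(nan) == equalsOneByCompare(nan)); // true

        // The forms diverge only for NaN-vs-NaN and for signed zeros.
        System.out.println(nan == nan);                            // false
        System.out.println(Double.compare(nan, nan) == 0);         // true
        System.out.println(0.0 == negativeZero);                   // true
        System.out.println(Double.compare(0.0, negativeZero) == 0); // false
    }
}
```

Picking one form throughout is therefore mainly a readability and consistency choice for the NDV == 1 check.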

presto-main-base/src/main/java/com/facebook/presto/cost/ComparisonStatsCalculator.java

steveburnett · 2026-01-06T18:45:47Z

Nit: when you can, edit the Release Notes section of the description to == NO RELEASE NOTE ==.

aaneja · 2026-01-15T16:22:31Z

@arhimondr Can you help review? I see from the git history that you've worked on the same classes.

tdcmeehan

LGTM. @arhimondr please review if you have time.

prestodb-ci added the from:IBM label (PR from IBM) Dec 16, 2025

aaneja changed the title ~~Filter stats bugfixes~~ fix(planner): Fix filter stats estimation corner cases Dec 16, 2025

aaneja commented Dec 16, 2025

presto-main-base/src/test/java/com/facebook/presto/cost/AbstractTestFilterStatsCalculator.java (resolved)

aaneja commented Dec 16, 2025

presto-main-base/src/test/java/com/facebook/presto/cost/AbstractTestFilterStatsCalculator.java (outdated, resolved)

aaneja force-pushed the filterStatsBug branch from e55b354 to 272f4cb Compare December 19, 2025 06:23

aaneja marked this pull request as ready for review December 20, 2025 12:21

aaneja requested review from a team, feilong-liu, jaystarshot and vivek-bharathan as code owners December 20, 2025 12:21

prestodb-ci requested review from a team, Dilli-Babu-Godari and Joe-Abraham and removed request for a team December 20, 2025 12:21

sourcery-ai bot reviewed Dec 20, 2025

presto-main-base/src/main/java/com/facebook/presto/cost/ComparisonStatsCalculator.java (resolved)

aaneja added 3 commits January 15, 2026 21:47

fix(planner): Fix NOT_EQUAL selectivity for variables with NDV 1

1a210e5

Fixes prestodb#26685

fix(planner): Fix selectivity over estimation for IN predicates

c1ae6f1

Fixes prestodb#26808

Misc bugfixes

8c76b14

- Typo fixes
- Fix IN estimate to a lower bound of 1.0 rows

aaneja force-pushed the filterStatsBug branch from 646a4c7 to 8c76b14 Compare January 15, 2026 16:18

aaneja requested a review from arhimondr January 15, 2026 16:21

tdcmeehan approved these changes Jan 16, 2026

aaneja merged commit 93def8b into prestodb:master Feb 4, 2026
80 checks passed

This was referenced Mar 31, 2026

docs: Add release notes for 0.297 unix280/presto#51

Closed

docs: Add release notes for 0.297 unix280/presto#52

Open

prestodb-ci mentioned this pull request Apr 1, 2026

docs: Add release notes for 0.297 #27484

Open

15 tasks
