
fix(planner): Fix filter stats estimation corner cases#26812

Merged
aaneja merged 3 commits into prestodb:master from aaneja:filterStatsBug
Feb 4, 2026

Conversation

@aaneja
Contributor

@aaneja aaneja commented Dec 16, 2025

Description

Fix for #26685
Fix for #26808

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... 
* ... 

Hive Connector Changes
* ... 
* ... 

If release note is NOT required, use:

== NO RELEASE NOTE ==

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Dec 16, 2025
@sourcery-ai
Contributor

sourcery-ai bot commented Dec 16, 2025

Reviewer's Guide

Refines filter statistics estimation for IN, NOT IN, and <> predicates, especially for low/unknown NDV cases, and adds regression tests to ensure filter factors remain within sane bounds and avoid zero/NaN estimates.

Class diagram for updated filter statistics calculators

classDiagram
    class ComparisonStatsCalculator {
        +PlanNodeStatsEstimate estimateExpressionNotEqualToLiteral(expression, literal, expressionStatistics, inputStatistics, expressionVariable)
    }

    class FilterStatsCalculator {
        +static double CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT
        +static double UNKNOWN_FILTER_COEFFICIENT
        +PlanNodeStatsEstimate visitInPredicate(node, context)
        +PlanNodeStatsEstimate estimateIn(value, values, input, session)
    }

    class PlanNodeStatsEstimate {
        +double outputRowCount
        +PlanNodeStatsEstimate buildFrom(inputStatistics)
        +PlanNodeStatsEstimate$Builder setOutputRowCount(rowCount)
        +PlanNodeStatsEstimate$Builder addVariableStatistics(variable, variableStatsEstimate)
    }

    class VariableStatsEstimate {
        +double nullsFraction
        +double distinctValuesCount
        +VariableStatsEstimate$Builder buildFrom(variableStatsEstimate)
        +VariableStatsEstimate$Builder setNullsFraction(nullsFraction)
        +VariableStatsEstimate$Builder setDistinctValuesCount(distinctValuesCount)
    }

    ComparisonStatsCalculator --> PlanNodeStatsEstimate : builds
    ComparisonStatsCalculator --> VariableStatsEstimate : updates
    FilterStatsCalculator --> PlanNodeStatsEstimate : builds
    FilterStatsCalculator --> VariableStatsEstimate : reads

    class StatisticRange {
        +double low
        +boolean lowInclusive
        +double high
        +boolean highInclusive
        +double distinctValuesCount
    }

    ComparisonStatsCalculator --> StatisticRange : uses

File-Level Changes

Change Details Files
Adjust NOT EQUAL filter estimation to handle NDV=1 specially and avoid over‑aggressive NDV reduction.
  • In ComparisonStatsCalculator, when estimating <> literal on an expression with NDV=1, use UNKNOWN_FILTER_COEFFICIENT instead of range-based filter factor.
  • For expressions with NDV>1, keep using 1 - calculateFilterFactor() as before.
  • When updating variable stats after <> filtering, keep NDV at 1 if original NDV was 1; otherwise decrease NDV by 1 but not below 0, and always set null fraction to 0.
presto-main-base/src/main/java/com/facebook/presto/cost/ComparisonStatsCalculator.java
Cap IN predicate selectivity and ensure NOT IN estimates never collapse to zero.
  • Introduce CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT (0.8) in FilterStatsCalculator to upper-bound row counts produced by IN predicate estimation relative to non-null input cardinality.
  • Use this coefficient when computing outputRowCount for both AST-based InPredicate and RowExpression-based estimateIn methods, taking min(inEstimate, nonNullCount * coefficient).
presto-main-base/src/main/java/com/facebook/presto/cost/FilterStatsCalculator.java
Add regression tests for IN/NOT IN and NDV=1 corner cases in filter stats.
  • Add ndv1Expressions and inList data providers to cover multiple predicate shapes and IN list sizes.
  • Add tests verifying IN predicate with unknown NDV but known null fraction yields a non-trivial filter (output between 0 and full non-null count, with specific expectations for 1-element and multi-element IN lists).
  • Add tests asserting NOT IN predicates never estimate zero output rows across various IN list sizes.
  • Add tests validating <> predicates on variables with NDV=1 use UNKNOWN_FILTER_COEFFICIENT for row count and preserve NDV and null fraction.
  • Adjust an existing IN predicate test to expect 720 instead of 900 rows, documenting that range estimate is capped by the IN coefficient.
presto-main-base/src/test/java/com/facebook/presto/cost/AbstractTestFilterStatsCalculator.java
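The IN cap described above, and the 900 to 720 change in the adjusted test, can be sketched as follows. This is a minimal illustration, not the code from the PR: the class, method, and constant names are hypothetical, and only the 0.8 value and the `min(inEstimate, nonNullCount * coefficient)` shape come from the reviewer's guide.

```java
public final class InCapSketch {
    // 0.8 is the value quoted in the reviewer's guide; this constant name is hypothetical.
    static final double IN_UPPER_BOUND_COEFFICIENT = 0.8;

    /**
     * Caps an IN-predicate row estimate at a fraction of the non-null input rows.
     *
     * @param inEstimate      row count produced by the uncapped IN estimation
     * @param nonNullRowCount input rows whose value is not null
     */
    static double capInEstimate(double inEstimate, double nonNullRowCount) {
        return Math.min(inEstimate, nonNullRowCount * IN_UPPER_BOUND_COEFFICIENT);
    }

    public static void main(String[] args) {
        // Assumed inputs: an uncapped estimate equal to all 900 non-null rows.
        System.out.println(capInEstimate(900.0, 900.0));
        // An estimate already below the cap passes through unchanged.
        System.out.println(capInEstimate(100.0, 900.0));
    }
}
```

With an uncapped estimate of 900 rows out of 900 non-null rows, the cap yields 900 * 0.8 = 720, which matches the new expectation in the adjusted test. The specific input figures are assumed for illustration.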

Possibly linked issues

  • #N/A: The PR introduces caps and tests so that IN never estimates the full non-null range and NOT IN never estimates zero rows, directly addressing the issue's scenarios.


@aaneja aaneja changed the title Filter stats bugfixes fix(planner): Fix filter stats estimation corner cases Dec 16, 2025
@aaneja aaneja marked this pull request as ready for review December 20, 2025 12:21
@prestodb-ci prestodb-ci requested review from a team, Dilli-Babu-Godari and Joe-Abraham and removed request for a team December 20, 2025 12:21
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new constant name CIEL_IN_PREDICATE_UPPER_BOUND_COEFFICIENT looks like a typo of CEIL; consider renaming it (and adjusting the comment wording) to avoid confusion about its intent.
  • The test method tesInPredicateWithoutNDV appears to be missing a t in test; renaming it would keep naming consistent and clearer.
  • In estimateExpressionNotEqualToLiteral, the NDV == 1 special case is handled in two places (filterFactor and newNDV); consider centralizing this logic or adding a brief comment tying the two together so the coupling is explicit for future maintainers.
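One way to make the coupling in the last point explicit is to derive both the filter factor and the new NDV from a single flag. The sketch below is an assumption-laden illustration, not the merged code: `NotEqualSketch`, `Estimate`, and all parameter names are hypothetical, the range-based factor is passed in as a plain number instead of calling `calculateFilterFactor`, and the unknown-filter coefficient is a parameter because its value is not stated in this conversation.

```java
public final class NotEqualSketch {
    /** Hypothetical result holder; the real code uses PlanNodeStatsEstimate and VariableStatsEstimate builders. */
    static final class Estimate {
        final double outputRowCount;
        final double distinctValuesCount;

        Estimate(double outputRowCount, double distinctValuesCount) {
            this.outputRowCount = outputRowCount;
            this.distinctValuesCount = distinctValuesCount;
        }
    }

    static Estimate estimateNotEqual(
            double inputRowCount,
            double nullsFraction,
            double ndv,
            double rangeFilterFactor,        // stands in for calculateFilterFactor(...)
            double unknownFilterCoefficient) // stands in for UNKNOWN_FILTER_COEFFICIENT
    {
        // Single flag drives both decisions, so they cannot drift apart.
        boolean singleDistinctValue = Double.compare(ndv, 1D) == 0;

        double filterFactor = singleDistinctValue
                ? unknownFilterCoefficient
                : 1 - rangeFilterFactor;
        // Keep NDV at 1 when it started at 1; otherwise drop one value, never below 0.
        double newNdv = singleDistinctValue ? 1 : Math.max(ndv - 1, 0);

        double rows = filterFactor * (1 - nullsFraction) * inputRowCount;
        return new Estimate(rows, newNdv);
    }

    public static void main(String[] args) {
        Estimate single = estimateNotEqual(1000, 0.0, 1, 1.0, 0.9);
        Estimate many = estimateNotEqual(1000, 0.0, 4, 0.25, 0.9);
        System.out.println(single.outputRowCount + " rows, NDV " + single.distinctValuesCount);
        System.out.println(many.outputRowCount + " rows, NDV " + many.distinctValuesCount);
    }
}
```

In the first call the range-based factor is 1, so the `1 - factor` form alone would estimate zero rows; the NDV = 1 branch sidesteps that. The 0.9 coefficient and the row counts are assumed inputs chosen for the example.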
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/cost/ComparisonStatsCalculator.java:110-119` </location>
<code_context>
+        double expressionNDV = expressionStatistics.getDistinctValuesCount();
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Use a consistent comparison approach for `expressionNDV` to avoid subtle floating-point discrepancies.

Here you mix `Double.compare(expressionNDV, 1D) == 0` with `expressionNDV == 1`. For doubles coming from stats, this can diverge in edge cases (`-0.0`, rounding, NaN). Please use a single, consistent pattern (e.g., `Double.compare(expressionNDV, 1D) == 0` or an epsilon-based check) everywhere `expressionNDV` is compared to 1.

Suggested implementation:

```java
        double filterFactor;
        double expressionNDV = expressionStatistics.getDistinctValuesCount();
        if (Double.compare(expressionNDV, 1D) == 0) {
            // It's hard to make a meaningful estimate when we have only one distinct value
            filterFactor = UNKNOWN_FILTER_COEFFICIENT;
        }
        else {
            filterFactor = 1 - calculateFilterFactor(expressionStatistics, filterRange);
        }

        PlanNodeStatsEstimate.Builder estimate = PlanNodeStatsEstimate.buildFrom(inputStatistics);
        estimate.setOutputRowCount(filterFactor * (1 - expressionStatistics.getNullsFraction()) * inputStatistics.getOutputRowCount());

```

```java
        if (Double.compare(expressionNDV, 1D) == 0) {

```

If there are other comparisons in this file (or related cost calculators) using `expressionNDV == 1`, `expressionNDV == 1.0`, or similar, they should also be updated to `Double.compare(expressionNDV, 1D) == 0` for consistency with this change.
</issue_to_address>
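For context on the suggestion above, here is a small self-contained check of where `Double.compare(a, b) == 0` and `a == b` actually disagree in Java. The class and method names are hypothetical. The two forms differ only for signed zeros and for NaN compared with NaN, so for a comparison against the literal 1 they return the same result on every sampled input; the value of the suggestion is then mainly consistency.

```java
public final class DoubleCompareSketch {
    static boolean equalsOneViaCompare(double x) {
        return Double.compare(x, 1D) == 0;
    }

    static boolean equalsOneViaOperator(double x) {
        return x == 1;
    }

    public static void main(String[] args) {
        double[] samples = {
            1.0, 0.0, -0.0, Double.NaN,
            Double.POSITIVE_INFINITY, 1.0 + Math.ulp(1.0)
        };
        for (double x : samples) {
            // Both forms agree for every sample when the other operand is 1.
            System.out.println(x + ": compare=" + equalsOneViaCompare(x)
                    + ", operator=" + equalsOneViaOperator(x));
        }

        // The general cases where the two forms differ:
        System.out.println("compare(0.0, -0.0) == 0: " + (Double.compare(0.0, -0.0) == 0));
        System.out.println("0.0 == -0.0: " + (0.0 == -0.0));
        System.out.println("compare(NaN, NaN) == 0: " + (Double.compare(Double.NaN, Double.NaN) == 0));
        System.out.println("NaN == NaN: " + (Double.NaN == Double.NaN));
    }
}
```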


@steveburnett
Contributor

Nit: when you can, edit the Release Notes section of the description to == NO RELEASE NOTE ==.

@aaneja
Contributor Author

aaneja commented Jan 15, 2026

@arhimondr Can you help review? I see from the git history that you've worked on the same classes.

Contributor

@tdcmeehan tdcmeehan left a comment


LGTM. @arhimondr please review if you have time.

@aaneja aaneja merged commit 93def8b into prestodb:master Feb 4, 2026
80 checks passed

Labels

from:IBM PR from IBM


4 participants