Return unknown equality estimate when NDV and range are unknown #29157
raunaqmorarka merged 1 commit into trinodb:master
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e1008e6699
Pull request overview
This PR adjusts Trino’s filter selectivity estimation so that StatisticRange’s infinite-to-infinite overlap heuristic (0.5) is no longer applied to point-equality-style predicates when column statistics are too incomplete (unknown NDV and an unbounded or unknown range). This prevents pathological row-count estimates, such as NOT IN collapsing to 0.
Changes:
- Add a guard in equality-to-literal estimation to return an unknown estimate when NDV is unknown and the column range is unbounded/unknown.
- Add a regression test ensuring NOT IN over such a column produces an unknown row-count estimate.
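The guard described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from ComparisonStatsCalculator: the class name `EqualityGuardSketch`, the method `statsTooIncomplete`, and the convention that NaN marks an unknown value are all assumptions made for this example.

```java
// Hypothetical sketch of the early-return condition; names are illustrative only.
final class EqualityGuardSketch {
    // Assumption: NaN represents an unknown statistic in this sketch.
    // Returns true when an equality-to-literal estimate should be reported as unknown:
    // the NDV is unknown and both ends of the range are unknown or infinite.
    static boolean statsTooIncomplete(double distinctValues, double low, double high) {
        boolean ndvUnknown = Double.isNaN(distinctValues);
        boolean lowUnbounded = Double.isNaN(low) || low == Double.NEGATIVE_INFINITY;
        boolean highUnbounded = Double.isNaN(high) || high == Double.POSITIVE_INFINITY;
        return ndvUnknown && lowUnbounded && highUnbounded;
    }

    public static void main(String[] args) {
        // Unknown NDV, unbounded range: the guard fires.
        System.out.println(statsTooIncomplete(
                Double.NaN, Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY));
        // Known NDV: the guard does not fire, so normal estimation can proceed.
        System.out.println(statsTooIncomplete(
                100.0, Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY));
    }
}
```

The sketch only fires when both ends of the range are open, since that is the case in which the infinite-to-infinite heuristic would otherwise apply.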
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| core/trino-main/src/main/java/io/trino/cost/ComparisonStatsCalculator.java | Introduces an early-return to avoid using infinite-range overlap heuristics for equality when NDV/range are unknown. |
| core/trino-main/src/test/java/io/trino/cost/TestFilterStatsCalculator.java | Adds a regression test covering NOT IN on a VARCHAR column with unknown NDV and unbounded range. |
e1008e6 to 90f1a9b
Description
On a column with unknown NDV and an unbounded range, StatisticRange.overlapPercentWith falls back to the infinite-to-infinite 0.5 heuristic, which is meant for range overlap, not point equality. Each equality was therefore estimated at 0.5 * non-null rows, so an IN list saturated at the full non-null row count and $not(IN) subtracted down to 0. This change returns an unknown estimate for such equalities instead.
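The arithmetic behind the collapse can be illustrated with a simplified model. The sketch below assumes that an IN list is estimated as the sum of its per-value equality estimates capped at the non-null row count, that NOT IN is the non-null row count minus the IN estimate, and that NaN stands for an unknown estimate. The class and method names are hypothetical and the row count of 1000 is made up for the example; none of this is Trino's actual code.

```java
// Hypothetical, simplified model of the row-count arithmetic; not Trino's implementation.
final class NotInCollapseSketch {
    // The infinite-to-infinite overlap factor mentioned in the description.
    static final double OVERLAP_HEURISTIC = 0.5;

    // Assumption: IN is the sum of per-value equality estimates, capped at the non-null rows.
    static double inListRows(double perEqualityRows, int listSize, double nonNullRows) {
        return Math.min(perEqualityRows * listSize, nonNullRows);
    }

    // Assumption: NOT IN is whatever remains of the non-null rows.
    static double notInRows(double inRows, double nonNullRows) {
        return nonNullRows - inRows;
    }

    public static void main(String[] args) {
        double nonNullRows = 1000;

        // Before: each equality gets 0.5 * 1000 = 500 rows, so two values already saturate IN.
        double before = notInRows(
                inListRows(OVERLAP_HEURISTIC * nonNullRows, 2, nonNullRows), nonNullRows);

        // After: an unknown (NaN) equality estimate propagates through min and subtraction.
        double after = notInRows(inListRows(Double.NaN, 2, nonNullRows), nonNullRows);

        System.out.println(before); // prints 0.0
        System.out.println(after);  // prints NaN
    }
}
```

In this model, an unknown equality estimate keeps NOT IN unknown rather than 0, which is the outcome the regression test is described as checking.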
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: