
feat(planner): Update AggregationStatsRule to work for more aggregation shapes #27215

Merged
aaneja merged 3 commits into prestodb:master from aaneja:improveStatsEstimations on Mar 11, 2026

@aaneja aaneja commented Feb 26, 2026

Ported over from https://github.com/trinodb/trino/blob/060d9cbf4e584301b7a2f3faac14afe24ddcf53e/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java

Co-authored-by: Kamil Endruszkiewicz <kamil.endruszkiewicz@starburstdata.com>
Co-authored-by: @copilot (tests)

Description, Motivation and Context

AggregationStatsRule previously produced estimates only for SINGLE-step aggregations, so PARTIAL aggregation nodes broke stats propagation and led to poorer plans.

Impact

Stats estimates are now propagated through Aggregation nodes for every aggregation step.

Test Plan

New tests added to TestAggregationStatsRule.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

Summary by Sourcery

Extend aggregation statistics estimation to support more aggregation shapes and planner steps, improving row count and confidence handling for aggregation nodes.

New Features:

  • Add statistics estimation for PARTIAL and INTERMEDIATE aggregation steps by forwarding source stats with no assumed reduction.
  • Handle global (no-key) aggregations by always estimating a single output row with FACT confidence.

Enhancements:

  • Refine group-by aggregation stats to account for grouping key distinct counts, null handling, and capping output row count to input rows across FINAL/SINGLE steps.
  • Factor out reusable helpers for group-by variable statistics and row-count computation in aggregation stats rule.
  • Replace repeated inline variable references in aggregation stats tests with shared constants for clarity and reuse.
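The group-by row-count refinement can be sketched roughly as follows. This is only an illustration of the rule as the reviewer's guide in this thread describes it (multiply, per grouping key, the distinct-value count plus one extra group when the key may be null, then cap at the input row count). `GroupByRowCountSketch` and `KeyStats` are hypothetical names, not the Presto classes, and treating any non-zero nulls fraction as "may be null" is an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class GroupByRowCountSketch
{
    // Hypothetical stand-in for per-variable statistics; not the Presto class.
    static final class KeyStats
    {
        final double distinctValuesCount;
        final double nullsFraction;

        KeyStats(double distinctValuesCount, double nullsFraction)
        {
            this.distinctValuesCount = distinctValuesCount;
            this.nullsFraction = nullsFraction;
        }
    }

    private GroupByRowCountSketch() {}

    // Product over keys of (NDV + 1 if the key may contain nulls); no cap applied here.
    static double uncappedGroupCount(List<KeyStats> keys)
    {
        double groups = 1;
        for (KeyStats key : keys) {
            // Assumption: a nulls fraction other than exactly 0 adds one null group.
            int nullGroup = (key.nullsFraction == 0.0) ? 0 : 1;
            groups *= key.distinctValuesCount + nullGroup;
        }
        return groups;
    }

    // The caller caps the product at the number of input rows.
    static double estimateOutputRows(double inputRows, List<KeyStats> keys)
    {
        return Math.min(uncappedGroupCount(keys), inputRows);
    }

    public static void main(String[] args)
    {
        List<KeyStats> keys = Arrays.asList(new KeyStats(10, 0.1), new KeyStats(3, 0.0));
        System.out.println(estimateOutputRows(100, keys)); // (10 + 1) * 3 = 33.0
        System.out.println(estimateOutputRows(20, keys));  // capped at the input: 20.0
    }
}
```

Keeping the cap in the caller, as the guide describes, leaves the uncapped product available to any other code that wants the raw group-count estimate.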

Tests:

  • Expand aggregation stats test coverage to cover global, partial, intermediate, final, and multi-key grouping scenarios, including null handling and NDV vs. input row count edge cases.

@aaneja aaneja requested review from a team, feilong-liu and jaystarshot as code owners February 26, 2026 12:49
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Feb 26, 2026
@prestodb-ci prestodb-ci requested review from a team, jkhaliqi and jp-sivaprasad and removed request for a team February 26, 2026 12:49

sourcery-ai bot commented Feb 26, 2026

Reviewer's Guide

Extends AggregationStatsRule to handle all aggregation steps and more grouping configurations, adds helper methods for row-count and group-by variable stats computation, and significantly broadens test coverage to validate stats propagation for global, partial, intermediate, final, and multi-key aggregations including edge cases with nulls and NDV capping.

Sequence diagram for updated AggregationStatsRule stats calculation

sequenceDiagram
    actor Optimizer
    participant AggregationStatsRule
    participant AggregationNode
    participant StatsProvider
    participant PlanNodeStatsEstimate

    Optimizer->>AggregationStatsRule: doCalculate(node, statsProvider, session, types)
    AggregationStatsRule->>AggregationNode: getStep()
    AggregationStatsRule->>StatsProvider: getStats(node.getSource())
    StatsProvider-->>AggregationStatsRule: sourceStats

    alt step is PARTIAL or INTERMEDIATE
        AggregationStatsRule->>AggregationStatsRule: partialGroupBy(sourceStats, groupingKeys, aggregations)
        AggregationStatsRule->>AggregationStatsRule: getGroupByVariablesStatistics(sourceStats, groupingKeys)
        AggregationStatsRule->>AggregationStatsRule: estimateAggregationStats(aggregation, sourceStats) *
        AggregationStatsRule-->>PlanNodeStatsEstimate: partialEstimate
    else step is SINGLE or FINAL
        AggregationStatsRule->>AggregationStatsRule: groupBy(sourceStats, groupingKeys, aggregations)
        alt groupingKeys is empty
            AggregationStatsRule->>PlanNodeStatsEstimate: setConfidence(FACT)
            AggregationStatsRule->>PlanNodeStatsEstimate: setOutputRowCount(1)
        else groupingKeys not empty
            AggregationStatsRule->>AggregationStatsRule: getGroupByVariablesStatistics(sourceStats, groupingKeys)
            AggregationStatsRule->>AggregationStatsRule: getRowsCount(sourceStats, groupingKeys)
            AggregationStatsRule->>PlanNodeStatsEstimate: setOutputRowCount(min(rowsCount, sourceStats.rowCount))
        end
        AggregationStatsRule->>AggregationStatsRule: estimateAggregationStats(aggregation, sourceStats) *
        AggregationStatsRule-->>PlanNodeStatsEstimate: finalEstimate
    end

    AggregationStatsRule-->>Optimizer: Optional.of(estimate)

Updated class diagram for AggregationStatsRule and related types

classDiagram
    class AggregationStatsRule {
        +Optional doCalculate(AggregationNode node, StatsProvider statsProvider, Session session, TypeProvider types)
        +static PlanNodeStatsEstimate groupBy(PlanNodeStatsEstimate sourceStats, Collection groupByVariables, Map aggregations)
        +static double getRowsCount(PlanNodeStatsEstimate sourceStats, Collection groupByVariables)
        -static PlanNodeStatsEstimate partialGroupBy(PlanNodeStatsEstimate sourceStats, Collection groupByVariables, Map aggregations)
        -static Map getGroupByVariablesStatistics(PlanNodeStatsEstimate sourceStats, Collection groupByVariables)
        -static VariableStatsEstimate estimateAggregationStats(Aggregation aggregation, PlanNodeStatsEstimate sourceStats)
    }

    class AggregationNode {
        +Step getStep()
        +PlanNode getSource()
        +Collection getGroupingKeys()
        +Map getAggregations()
    }

    class PlanNodeStatsEstimate {
        +double getOutputRowCount()
        +VariableStatsEstimate getVariableStatistics(VariableReferenceExpression variable)
        +Builder builder()
    }

    class PlanNodeStatsEstimate.Builder {
        +PlanNodeStatsEstimate.Builder setConfidence(ConfidenceLevel confidence)
        +PlanNodeStatsEstimate.Builder setOutputRowCount(double rowCount)
        +PlanNodeStatsEstimate.Builder addVariableStatistics(VariableReferenceExpression variable, VariableStatsEstimate stats)
        +PlanNodeStatsEstimate build()
    }

    class VariableStatsEstimate {
        +double getNullsFraction()
        +double getDistinctValuesCount()
        +VariableStatsEstimate mapNullsFraction(Function mapper)
        +static VariableStatsEstimate unknown()
    }

    class Aggregation {
    }

    class VariableReferenceExpression {
    }

    class StatsProvider {
        +PlanNodeStatsEstimate getStats(PlanNode node)
    }

    class Session {
    }

    class TypeProvider {
    }

    class PlanNode {
    }

    class Step {
        <<enumeration>>
        PARTIAL
        INTERMEDIATE
        FINAL
        SINGLE
    }

    AggregationStatsRule --> AggregationNode : uses
    AggregationStatsRule --> StatsProvider : uses
    AggregationStatsRule --> PlanNodeStatsEstimate : produces
    AggregationStatsRule --> VariableStatsEstimate : computes
    AggregationStatsRule --> Aggregation : forAggregationStats
    AggregationStatsRule --> VariableReferenceExpression : groupByKeys
    AggregationNode --> Step : hasStep
    StatsProvider --> PlanNode : statsFor
    PlanNodeStatsEstimate --> VariableStatsEstimate : contains
    PlanNodeStatsEstimate.Builder --> PlanNodeStatsEstimate : builds
    VariableStatsEstimate --> VariableReferenceExpression : describesStatsFor

Flow diagram for AggregationStatsRule aggregation step handling

flowchart TD
    Start([Start doCalculate])
    Step["Read aggregation step from AggregationNode"]
    CheckStep{Step is PARTIAL or INTERMEDIATE?}
    Partial["Call partialGroupBy(sourceStats, groupingKeys, aggregations)"]
    FinalOrSingle["Call groupBy(sourceStats, groupingKeys, aggregations)"]
    CheckGlobal{groupingKeys empty?}
    GlobalAgg["Set confidence FACT, set outputRowCount 1"]
    NonGlobalAgg["Compute group-by stats and rowsCount via getGroupByVariablesStatistics and getRowsCount"]
    AggStats["For each aggregation: estimateAggregationStats"]
    BuildEstimate["Build PlanNodeStatsEstimate and wrap in Optional"]
    End([Return estimate])

    Start --> Step --> CheckStep
    CheckStep -->|Yes| Partial --> AggStats --> BuildEstimate --> End
    CheckStep -->|No| FinalOrSingle --> CheckGlobal
    CheckGlobal -->|Yes| GlobalAgg --> AggStats --> BuildEstimate --> End
    CheckGlobal -->|No| NonGlobalAgg --> AggStats --> BuildEstimate --> End
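
As a rough illustration of the branching in the flowchart above, the following Java sketch reduces the rule to row counts only. The `Step` values mirror the four steps named in this thread. `AggregationStepSketch`, `Estimate`, and the boolean `fact` flag (a stand-in for FACT confidence) are hypothetical names used for illustration, not Presto classes, and the uncapped group count is passed in rather than derived from column statistics.

```java
final class AggregationStepSketch
{
    enum Step { PARTIAL, INTERMEDIATE, FINAL, SINGLE }

    // Hypothetical, simplified result: a row count plus a flag standing in for FACT confidence.
    static final class Estimate
    {
        final double rowCount;
        final boolean fact;

        Estimate(double rowCount, boolean fact)
        {
            this.rowCount = rowCount;
            this.fact = fact;
        }
    }

    private AggregationStepSketch() {}

    static Estimate estimate(Step step, double sourceRows, boolean hasGroupingKeys, double uncappedGroupCount)
    {
        if (step == Step.PARTIAL || step == Step.INTERMEDIATE) {
            // Forward the source row count: no reduction is assumed for these steps.
            return new Estimate(sourceRows, false);
        }
        if (!hasGroupingKeys) {
            // A global aggregation always yields exactly one row.
            return new Estimate(1, true);
        }
        // FINAL or SINGLE with keys: never estimate more groups than input rows.
        return new Estimate(Math.min(uncappedGroupCount, sourceRows), false);
    }

    public static void main(String[] args)
    {
        System.out.println(estimate(Step.PARTIAL, 1000, true, 50).rowCount); // 1000.0
        System.out.println(estimate(Step.SINGLE, 1000, false, 50).rowCount); // 1.0
        System.out.println(estimate(Step.FINAL, 1000, true, 50).rowCount);   // 50.0
    }
}
```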

File-Level Changes

Change: Extend aggregation stats calculation to support PARTIAL and INTERMEDIATE steps and refactor group-by logic.
Details:
  • Replace early return that limited stats calculation to SINGLE step with branching that calls either partialGroupBy or groupBy based on aggregation step.
  • Introduce partialGroupBy to pessimistically forward source row count and grouping key stats for PARTIAL and INTERMEDIATE aggregations while still estimating aggregation outputs.
  • Refactor existing groupBy implementation to use helper methods for row-count computation and grouping key stats, and always estimate stats for aggregation outputs.
Files: presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java

Change: Add reusable helpers for grouping key stats and row-count estimation based on distinct values and null fractions.
Details:
  • Add getRowsCount helper that multiplies per-key (NDV + possible null group) to estimate grouped output row count, leaving capping to the caller.
  • Add getGroupByVariablesStatistics to copy grouping key stats from the source while remapping nullsFraction to 0 when there are no nulls, or to 1 / (NDV + 1) when there are nulls.
  • Remove the now-redundant isGlobalAggregation helper in favor of direct isEmpty checks on grouping keys.
Files: presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java

Change: Handle global (no-key) aggregations explicitly as single-row outputs with FACT confidence.
Details:
  • Update groupBy so that when there are no grouping keys it sets outputRowCount to 1 and confidence to FACT regardless of source stats.
  • Ensure aggregation output variables still receive estimated (currently unknown) stats even for global aggregations.
Files: presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java

Change: Broaden unit test coverage for aggregation stats across multiple shapes, steps, and edge cases.
Details:
  • Introduce shared VariableReferenceExpression constants for x, y, and z to reduce duplication in tests.
  • Add tests for global aggregations with non-zero and zero input rows, verifying single-row output and FACT confidence.
  • Add tests for PARTIAL and INTERMEDIATE aggregations to ensure row counts are preserved and grouping key stats are forwarded.
  • Add tests for SINGLE and FINAL aggregations with single and multiple grouping keys, covering null-handling, NDV-based row-count computation, capping to input rows, and behavior when grouping key stats are unknown.
  • Adjust existing tests to use the new variable constants while preserving prior assertions on z and y stats and row-count capping behavior.
Files: presto-main-base/src/test/java/com/facebook/presto/cost/TestAggregationStatsRule.java
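
The nullsFraction remapping listed for getGroupByVariablesStatistics can be illustrated with a small Java sketch. The intuition is that after grouping on one key there are NDV non-null groups plus at most one null group, so the null group makes up 1 / (NDV + 1) of the output rows. `GroupingKeyNullsSketch` and `remapNullsFraction` are hypothetical names for illustration, and the exact comparison against zero is an assumption rather than the merged code.

```java
final class GroupingKeyNullsSketch
{
    private GroupingKeyNullsSketch() {}

    // Nulls fraction of a grouping key in the aggregation output, given its source stats.
    static double remapNullsFraction(double sourceNullsFraction, double distinctValuesCount)
    {
        if (sourceNullsFraction == 0.0) {
            // No nulls in the input means no null group in the output.
            return 0.0;
        }
        // All input nulls collapse into a single group next to the NDV non-null groups.
        return 1.0 / (distinctValuesCount + 1);
    }

    public static void main(String[] args)
    {
        System.out.println(remapNullsFraction(0.0, 4));  // 0.0
        System.out.println(remapNullsFraction(0.25, 4)); // 1 / (4 + 1) = 0.2
    }
}
```

The source nulls fraction only decides whether a null group exists; its magnitude does not affect the remapped value.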


@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • Consider making getRowsCount package-private or private instead of public, since it’s currently only used within AggregationStatsRule and exposing it widens the API surface unnecessarily.
  • In partialGroupBy, you might want to propagate the source stats confidence level (e.g., result.setConfidence(sourceStats.getConfidence())) to keep the forwarded estimates consistent with the input rather than relying on the default.

@aaneja aaneja requested a review from elharo as a code owner February 27, 2026 04:26
@aaneja aaneja force-pushed the improveStatsEstimations branch from abf1e32 to 5398a9f Compare February 27, 2026 06:20

@aditi-pandit aditi-pandit left a comment


Thanks @aaneja. Minor comment.

@aaneja aaneja force-pushed the improveStatsEstimations branch from 5398a9f to 8cba327 Compare March 10, 2026 03:22

@aditi-pandit aditi-pandit left a comment


Thanks @aaneja

@aaneja aaneja merged commit b7142a3 into prestodb:master Mar 11, 2026
81 checks passed
@aaneja aaneja deleted the improveStatsEstimations branch March 11, 2026 03:29

Labels

from:IBM PR from IBM

4 participants