feat(planner): Update AggregationStatsRule to work for more aggregation shapes by aaneja · Pull Request #27215 · prestodb/presto

aaneja · 2026-02-26T12:49:40Z

Ported over from https://github.com/trinodb/trino/blob/060d9cbf4e584301b7a2f3faac14afe24ddcf53e/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java

Co-authored-by: Kamil Endruszkiewicz <kamil.endruszkiewicz@starburstdata.com>
Co-authored-by: @copilot (tests)

Description, Motivation and Context

AggregationStatsRule previously produced estimates only for SINGLE-step aggregations, so PARTIAL aggregation nodes broke stats propagation and led to poorer plans.

Impact

Stats estimates are now propagated through aggregation nodes for every step (PARTIAL, INTERMEDIATE, FINAL, and SINGLE).

Test Plan

New tests added in TestAggregationStatsRule.

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

Summary by Sourcery

Extend aggregation statistics estimation to support more aggregation shapes and planner steps, improving row count and confidence handling for aggregation nodes.

New Features:

Add statistics estimation for PARTIAL and INTERMEDIATE aggregation steps by forwarding source stats with no assumed reduction.
Handle global (no-key) aggregations by always estimating a single output row with FACT confidence.

Enhancements:

Refine group-by aggregation stats to account for grouping key distinct counts, null handling, and capping output row count to input rows across FINAL/SINGLE steps.
Factor out reusable helpers for group-by variable statistics and row-count computation in aggregation stats rule.
Replace repeated inline variable references in aggregation stats tests with shared constants for clarity and reuse.

Tests:

Expand aggregation stats test coverage to cover global, partial, intermediate, final, and multi-key grouping scenarios, including null handling and NDV vs. input row count edge cases.

sourcery-ai · 2026-02-26T12:49:46Z

Reviewer's Guide

Extends AggregationStatsRule to handle all aggregation steps and more grouping configurations, adds helper methods for row-count and group-by variable stats computation, and significantly broadens test coverage to validate stats propagation for global, partial, intermediate, final, and multi-key aggregations including edge cases with nulls and NDV capping.

Sequence diagram for updated AggregationStatsRule stats calculation

sequenceDiagram
    actor Optimizer
    participant AggregationStatsRule
    participant AggregationNode
    participant StatsProvider
    participant PlanNodeStatsEstimate

    Optimizer->>AggregationStatsRule: doCalculate(node, statsProvider, session, types)
    AggregationStatsRule->>AggregationNode: getStep()
    AggregationStatsRule->>StatsProvider: getStats(node.getSource())
    StatsProvider-->>AggregationStatsRule: sourceStats

    alt step is PARTIAL or INTERMEDIATE
        AggregationStatsRule->>AggregationStatsRule: partialGroupBy(sourceStats, groupingKeys, aggregations)
        AggregationStatsRule->>AggregationStatsRule: getGroupByVariablesStatistics(sourceStats, groupingKeys)
        AggregationStatsRule->>AggregationStatsRule: estimateAggregationStats(aggregation, sourceStats) *
        AggregationStatsRule-->>PlanNodeStatsEstimate: partialEstimate
    else step is SINGLE or FINAL
        AggregationStatsRule->>AggregationStatsRule: groupBy(sourceStats, groupingKeys, aggregations)
        alt groupingKeys is empty
            AggregationStatsRule->>PlanNodeStatsEstimate: setConfidence(FACT)
            AggregationStatsRule->>PlanNodeStatsEstimate: setOutputRowCount(1)
        else groupingKeys not empty
            AggregationStatsRule->>AggregationStatsRule: getGroupByVariablesStatistics(sourceStats, groupingKeys)
            AggregationStatsRule->>AggregationStatsRule: getRowsCount(sourceStats, groupingKeys)
            AggregationStatsRule->>PlanNodeStatsEstimate: setOutputRowCount(min(rowsCount, sourceStats.rowCount))
        end
        AggregationStatsRule->>AggregationStatsRule: estimateAggregationStats(aggregation, sourceStats) *
        AggregationStatsRule-->>PlanNodeStatsEstimate: finalEstimate
    end

    AggregationStatsRule-->>Optimizer: Optional.of(estimate)
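
The branching in the sequence diagram can be read as the row-count logic below. This is a minimal, self-contained sketch under stated assumptions, not the code from the PR: `StepDispatchSketch`, `Estimate`, and the `groupedRows` parameter are hypothetical stand-ins for `AggregationStatsRule`, `PlanNodeStatsEstimate`, and the result of `getRowsCount`, and confidence is reduced to a boolean `fact` flag.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class StepDispatchSketch {
    enum Step { PARTIAL, INTERMEDIATE, FINAL, SINGLE }

    /** Hypothetical, simplified stand-in for PlanNodeStatsEstimate. */
    static final class Estimate {
        final double outputRowCount;
        final boolean fact; // true stands in for FACT confidence

        Estimate(double outputRowCount, boolean fact) {
            this.outputRowCount = outputRowCount;
            this.fact = fact;
        }
    }

    /**
     * groupedRows is assumed to be the uncapped estimate derived from the
     * grouping keys' distinct-value counts (hypothetical parameter).
     */
    static Estimate estimate(Step step, double sourceRows, List<String> groupingKeys, double groupedRows) {
        if (step == Step.PARTIAL || step == Step.INTERMEDIATE) {
            // Pessimistic: forward the source row count, assuming no reduction.
            return new Estimate(sourceRows, false);
        }
        if (groupingKeys.isEmpty()) {
            // Global aggregation: exactly one output row, whatever the input is.
            return new Estimate(1, true);
        }
        // FINAL or SINGLE with keys: never estimate more rows than the input has.
        return new Estimate(Math.min(groupedRows, sourceRows), false);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("x");
        List<String> noKeys = Collections.emptyList();
        System.out.println(estimate(Step.PARTIAL, 100, keys, 10).outputRowCount); // 100.0
        System.out.println(estimate(Step.FINAL, 0, noKeys, 0).outputRowCount);    // 1.0
        System.out.println(estimate(Step.SINGLE, 100, keys, 250).outputRowCount); // 100.0
    }
}
```

The second call mirrors the zero-input-row global aggregation case: the estimate is still one row.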

Updated class diagram for AggregationStatsRule and related types

classDiagram
    class AggregationStatsRule {
        +Optional doCalculate(AggregationNode node, StatsProvider statsProvider, Session session, TypeProvider types)
        +static PlanNodeStatsEstimate groupBy(PlanNodeStatsEstimate sourceStats, Collection groupByVariables, Map aggregations)
        +static double getRowsCount(PlanNodeStatsEstimate sourceStats, Collection groupByVariables)
        -static PlanNodeStatsEstimate partialGroupBy(PlanNodeStatsEstimate sourceStats, Collection groupByVariables, Map aggregations)
        -static Map getGroupByVariablesStatistics(PlanNodeStatsEstimate sourceStats, Collection groupByVariables)
        -static VariableStatsEstimate estimateAggregationStats(Aggregation aggregation, PlanNodeStatsEstimate sourceStats)
    }

    class AggregationNode {
        +Step getStep()
        +PlanNode getSource()
        +Collection getGroupingKeys()
        +Map getAggregations()
    }

    class PlanNodeStatsEstimate {
        +double getOutputRowCount()
        +VariableStatsEstimate getVariableStatistics(VariableReferenceExpression variable)
        +Builder builder()
    }

    class PlanNodeStatsEstimate.Builder {
        +PlanNodeStatsEstimate.Builder setConfidence(double confidence)
        +PlanNodeStatsEstimate.Builder setOutputRowCount(double rowCount)
        +PlanNodeStatsEstimate.Builder addVariableStatistics(VariableReferenceExpression variable, VariableStatsEstimate stats)
        +PlanNodeStatsEstimate build()
    }

    class VariableStatsEstimate {
        +double getNullsFraction()
        +double getDistinctValuesCount()
        +VariableStatsEstimate mapNullsFraction(Function mapper)
        +static VariableStatsEstimate unknown()
    }

    class Aggregation {
    }

    class VariableReferenceExpression {
    }

    class StatsProvider {
        +PlanNodeStatsEstimate getStats(PlanNode node)
    }

    class Session {
    }

    class TypeProvider {
    }

    class PlanNode {
    }

    class Step {
        <<enumeration>>
        PARTIAL
        INTERMEDIATE
        FINAL
        SINGLE
    }

    AggregationStatsRule --> AggregationNode : uses
    AggregationStatsRule --> StatsProvider : uses
    AggregationStatsRule --> PlanNodeStatsEstimate : produces
    AggregationStatsRule --> VariableStatsEstimate : computes
    AggregationStatsRule --> Aggregation : forAggregationStats
    AggregationStatsRule --> VariableReferenceExpression : groupByKeys
    AggregationNode --> Step : hasStep
    StatsProvider --> PlanNode : statsFor
    PlanNodeStatsEstimate --> VariableStatsEstimate : contains
    PlanNodeStatsEstimate.Builder --> PlanNodeStatsEstimate : builds
    VariableStatsEstimate --> VariableReferenceExpression : describesStatsFor

Flow diagram for AggregationStatsRule aggregation step handling

flowchart TD
    Start([Start doCalculate])
    Step["Read aggregation step from AggregationNode"]
    CheckStep{Step is PARTIAL or INTERMEDIATE?}
    Partial["Call partialGroupBy(sourceStats, groupingKeys, aggregations)"]
    FinalOrSingle["Call groupBy(sourceStats, groupingKeys, aggregations)"]
    CheckGlobal{groupingKeys empty?}
    GlobalAgg["Set confidence FACT, set outputRowCount 1"]
    NonGlobalAgg["Compute group-by stats and rowsCount via getGroupByVariablesStatistics and getRowsCount"]
    AggStats["For each aggregation estimateAggregationStats"]
    BuildEstimate["Build PlanNodeStatsEstimate and wrap in Optional"]
    End([Return estimate])

    Start --> Step --> CheckStep
    CheckStep -->|Yes| Partial --> AggStats --> BuildEstimate --> End
    CheckStep -->|No| FinalOrSingle --> CheckGlobal
    CheckGlobal -->|Yes| GlobalAgg --> AggStats --> BuildEstimate --> End
    CheckGlobal -->|No| NonGlobalAgg --> AggStats --> BuildEstimate --> End

File-Level Changes

Change	Details	Files
Extend aggregation stats calculation to support PARTIAL and INTERMEDIATE steps and refactor group-by logic.	Replace early return that limited stats calculation to SINGLE step with branching that calls either partialGroupBy or groupBy based on aggregation step. Introduce partialGroupBy to pessimistically forward source row count and grouping key stats for PARTIAL and INTERMEDIATE aggregations while still estimating aggregation outputs. Refactor existing groupBy implementation to use helper methods for row-count computation and grouping key stats, and always estimate stats for aggregation outputs.	`presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java`
Add reusable helpers for grouping key stats and row-count estimation based on distinct values and null fractions.	Add getRowsCount helper that multiplies per-key (NDV + possible null group) to estimate grouped output row count, leaving capping to the caller. Add getGroupByVariablesStatistics to copy grouping key stats from the source while remapping nullsFraction to 0 when there are no nulls, or to 1 / (NDV + 1) when there are nulls. Remove the now-redundant isGlobalAggregation helper in favor of direct isEmpty checks on grouping keys.	`presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java`
Handle global (no-key) aggregations explicitly as single-row outputs with FACT confidence.	Update groupBy so that when there are no grouping keys it sets outputRowCount to 1 and confidence to FACT regardless of source stats. Ensure aggregation output variables still receive estimated (currently unknown) stats even for global aggregations.	`presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java`
Broaden unit test coverage for aggregation stats across multiple shapes, steps, and edge cases.	Introduce shared VariableReferenceExpression constants for x, y, and z to reduce duplication in tests. Add tests for global aggregations with non-zero and zero input rows, verifying single-row output and FACT confidence. Add tests for PARTIAL and INTERMEDIATE aggregations to ensure row counts are preserved and grouping key stats are forwarded. Add tests for SINGLE and FINAL aggregations with single and multiple grouping keys, covering null-handling, NDV-based row-count computation, capping to input rows, and behavior when grouping key stats are unknown. Adjust existing tests to use the new variable constants while preserving prior assertions on z and y stats and row-count capping behavior.	`presto-main-base/src/test/java/com/facebook/presto/cost/TestAggregationStatsRule.java`
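
The row-count and null-fraction arithmetic described in the table can be worked through with a small example. This is a sketch only, assuming the behaviour is exactly as the table states it: `GroupByRowCountSketch` and its method names are hypothetical, per-key stats are passed as plain arrays rather than `VariableStatsEstimate` objects, and unknown (NaN) stats are not modelled. The intuition for the remapping is that all null inputs collapse into one output group, so a key with nulls yields NDV + 1 groups of which one is the null group.

```java
final class GroupByRowCountSketch {
    /**
     * Uncapped grouped row count: the product over grouping keys of
     * (NDV + 1) when the key has nulls, or NDV when it has none.
     * Capping to the input row count is left to the caller.
     */
    static double uncappedRows(double[] distinctValues, double[] nullsFractions) {
        double rows = 1;
        for (int i = 0; i < distinctValues.length; i++) {
            int nullGroup = (nullsFractions[i] == 0.0) ? 0 : 1;
            rows *= distinctValues[i] + nullGroup;
        }
        return rows;
    }

    /**
     * Nulls fraction of a grouping key in the aggregation output:
     * 0 when the input has no nulls, else 1 / (NDV + 1).
     */
    static double outputNullsFraction(double distinctValues, double nullsFraction) {
        if (nullsFraction == 0.0) {
            return 0.0;
        }
        return 1.0 / (distinctValues + 1);
    }

    public static void main(String[] args) {
        // Key x: NDV 10, no nulls. Key y: NDV 3, 20% nulls.
        double[] ndv = {10, 3};
        double[] nulls = {0.0, 0.2};
        double sourceRows = 30;

        double uncapped = uncappedRows(ndv, nulls);       // 10 * (3 + 1)
        double capped = Math.min(uncapped, sourceRows);   // cannot exceed the input
        System.out.println(uncapped);                     // 40.0
        System.out.println(capped);                       // 30.0
        System.out.println(outputNullsFraction(3, 0.2));  // 0.25
    }
}
```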

sourcery-ai

Hey - I've left some high level feedback:

Consider making getRowsCount package-private or private instead of public, since it’s currently only used within AggregationStatsRule and exposing it widens the API surface unnecessarily.
In partialGroupBy, you might want to propagate the source stats confidence level (e.g., result.setConfidence(sourceStats.getConfidence())) to keep the forwarded estimates consistent with the input rather than relying on the default.
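
The second suggestion can be illustrated with a self-contained sketch. Everything here is hypothetical: `Confidence`, `Stats`, and both `forward...` methods are stand-ins written for this example, and the choice of `LOW` as the builder default is an assumption, not something taken from the Presto code.

```java
final class ConfidencePropagationSketch {
    enum Confidence { LOW, HIGH, FACT }

    /** Hypothetical stand-in for a stats estimate: row count plus confidence. */
    static final class Stats {
        final double rowCount;
        final Confidence confidence;

        Stats(double rowCount, Confidence confidence) {
            this.rowCount = rowCount;
            this.confidence = confidence;
        }
    }

    // Assumption for this sketch: a fresh estimate starts at LOW confidence.
    static final Confidence DEFAULT_CONFIDENCE = Confidence.LOW;

    /** Forwards the row count but falls back to the default confidence. */
    static Stats forwardWithDefault(Stats source) {
        return new Stats(source.rowCount, DEFAULT_CONFIDENCE);
    }

    /** Forwards both the row count and the confidence, as the review suggests. */
    static Stats forwardWithSourceConfidence(Stats source) {
        return new Stats(source.rowCount, source.confidence);
    }

    public static void main(String[] args) {
        Stats source = new Stats(1000, Confidence.HIGH);
        System.out.println(forwardWithDefault(source).confidence);          // LOW
        System.out.println(forwardWithSourceConfidence(source).confidence); // HIGH
    }
}
```

With the first variant, a PARTIAL aggregation over a high-confidence source would report a lower confidence than its input even though the row count is forwarded unchanged.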

aditi-pandit

Thanks @aaneja. Minor comment.

presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java

aditi-pandit

Thanks @aaneja

aaneja requested review from a team, feilong-liu and jaystarshot as code owners February 26, 2026 12:49

prestodb-ci added the from:IBM (PR from IBM) label Feb 26, 2026

prestodb-ci requested review from a team, jkhaliqi and jp-sivaprasad and removed request for a team February 26, 2026 12:49

sourcery-ai bot reviewed Feb 26, 2026

aaneja requested a review from elharo as a code owner February 27, 2026 04:26

aaneja force-pushed the improveStatsEstimations branch from abf1e32 to 5398a9f February 27, 2026 06:20

aditi-pandit reviewed Mar 6, 2026

presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java (outdated, resolved)

aaneja added 3 commits March 10, 2026 08:11

feat(planner): Update AggregationStatsRule to work for more aggregation shapes

91079ea

Ported over from https://github.com/trinodb/trino/blob/060d9cbf4e584301b7a2f3faac14afe24ddcf53e/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java

Fix HBO tests that use Aggregation

671fcd7

Minor refactor

8cba327

aaneja force-pushed the improveStatsEstimations branch from 5398a9f to 8cba327 March 10, 2026 03:22

aditi-pandit approved these changes Mar 10, 2026

tdcmeehan approved these changes Mar 11, 2026

aaneja merged commit b7142a3 into prestodb:master Mar 11, 2026
81 checks passed

aaneja deleted the improveStatsEstimations branch March 11, 2026 03:29

This was referenced Mar 31, 2026

docs: Add release notes for 0.297 unix280/presto#51

Closed

docs: Add release notes for 0.297 unix280/presto#52

Open

prestodb-ci mentioned this pull request Apr 1, 2026

docs: Add release notes for 0.297 #27484

Open

15 tasks

aaneja merged 3 commits into prestodb:master from aaneja:improveStatsEstimations
