feat(planner): Update AggregationStatsRule to work for more aggregation shapes#27215
Merged
aaneja merged 3 commits intoprestodb:masterfrom Mar 11, 2026
Merged
feat(planner): Update AggregationStatsRule to work for more aggregation shapes#27215aaneja merged 3 commits intoprestodb:masterfrom
aaneja merged 3 commits intoprestodb:masterfrom
Conversation
Contributor
Reviewer's GuideExtends AggregationStatsRule to handle all aggregation steps and more grouping configurations, adds helper methods for row-count and group-by variable stats computation, and significantly broadens test coverage to validate stats propagation for global, partial, intermediate, final, and multi-key aggregations including edge cases with nulls and NDV capping. Sequence diagram for updated AggregationStatsRule stats calculationsequenceDiagram
actor Optimizer
participant AggregationStatsRule
participant AggregationNode
participant StatsProvider
participant PlanNodeStatsEstimate
Optimizer->>AggregationStatsRule: doCalculate(node, statsProvider, session, types)
AggregationStatsRule->>AggregationNode: getStep()
AggregationStatsRule->>StatsProvider: getStats(node.getSource())
StatsProvider-->>AggregationStatsRule: sourceStats
alt step is PARTIAL or INTERMEDIATE
AggregationStatsRule->>AggregationStatsRule: partialGroupBy(sourceStats, groupingKeys, aggregations)
AggregationStatsRule->>AggregationStatsRule: getGroupByVariablesStatistics(sourceStats, groupingKeys)
AggregationStatsRule->>AggregationStatsRule: estimateAggregationStats(aggregation, sourceStats) *
AggregationStatsRule-->>PlanNodeStatsEstimate: partialEstimate
else step is SINGLE or FINAL
AggregationStatsRule->>AggregationStatsRule: groupBy(sourceStats, groupingKeys, aggregations)
alt groupingKeys is empty
AggregationStatsRule->>PlanNodeStatsEstimate: setConfidence(FACT)
AggregationStatsRule->>PlanNodeStatsEstimate: setOutputRowCount(1)
else groupingKeys not empty
AggregationStatsRule->>AggregationStatsRule: getGroupByVariablesStatistics(sourceStats, groupingKeys)
AggregationStatsRule->>AggregationStatsRule: getRowsCount(sourceStats, groupingKeys)
AggregationStatsRule->>PlanNodeStatsEstimate: setOutputRowCount(min(rowsCount, sourceStats.rowCount))
end
AggregationStatsRule->>AggregationStatsRule: estimateAggregationStats(aggregation, sourceStats) *
AggregationStatsRule-->>PlanNodeStatsEstimate: finalEstimate
end
AggregationStatsRule-->>Optimizer: Optional.of(estimate)
Updated class diagram for AggregationStatsRule and related typesclassDiagram
class AggregationStatsRule {
+Optional doCalculate(AggregationNode node, StatsProvider statsProvider, Session session, TypeProvider types)
+static PlanNodeStatsEstimate groupBy(PlanNodeStatsEstimate sourceStats, Collection groupByVariables, Map aggregations)
+static double getRowsCount(PlanNodeStatsEstimate sourceStats, Collection groupByVariables)
-static PlanNodeStatsEstimate partialGroupBy(PlanNodeStatsEstimate sourceStats, Collection groupByVariables, Map aggregations)
-static Map getGroupByVariablesStatistics(PlanNodeStatsEstimate sourceStats, Collection groupByVariables)
-static VariableStatsEstimate estimateAggregationStats(Aggregation aggregation, PlanNodeStatsEstimate sourceStats)
}
class AggregationNode {
+Step getStep()
+PlanNode getSource()
+Collection getGroupingKeys()
+Map getAggregations()
}
class PlanNodeStatsEstimate {
+double getOutputRowCount()
+VariableStatsEstimate getVariableStatistics(VariableReferenceExpression variable)
+Builder builder()
}
class PlanNodeStatsEstimate.Builder {
+PlanNodeStatsEstimate.Builder setConfidence(double confidence)
+PlanNodeStatsEstimate.Builder setOutputRowCount(double rowCount)
+PlanNodeStatsEstimate.Builder addVariableStatistics(VariableReferenceExpression variable, VariableStatsEstimate stats)
+PlanNodeStatsEstimate build()
}
class VariableStatsEstimate {
+double getNullsFraction()
+double getDistinctValuesCount()
+VariableStatsEstimate mapNullsFraction(Function mapper)
+static VariableStatsEstimate unknown()
}
class Aggregation {
}
class VariableReferenceExpression {
}
class StatsProvider {
+PlanNodeStatsEstimate getStats(PlanNode node)
}
class Session {
}
class TypeProvider {
}
class PlanNode {
}
class Step {
<<enumeration>>
PARTIAL
INTERMEDIATE
FINAL
SINGLE
}
AggregationStatsRule --> AggregationNode : uses
AggregationStatsRule --> StatsProvider : uses
AggregationStatsRule --> PlanNodeStatsEstimate : produces
AggregationStatsRule --> VariableStatsEstimate : computes
AggregationStatsRule --> Aggregation : forAggregationStats
AggregationStatsRule --> VariableReferenceExpression : groupByKeys
AggregationNode --> Step : hasStep
StatsProvider --> PlanNode : statsFor
PlanNodeStatsEstimate --> VariableStatsEstimate : contains
PlanNodeStatsEstimate.Builder --> PlanNodeStatsEstimate : builds
VariableStatsEstimate --> VariableReferenceExpression : describesStatsFor
Flow diagram for AggregationStatsRule aggregation step handlingflowchart TD
Start([Start doCalculate])
Step["Read aggregation step from AggregationNode"]
CheckStep{Step is
PARTIAL or INTERMEDIATE?}
Partial["Call partialGroupBy(sourceStats,
groupingKeys, aggregations)"]
FinalOrSingle["Call groupBy(sourceStats,
groupingKeys, aggregations)"]
CheckGlobal{groupingKeys empty?}
GlobalAgg["Set confidence FACT
Set outputRowCount 1"]
NonGlobalAgg["Compute group-by stats and
rowsCount via
getGroupByVariablesStatistics
and getRowsCount"]
AggStats["For each aggregation
estimateAggregationStats"]
BuildEstimate["Build PlanNodeStatsEstimate
and wrap in Optional"]
End([Return estimate])
Start --> Step --> CheckStep
CheckStep -->|Yes| Partial --> AggStats --> BuildEstimate --> End
CheckStep -->|No| FinalOrSingle --> CheckGlobal
CheckGlobal -->|Yes| GlobalAgg --> AggStats --> BuildEstimate --> End
CheckGlobal -->|No| NonGlobalAgg --> AggStats --> BuildEstimate --> End
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Consider making
getRowsCountpackage-private or private instead of public, since it’s currently only used withinAggregationStatsRuleand exposing it widens the API surface unnecessarily. - In
partialGroupBy, you might want to propagate the source stats confidence level (e.g.,result.setConfidence(sourceStats.getConfidence())) to keep the forwarded estimates consistent with the input rather than relying on the default.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider making `getRowsCount` package-private or private instead of public, since it’s currently only used within `AggregationStatsRule` and exposing it widens the API surface unnecessarily.
- In `partialGroupBy`, you might want to propagate the source stats confidence level (e.g., `result.setConfidence(sourceStats.getConfidence())`) to keep the forwarded estimates consistent with the input rather than relying on the default.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
abf1e32 to
5398a9f
Compare
aditi-pandit
reviewed
Mar 6, 2026
Contributor
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @aaneja. Minor comment.
presto-main-base/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java
Outdated
Show resolved
Hide resolved
5398a9f to
8cba327
Compare
tdcmeehan
approved these changes
Mar 11, 2026
This was referenced Mar 31, 2026
15 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Ported over from https://github.com/trinodb/trino/blob/060d9cbf4e584301b7a2f3faac14afe24ddcf53e/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java
Co-authored-by: Kamil Endruszkiewicz kamil.endruszkiewicz@starburstdata.com
Co-authored-by: @copilot (tests)
Description, Motivation and Context
PARTIAL Aggregation nodes break stats propagation and lead to poorer plans
Impact
Stats estimates for Aggregation nodes are propagated
Test Plan
New test added
Contributor checklist
Release Notes
Summary by Sourcery
Extend aggregation statistics estimation to support more aggregation shapes and planner steps, improving row count and confidence handling for aggregation nodes.
New Features:
Enhancements:
Tests: