Improve aggregation cardinality estimation #19595
fgwang7w wants to merge 2 commits into prestodb:master
Conversation
This PR enhances the aggregation stats model to provide better estimates when aggregate functions are present. Could you please help review? @bmckennaah @rschlussel @prestodb/committers Thank you!
this should be all covered by the groupBy logic. not sure why this extra logic is added here.
Thank you for reviewing. Initially I was planning to use this logic to cover all scalar functions that are bounded by a single row, but you are right, this is extra logic. I will remove it.
Why did you remove this? e.g. partial aggregation output should probably be estimated to be the same as the source stats. But in any case, the node step only shows up in the final plan; at the time when we are making cost-based decisions, the only step we see is SINGLE.
I removed this logic so that the stats calculated during the SINGLE step can, in a sense, "propagate" to the split aggregation phases, where the plan would otherwise show up as having missing estimated stats. In other words, if we keep this logic, the aggregation nodes with PARTIAL and FINAL steps would return NaN stats, which is confusing because it looks as if the aggregation node has unknown estimates even though estimates were computed during the SINGLE step. Do you have any suggestion on how to avoid this? The current cost model derived from stats has this issue: CostCalculatorUsingExchanges cannot derive cost from the PARTIAL step, which returns unknown cost instead.
I see. Instead, could we have only SINGLE/FINAL use these estimates and have PARTIAL/INTERMEDIATE just propagate the estimates from the source table? I think that would be a better reflection of what to expect.
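A minimal sketch of that suggestion, using simplified stand-in types rather than Presto's actual AggregationNode/PlanNodeStatsEstimate classes (all names below are illustrative): SINGLE and FINAL compute the grouped estimate, while PARTIAL and INTERMEDIATE simply pass the source estimate through.

```java
// Simplified, hypothetical stand-ins for the real stats types (illustration only).
enum Step { SINGLE, PARTIAL, INTERMEDIATE, FINAL }

record StatsEstimate(double outputRowCount) {
    static StatsEstimate unknown() { return new StatsEstimate(Double.NaN); }
}

class AggregationStepStatsSketch {
    // SINGLE/FINAL produce the grouped estimate; PARTIAL/INTERMEDIATE propagate source stats.
    static StatsEstimate estimate(Step step, StatsEstimate sourceStats, double groupByNdvProduct) {
        switch (step) {
            case SINGLE:
            case FINAL:
                // Heuristic: grouped row count, capped by the source row count.
                return new StatsEstimate(Math.min(groupByNdvProduct, sourceStats.outputRowCount()));
            case PARTIAL:
            case INTERMEDIATE:
            default:
                return sourceStats;
        }
    }

    public static void main(String[] args) {
        StatsEstimate source = new StatsEstimate(1_000_000);
        System.out.println(estimate(Step.PARTIAL, source, 25).outputRowCount()); // 1000000.0
        System.out.println(estimate(Step.FINAL, source, 25).outputRowCount());   // 25.0
    }
}
```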
presto-main/src/main/java/com/facebook/presto/cost/AggregationStatsRule.java
- use NEGATIVE_INFINITY and POSITIVE_INFINITY to represent the full range.
- aggregations CAN return null on null input, so if the source stats are 100% null and the function is not called on null input, then the nulls fraction here will also be 1.0 (not sure if there are any aggs that return null on mix of null and non-null input, but should be okay to skip this case).
- it's not at all guaranteed that the aggregation columns will be distinct (for the stats estimates I guess it's ok as a heuristic, but NaN would be more accurate. It definitely needs a comment). As an example, if you did SELECT count(*) FROM nation GROUP BY regionkey, every row would have a value of 5.
- as a further optimization for min/max/count that are only one row, we can get the min, max or count of the source stats (see the sketch after this list).
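For illustration, here is a hedged sketch of what those four points could look like together, using a hypothetical ColumnStats record rather than Presto's VariableStatsEstimate; the NDV and nulls-fraction choices are heuristics, as discussed above.

```java
// Hypothetical column-statistics record for illustration only.
record ColumnStats(double low, double high, double nullsFraction, double distinctValues, double averageRowSize) {}

class AggregateColumnEstimateSketch {
    // Heuristic estimate for an aggregate output column:
    // - the value range is unknown, so use the full (-inf, +inf) range;
    // - if every input value is null and the function returns null on null input,
    //   the output nulls fraction is 1.0, otherwise assume 0.0;
    // - the distinct-value count is capped at the output row count, even though
    //   e.g. SELECT count(*) FROM nation GROUP BY regionkey repeats the value 5;
    // - for a single-row min/max/count we could instead reuse the source column's
    //   min, max, or row count (not shown here).
    static ColumnStats estimateAggregateColumn(ColumnStats input, boolean returnsNullOnNullInput, double outputRowCount) {
        double nullsFraction = (input.nullsFraction() == 1.0 && returnsNullOnNullInput) ? 1.0 : 0.0;
        return new ColumnStats(
                Double.NEGATIVE_INFINITY,
                Double.POSITIVE_INFINITY,
                nullsFraction,
                outputRowCount,          // heuristic upper bound on NDV
                input.averageRowSize());
    }
}
```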
#1. Fixed.
#2. If all inputs are null, wouldn't the aggregate of all nulls be a single null?

#3. Yes, if the agg columns are not all distinct, the estimate can only be a heuristic; in that case we return the maximum possible row count so that cost-based decisions can avoid the NaN case.

#4. If there's no global grouping key, yes, we can optimize further for min/max/count, which only return a single-row output.
- agree
- Aggregations can return NULL. But for most queries this is not the case and I do not see how we could understand/leverage that in the optimizer.
- Returning NaN may be more "accurate" but isn't that what leads to the problem we are trying to solve? Ultimately almost all estimates used in the optimizer are "guesses", e.g. Presto uses 0.9 as the selectivity for some single table predicates. Not having some reasonable estimate for the NDV of an aggregate column leads to very bad plans even for simple cases, e.g. "explain select 1 from t1, t2 where a1=a2 and b1 = (select max(b3) from t3 group by a3)". Rewriting a query to get the desired join order is not an option in the vast majority of environments.
- agree
- yeah, i think the heuristic is okay, but it would be good to have a small comment noting that it's a heuristic. I feel like all these stats rules are a combo of things that are definitionally true of the operator (e.g ungrouped aggregation only returns a single row) and some heuristics that we expect to be true more often than not. I don't think we've been very consistent about doing this throughout the stats rules, but for me it's helpful to call out when/why we are using a heuristic so it's easier to understand where the stats estimates come from.
- @fgwang7w your example has null grouping keys - I mean the aggregation inputs are null:
presto:di> with x as (SELECT CAST (null as INTEGER) a, 1 as b) SELECT count(a), sum(a), count(*), b from x group by b;
 _col0 | _col1 | _col2 | b
-------+-------+-------+---
     0 | NULL  |     1 | 1
I was suggesting that since we have the null fraction for a column and info for each function whether it returns null on null input, we could use that to be more precise for all null input if we wanted (for an additional improvement beyond what's covered in this PR)
Thank you for the hints. I refactored the code a little bit; please help review.
feilong-liu
left a comment
Curious, I see in the PR description:
Improve cardinality estimation model for aggregation functions like: min, max, avg, count, sum.
But I do not see any logic specifically related to which aggregation functions are in the aggregation, which confused me. Just to clarify, it's not tied to specific functions but is general for any aggregation, is this correct?
This assumes the input and output row counts are the same when no group-by key stats are available? Just curious if there is any reason/benefit to doing this?
Thanks for reviewing the code. The assumption is that if the agg columns cannot guarantee uniqueness, the agg estimate takes the source output row count as the upper bound to avoid returning NaN, which breaks join reordering.
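A small sketch of that bounding logic under the stated assumption (hypothetical method and variable names, not the exact code in AggregationStatsRule.java):

```java
class OutputRowCountBoundSketch {
    // Heuristic: the grouped row count is the product of the group-by NDVs, but it can
    // never exceed the source row count; if the NDV product is unknown (NaN), fall back
    // to the source row count instead of propagating NaN into the cost model.
    static double boundedOutputRowCount(double groupByNdvProduct, double sourceOutputRowCount) {
        if (Double.isNaN(groupByNdvProduct)) {
            return sourceOutputRowCount;
        }
        return Math.min(groupByNdvProduct, sourceOutputRowCount);
    }
}
```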
@feilong-liu I have updated the PR description to avoid any confusion. Please let me know if you have any further concerns. Thanks!
rschlussel
left a comment
not sure if you've finished updating. don't see some of the changes you said you made.
I believe we should introduce a customizable constant, which can be used to scale the projected upper limit, particularly when statistics are not accessible.
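One possible shape for such a constant, sketched with a plain static field standing in for the configurable property (the PR's diff exposes it via getDefaultAggregateSelectivityCoefficient(session); the 0.9 value below is an arbitrary example):

```java
class AggregateSelectivityCoefficientSketch {
    // Stand-in for a configurable session property; 0.9 is just an example default.
    static final double AGGREGATE_SELECTIVITY_COEFFICIENT = 0.9;

    // Scale the projected upper limit so that, when statistics are weak or missing,
    // the estimate is a discounted fraction of the source row count rather than the
    // full source row count.
    static double scaledUpperBound(double sourceOutputRowCount) {
        return AGGREGATE_SELECTIVITY_COEFFICIENT * sourceOutputRowCount;
    }
}
```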
Thank you @rschlussel @jaystarshot for giving the review. I have refactored the code a bit more. In addition, I revised the cost calculator to reflect the CPU/memory/IO cost derived from the stats estimate. Here's how it would look now.
I think the constant should be multiplied in everywhere the result is updated, like here.
Sure, will factor it in for all cases, thanks.
Improve the cardinality estimation model for aggregation functions like min, max, avg, count, and sum. These are simple deterministic functions that return a single row when there is no grouping. In addition, define the statistics estimation boundary:
1. If the source returns only a single row, the output is guaranteed to be a single row.
2. Use the output row count from the source stats as the upper bound for the estimated output row count, since an Aggregation node cannot return more rows than the estimated row count of its source node.
// TODO implement simple aggregations like: min, max, count, sum
return VariableStatsEstimate.unknown();
// Double.MIN_VALUE is a constant holding the smallest positive nonzero value of type double.
this comment is no longer relevant
// count as the upper bound limit to avoid returning an unknown output potentially breaks the join search space
double outputRowCountFromSource = getDefaultAggregateSelectivityCoefficient(session) * sourceStats.getOutputRowCount();
if (!isNaN(rowsCount)) {
    result.setOutputRowCount(min(rowsCount, outputRowCountFromSource));
I think we want to bound by the total size of the source, not the source * the selectivity coefficient.
My initial thinking was to apply the selectivity factor across all stats, including the output estimation from the source, because if the selectivity factor is 90%, it gives a discounted upper bound for this aggregate operator. Is that the right assumption? If not, I can change the code back so the upper bound ignores the selectivity factor.
}
- double rowsCount = 1;
+ double rowsCount = getDefaultAggregateSelectivityCoefficient(session);
does this handle ungrouped aggregations properly now? looks like we'd get a row count < 1 in this case.
thanks for catching, no this is a bug, will fix
Thanks. make sure this and the other issues here are also covered by tests.
Yes, will do. I am running internal tests; once they all pass, the new commit will be pushed for you to re-review. Thank you for your help!
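A hedged sketch of the fix under discussion (hypothetical names): when there are no grouping keys, the aggregation always produces exactly one row, so the selectivity coefficient must not drive the starting row count below 1.

```java
class UngroupedAggregationRowCountSketch {
    // A global (ungrouped) aggregation returns exactly one row, regardless of any
    // selectivity coefficient; only grouped aggregations start from the scaled value,
    // which is then multiplied by the per-key distinct-value counts.
    static double initialRowCount(int groupingKeyCount, double selectivityCoefficient) {
        if (groupingKeyCount == 0) {
            return 1;
        }
        return selectivityCoefficient;
    }
}
```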
int nullRow = (symbolStatistics.getNullsFraction() == 0.0) ? 0 : 1;
rowsCount *= symbolStatistics.getDistinctValuesCount() + nullRow;
if (!isNaN(symbolStatistics.getAverageRowSize())) {
    totalSize += symbolStatistics.getAverageRowSize() * rowsCount;
This seems incorrect, since rowsCount will change as we go through each column. Probably we want to add all the average row sizes per column and then, at the end, multiply that by whatever row count we've ended up with.
- result.addVariableStatistics(aggregationEntry.getKey(), estimateAggregationStats(aggregationEntry.getValue(), sourceStats));
+ VariableStatsEstimate aggVariableStatsEstimate = estimateAggregationStats(aggregationEntry.getValue(), sourceStats, rowsCount);
+ result.addVariableStatistics(aggregationEntry.getKey(), aggVariableStatsEstimate);
+ totalSize += aggVariableStatsEstimate.getAverageRowSize() * aggVariableStatsEstimate.getDistinctValuesCount();
I'm confused about the totalSize calculation in general here. I think the totalSize should be something like the sum of the average size per column (including both aggregation and group-by columns) * the number of rows.
You are right. So the implementation will be changed to add up the average per-column sizes from the group-by variables first, then sum that with the average sizes from the aggregation columns; the overall sum is multiplied by the number of rows. I added a comment to elaborate on the logic.
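A minimal sketch of that agreed calculation (simplified types, not Presto's builder API): sum the average row sizes of the group-by and aggregation output columns first, then multiply by the final output row count.

```java
import java.util.List;

class OutputSizeSketch {
    // Total output size = (sum of per-column average row sizes across group-by and
    // aggregation output columns) * final output row count. Summing the widths first
    // keeps the result independent of the order in which columns are visited.
    static double totalOutputSizeInBytes(List<Double> groupByColumnAvgSizes,
                                         List<Double> aggregationColumnAvgSizes,
                                         double outputRowCount) {
        double rowWidth = 0;
        for (double size : groupByColumnAvgSizes) {
            rowWidth += size;
        }
        for (double size : aggregationColumnAvgSizes) {
            rowWidth += size;
        }
        return rowWidth * outputRowCount;
    }
}
```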
@fgwang7w do you need any help in merging/testing this PR?

This PR intends to improve cardinality estimation for aggregation.
min, max, avg, count, and sum are simple deterministic aggregate functions; with no grouping keys they return a single row. So if the source returns only a single row, the estimate is guaranteed to be a single-row output, and in general the Aggregation node cannot return more rows than the estimated row count of the source node.
resolve #19354
co-authored-by: bmckennaah
BEFORE: (query plan screenshot)
AFTER: (query plan screenshot)
Query plan improvement on TPC-H: q15
Query plan improvement on TPC-DS:
q4, q11, q14_1, q14_2, q34, q73, q74, q78, q94