ES|QL: Improve aggregation over constants handling #112392
astefan wants to merge 26 commits into elastic:main
Duplicate SubstituteSurrogate in "Operator Optimization" batch
Many more tests
Add tests for mad
Add mv handling to top function
| null |null |null | ||
| ; | ||
| ########### failing :-( with InvalidArgumentException: Does not support yet aggregations over constants |
This will be removed once we have an mv_ function for st_centroid_agg.
| @@ -197,11 +202,18 @@ public AggregatorFunctionSupplier supplier(List<Integer> inputChannels) { | |||
| public Expression surrogate() { | |||
The surrogate method, as it stands now, is more a "surrogate-expression-for-foldable-scenario" kind of method. This implies that the behavior that existed below before this change is not possible anymore.
Right - we should probably look into introducing a different interface altogether: surrogate was initially used for expressions that knew they'd be transformed.
But it evolved into a mechanism for "folding", not to a value but to another expression (which itself might or might not be foldable).
I'll leave this one for a follow up I think.
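To illustrate the "fold to another expression" idea discussed in this thread, here is a minimal, self-contained Java sketch. It is not the Elasticsearch code: every class name below (`SurrogateSketch`, `Constant`, `FieldRef`, `MvAvg`, `Avg`) is a hypothetical stand-in, and the only behavior borrowed from the PR is that an aggregate over a foldable input hands back a multivalue scalar expression, while anything else returns null.

```java
import java.util.List;

// Hypothetical toy model; none of these are the real Elasticsearch classes.
public class SurrogateSketch {

    public interface Expression {
        boolean foldable();
    }

    // A constant such as the literal [1.0, 2.0]; always foldable.
    public static final class Constant implements Expression {
        public final List<Double> values;

        public Constant(List<Double> values) {
            this.values = values;
        }

        @Override
        public boolean foldable() {
            return true;
        }
    }

    // A reference to an index field; never foldable.
    public static final class FieldRef implements Expression {
        public final String name;

        public FieldRef(String name) {
            this.name = name;
        }

        @Override
        public boolean foldable() {
            return false;
        }
    }

    // Scalar multivalue average; foldable whenever its input is.
    public static final class MvAvg implements Expression {
        public final Expression field;

        public MvAvg(Expression field) {
            this.field = field;
        }

        @Override
        public boolean foldable() {
            return field.foldable();
        }
    }

    // Aggregate average that can hand back a replacement expression.
    public static final class Avg {
        public final Expression field;

        public Avg(Expression field) {
            this.field = field;
        }

        // Returns a replacement expression, or null when no rewrite applies.
        public Expression surrogate() {
            return field.foldable() ? new MvAvg(field) : null;
        }
    }
}
```

In this toy model the result of `surrogate()` is an expression rather than a value, which is the distinction the comment above draws between "surrogate" and ordinary constant folding.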
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
| * All aggregate functions that are also nullable (COUNT_DISTINCT and COUNT are exceptions), will get a NULL | ||
| * field replacement by the FoldNull rule, COUNT_DISTINCT will benefit from PropagateEvalFoldables. | ||
| */ | ||
| public final class ReplaceAggregatesWithNull extends OptimizerRules.OptimizerRule<Aggregate> { |
This rule is a simplified variant of SubstituteSurrogates.
| Map<AggregateFunction, Attribute> aggFuncToAttr = new HashMap<>(); // existing aggregate and their respective attributes | ||
| List<Alias> transientEval = new ArrayList<>(); // surrogate functions eval | ||
| boolean changed = false; | ||
| boolean hasSurrogates = false; |
I've done this to short-circuit the rule earlier in its execution.
| assertEquals(countd, rule.rule(countd)); | ||
| countd = new CountDistinct(EMPTY, NULL, NULL); | ||
| assertEquals(new Literal(EMPTY, null, LONG), rule.rule(countd)); | ||
| assertEquals(countd, rule.rule(countd)); |
This is the consequence of CountDistinct not being nullable anymore.
costin left a comment
LGTM - great tests and comments!
| @Override | ||
| public Nullability nullable() { | ||
| return Nullability.FALSE; |
| @Override | ||
| public Expression surrogate() { | ||
| return field().foldable() ? field() : null; |
Values not only merges values, but also removes duplicates (If no test was triggered because of this, we should add some!)
ROW x = [1, 1, 2] | STATS a = VALUES(x)
-> [1, 2]
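The difference can be shown with a small, self-contained Java sketch. It is not Elasticsearch code: the class and method names are hypothetical, and first-seen ordering of the deduplicated result is an assumption made only for this illustration. The first method mirrors the `field().foldable() ? field() : null` shortcut from the diff; the second mirrors the deduplicating behavior described in the comment above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only; these are not Elasticsearch classes.
public class ValuesOverConstantSketch {

    // Mirrors the shortcut in the diff: the constant is returned unchanged.
    public static List<Integer> foldToField(List<Integer> constant) {
        return new ArrayList<>(constant);
    }

    // What the review comment says VALUES does: duplicates are removed.
    // First-seen order is an assumption made for this sketch.
    public static List<Integer> valuesSemantics(List<Integer> constant) {
        return new ArrayList<>(new LinkedHashSet<>(constant));
    }

    public static void main(String[] args) {
        List<Integer> x = List.of(1, 1, 2);
        System.out.println(foldToField(x));
        System.out.println(valuesSemantics(x));
    }
}
```

For the input `[1, 1, 2]` the two methods disagree, which is why returning the field as-is is not a safe replacement for VALUES over a constant.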
Good catch, BUT I think we have a problem with the documentation. It's not mentioning this aspect. There were other misses in our functions docs (which are fixed in this PR), I think we need to review our documentation on functions and double check its correctness and completeness. I will create an issue.
@ivancea thank you for pro-actively checking this PR 🙏, that was very helpful.
I've created two issues:
- improving our documentation: ES|QL: review, double check and add missing bits to functions documentation #112437
- this PR also now depends on the pending mv_values addition: ES|QL: add mv_values function #112445
alex-spies left a comment
This is great @astefan ! I think this change is sound and added mostly minor remarks.
My only major remark is: I think we need LogicalPlanOptimizerTests cases that prove that the foldable propagation actually takes place. The csv tests are great, but they do not prove that foldable propagation actually takes place, only that the result is correct.
But you already mentioned more optimizer tests as one of the tasks to un-draft :)
x-pack/plugin/esql/qa/testFixtures/src/main/resources/meta.csv-spec
Out of scope: stats.csv-spec has a bunch of ...OfConst tests that overlap a lot with the tests here, except that they normally start with from employees. Because these test stats more than row, maybe we should move test cases like row ... | stats ... from here to stats.csv-spec in a follow-up PR.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
| } else if (p instanceof Aggregate agg) { | ||
| List<NamedExpression> newAggs = new ArrayList<>(agg.aggregates().size()); | ||
| agg.aggregates().forEach(e -> { | ||
| if (Alias.unwrap(e) instanceof AggregateFunction) { |
This looks like it cannot propagate into the groups, as in
... | eval x = [1,2,3] | stats sum(field) by x
right? Maybe it's worth adding a comment.
That's another thing we could optimize though if needed, as I think STATS ... BY const is the same as STATS ... | eval x = mv_values(const) | mv_expand x. Not sure that's worth maintaining an optimization rule for, though.
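The suggested equivalence can be sketched in plain Java. This is a toy model, not the ES|QL engine: the class and method names are hypothetical, and it assumes that a row grouped by a multivalued key lands once in the group of each distinct value, which is how the comment above reads.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model of "STATS sum(field) BY const"; not ES|QL code.
public class StatsByConstantSketch {

    // Direct grouping: every row lands in one group per distinct key value.
    public static Map<Integer, Long> sumByConstant(List<Long> field, List<Integer> constantKey) {
        Map<Integer, Long> groups = new TreeMap<>();
        for (long value : field) {
            for (int key : new TreeSet<>(constantKey)) {
                groups.merge(key, value, Long::sum);
            }
        }
        return groups;
    }

    // Rewritten form: aggregate once, then expand the deduplicated constant.
    public static Map<Integer, Long> sumThenExpand(List<Long> field, List<Integer> constantKey) {
        long total = 0;
        for (long value : field) {
            total += value;
        }
        Map<Integer, Long> groups = new TreeMap<>();
        for (int key : new TreeSet<>(constantKey)) {
            groups.put(key, total);
        }
        return groups;
    }
}
```

The two toy methods agree for non-empty input. With an empty `field` list they differ (no groups versus one group per key), so a real rule would need to check that case against the engine's actual behavior.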
Yeah, it's not covered. That was unintentional; I just didn't think about this use case.
Will leave it for a follow up, though. There are many things going on in this PR.
...ql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/ReplaceAggregatesWithNull.java
...ql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/ReplaceAggregatesWithNull.java
...ql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/ReplaceAggregatesWithNull.java
| } else { | ||
| // All aggs actually have been optimized away | ||
| // \_Aggregate[[],[AVG([NULL][NULL]) AS s]] | ||
| // Replace by a local relation with one row, followed by an eval, e.g. | ||
| // \_Eval[[MVAVG([NULL][NULL]) AS s]] | ||
| // \_LocalRelation[[{e}#21],[ConstantNullBlock[positions=1]]] | ||
| plan = new LocalRelation( | ||
| source, | ||
| List.of(new EmptyAttribute(source)), | ||
| LocalSupplier.of(new Block[] { BlockUtils.constantBlock(PlannerUtils.NON_BREAKING_BLOCK_FACTORY, null, 1) }) | ||
| ); | ||
| } |
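The shape of that replacement (a one-row relation with an eval on top) can be mimicked with a small Java sketch. This is a toy model with hypothetical names, not the planner API; rows are modeled as maps, and `MV_AVG(null)` is assumed to evaluate to null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical toy model; not the Elasticsearch planner API.
public class AllAggsFoldedSketch {

    // Stand-in for the local relation: exactly one row, no useful columns.
    public static List<Map<String, Object>> oneRowRelation() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(new LinkedHashMap<>());
        return rows;
    }

    // Stand-in for the eval: appends one computed column to every input row.
    public static List<Map<String, Object>> eval(List<Map<String, Object>> rows, String name, Supplier<Object> value) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            copy.put(name, value.get());
            out.add(copy);
        }
        return out;
    }

    // Models the rewritten plan from the code comment: an eval of a null
    // average named "s" on top of the one-row relation.
    public static List<Map<String, Object>> foldedAvgOfNull() {
        return eval(oneRowRelation(), "s", () -> null);
    }
}
```

The sketch still yields a single row with the column `s`, which is what an ungrouped aggregate would have returned, even though no aggregation runs.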
The code in lines 86-106 also happens in SubstituteSurrogates, and kinda-sorta also in ReplaceStatsAggExpressionWithEval. I opened #110345 but maybe, instead of reducing the number of opt. rules, we should just refactor the code path that moves expressions out of aggregates and into evals. We could start here and make sure the code is the same as in SubstituteSurrogates.
…aggregations_over_constants
…aggregations_over_constants
Introduce isConstantFoldable() for aggregate functions
…aggregations_over_constants
alex-spies left a comment
Heya, just wanted to pick this up again and summarize what I think we need to do:
- Per the discussion with Costin, percentile(field, null) is still not well defined (null vs invalid query) - okay to hash this out in a follow-up but maybe invalidating for now is safer w.r.t. bwc.
- ST_CENTROID_AGG(null) should return null instead of Point(NaN NaN).
- Some additional test cases won't hurt.
- Fixing this edge case where all agg functions are optimized away
Consider this unblocked from my side as my main reason for requesting changes was the discussion about percentile(field, null) and similar cases. We could solve some problems in follow-up PRs as well, as I think the general approach here works :)
We don't have to hash this out now, but maybe it'd be safer to start with a validation exception now - we can still go back and return null later, while the other way around could be considered a breaking change, albeit in a very edge case scenario.
…aggregations_over_constants
…aggregations_over_constants
Is this PR superseded by #139797 or is it something we're planning to do additionally?
I haven't checked the other PR; I can only speak about this one, which I created and explored some time ago.
I'll take a look in case the other PR missed something we should move 👀
This change consists of:
- SubstituteSurrogate in "Operator Optimization" batch
- median_absolute_deviation function
- top function
- PropagateEvalFoldables rule to also deal with aggregate functions

Addresses part of #100634. Missing bits:
- AwaitsFixes from LogicalPlanOptimizerTests cannot be removed yet. This likely also has to do with the substitutions() batch -> NormalizeAggregate() rule (that needs to be added?) in LogicalPlanOptimizer
- median_absolute_deviation is still pending on having its own mv_ function
- mv_* function still needs addressing
- mv_values sister function is pending addition

Fixes #110257
Fixes #104430
Fixes #100170
Needs more tests for the new rule and the existing ones in LogicalPlanOptimizerTests.