ESQL: Extend STATS command to support aggregate expressions#104958
ESQL: Extend STATS command to support aggregate expressions#104958costin merged 8 commits intoelastic:mainfrom
Conversation
|
Hi @costin, I've created a changelog YAML for you. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
2d8be06 to
9beaca9
Compare
There was a problem hiding this comment.
The core of this PR - the changes are larger because:
- the ReplaceDuplicateAggWithEval has been removed and incorporated into the new rule.
- the behavior of CombineProjections has been fixed when dealing with a Project/Aggregate, simplifying the clean-up.
There was a problem hiding this comment.
Removed ReplaceDuplicateAgg and pushing down of limits again.
There was a problem hiding this comment.
The new rule breaks down expressions over aggs into eval so the underlying stats only works on top level aggregations.
While at it, it handles also duplicates to avoid repetitive computation.
This keeps the following rule simple since it's guarantees that an Aggregate will only contain AggregateFunctions not expressions over them.
There was a problem hiding this comment.
Small tweak - no need to return EsqlProject instead Project since the tree is already resolved.
There was a problem hiding this comment.
I've updated the temporary name to make it a bit less confusing/repetitive.
There was a problem hiding this comment.
Unrelated change - couldn't help not correct it to save the method invocation and replace the noise map with a good-ol' reliable iteration.
There was a problem hiding this comment.
The gist of this change takes care of the scenario creates by ReplaceStatsAggExpressionWithEval:
stats x = sum(), y = count()
project x, x as a, y
The combine rule previously would combine project into stats which would duplicate the count:
stats x = sum(), a = sum(), y = count()
It removes the project (which is cheap) but duplicates the sum (which is expensive and the reason we didn't want to duplicate it in the first place).
The rule thus tracks is there's any new alias - however it keeps on removing unused aggregations and in case of basic aliasing project, removes the project.
So the following
stats x = sum(), y = count()
project x as a
becomes
stats a = sum()
There was a problem hiding this comment.
Update the generator name strategy to be a bit more meaningful.
There was a problem hiding this comment.
The core of this PR - breaks down the expression over aggs and adds an eval lazily only for the fields that have an expression over aggregate functions.
There was a problem hiding this comment.
We could improve this a bit further for a referencing case: just like we now support | eval x = field, y = x + 1, we could now (but don't yet) support something like: | stats x = max(field), y = x + min(field) -- this now fails because column x isn't known.
There was a problem hiding this comment.
That would be a nice little feature - raised #105102 as a follow-up.
There was a problem hiding this comment.
Handled by ReplaceStatsAggExpressionWithEval
Previously only aggregate functions (max/sum/etc..) were allowed inside
the stats command. This PR allows expressions involving one or multiple
aggregates to be used, such as:
stats x = avg(salary % 3) + max(emp_no),
y = min(emp_no / 3) + 10 - median(salary)
by z = languages % 2
9beaca9 to
47389a0
Compare
|
|
||
| nestedAggsNoGrouping | ||
| FROM employees | ||
| | STATS x = AVG(salary) /2 + MAX(salary), a = AVG(salary), m = MAX(salary) |
There was a problem hiding this comment.
@costin I understand that we tag some of these tests to be included as examples in the docs. Just wondering what the workflow was with Abdon to add these tags? Was it just simply a ping to alert the writer that we want this example in the docs? :)
(Might need to be slightly more explicit about the workflow in general just because I haven't that rhythm yet)
There was a problem hiding this comment.
The tests above are meant for internal consumption hence the ping - better to create other tests, that fix the general dataset and the rest of examples in the docs and add them in, as a separate PR after this one gets merged.
See the previous PRs authored by Abdon.
|
Hi @costin, I've created a changelog YAML for you. |
astefan
left a comment
There was a problem hiding this comment.
LGTM
Left only minor comments. Maybe the one related to an additional test to be of more importance.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Verifier.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
Show resolved
Hide resolved
alex-spies
left a comment
There was a problem hiding this comment.
Gave this a first round, focusing on tests this time. I'll give this another go tomorrow.
Two observations:
- I think there's bugs for the following cases:
stats max(l) by l=languages(verification exception) and, more severely,stats max(languages) + languages by l = languages(NPE,Cannot invoke \"org.elasticsearch.xpack.esql.planner.Layout$ChannelAndType.channel()\" because the return value of \"org.elasticsearch.xpack.esql.planner.Layout.get(org.elasticsearch.xpack.ql.expression.NameId)\" is null"); the latter works fine without the alias. - This allows shenanigans like
stats max(languages) + languages by languages(using a grouping in the expression), but notstats languages + 1 by languages; although that may be something for a follow-up, if we want to allow this.
Other than that I have mostly minor remarks; the tests do not fully assert the (complex) expressions that are being constructed, maybe we should be stricter there.
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerTests.java
Show resolved
Hide resolved
| |stats x by 1 | ||
| """)); | ||
|
|
||
| assertThat(e.getMessage(), containsString("aggregate function")); |
There was a problem hiding this comment.
Suuuuper nit:
Shouldn't the error message be expected an aggregate function or group here as well? [x] is not an aggregate function technically implies that this should be replaced by an agg function.
| * \_Eval[[____x_AVG@9efc3cf3_SUM@daf9f221{r}#18 / ____x_AVG@9efc3cf3_COUNT@53cd08ed{r}#19 AS __x_AVG@9efc3cf3, __x_AVG@ | ||
| * 9efc3cf3{r}#16 / 2[INTEGER] + __x_MAX@475d0e4d{r}#17 AS x]] | ||
| * \_Limit[500[INTEGER]] | ||
| * \_Aggregate[[],[SUM(salary{f}#11) AS ____x_AVG@9efc3cf3_SUM@daf9f221, COUNT(salary{f}#11) AS ____x_AVG@9efc3cf3_COUNT@53cd0 |
There was a problem hiding this comment.
Nit, low prio: The generated attribute names make optimized plans pretty hard to read/grasp, due to the leading underscores and @9efc3cf3 thingies. Maybe we could streamline the names?
E.g.
\_Aggregate[[],[SUM(salary{f}#11) AS x_AVG_SUM, COUNT(salary{f}#11) AS x_AVG_COUNT, MAX(salary{f}#11) AS x_MAX
We could still append a number in case of conflict.
There was a problem hiding this comment.
I've revisited the naming strategy to rely on a counter across the entire rule - hopefully this avoids clashes and improves readability.
| // sum/count to compute avg | ||
| var div = as(fields.get(0).child(), Div.class); | ||
| // avg + max | ||
| var add = as(fields.get(1).child(), Add.class); |
There was a problem hiding this comment.
We're technically not asserting the whole expression here - maybe better to just assert the expression string?
Applies to all added tests in this file.
| * PERCENTILE(salary{f}#1928,50[INTEGER]) AS __y_MEDIAN@705fccec]] | ||
| * \_Eval[[languages{f}#1926 % 2[INTEGER] AS z, | ||
| * salary{f}#1928 % 3[INTEGER] AS ____x_AVG@e03a7a5c_AVG@e03a7a5c, | ||
| * emp_no{f}#1923 / 3[INTEGER] AS ____y_MIN@80cee21c_MIN@80cee21c]] |
There was a problem hiding this comment.
nit: the attribute names are constructed from the aggs where they will be used, but not from the expression where they will be used; this makes it a bit hard to read this bottom-up (and top-down is also hard). I think it will also lead to tricky attribute names in cases like y = min(emp_no /3) - min(emp_no + 2), as both attributes will be called ____y_MIN@...._MIN@.....
Maybe this would be more consistent?
| * emp_no{f}#1923 / 3[INTEGER] AS ____y_MIN@80cee21c_MIN@80cee21c]] | |
| * emp_no{f}#1923 / 3[INTEGER] AS y_SUB1_MIN]] |
(resp. ...SUB0... for the left hand side)
| |stats 1 | ||
| """)); | ||
|
|
||
| assertThat(e.getMessage(), containsString("expected an aggregate function or group")); |
There was a problem hiding this comment.
The asserts with the same exception messages can be factored out.
| e = min(salary), | ||
| f = max(salary), | ||
| g = max(salary) | ||
| by w = languages % 2 |
There was a problem hiding this comment.
Perhaps one of these test cases can be modified to use different groupings, for example, by auto_bucket(emp_no, 10, 1, 10000), to increase the coverage a bit more.
| * \_Eval[[languages{f}#37 % 2[INTEGER] AS w]] | ||
| * \_EsRelation[test][_meta_field{f}#40, emp_no{f}#34, first_name{f}#35, ..] | ||
| */ | ||
| public void testStatsExpOverAggsWithScalarAndDuplicateAggs() { |
There was a problem hiding this comment.
Do we want to support removing duplicated expressions over aggregations? Like below:
| stats x = avg(salary) /2 + max(salary) , y = avg(salary) /2 + max(salary)
It seems like we recalculate expression part for x and y, the duplicated aggregations - avg and max are not recalculated. Detecting equivalent expressions over aggregations could be more complicated than detecting equivalent aggregations. Perhaps it is because CombineProjections does not check the pattern created by ReplaceStatsAggExpressionWithEval, CombineProjections checks project over project, this case has the pattern of project over eval over project over eval. Just to take a note here, not sure if it worth supporting it.
There was a problem hiding this comment.
I think this would be covered as part of our plan to eliminate common (sub-) expressions: #103301
alex-spies
left a comment
There was a problem hiding this comment.
Gave it another round, focusing on the optimizer code this time. I didn't find anything not already covered by others' remarks. Looks good, very neat feature!
| * Replace nested expressions over aggregates with synthetic eval post the aggregation | ||
| * stats a = sum(a) + min(b) by x | ||
| * becomes | ||
| * stats a1 = sum(a), a2 = min(b) by x | eval a = a1 + a2 | keep a, x | ||
| * | ||
| * Since the logic is very similar, this rule also handles duplicate aggregate functions to avoid duplicate compute | ||
| * stats a = min(x), b = min(x), c = count(*), d = count() by g | ||
| * becomes | ||
| * stats a = min(x), c = count(*) by g | eval b = a, d = c | keep a, b, c, d, g |
There was a problem hiding this comment.
++ to the examples in the javadoc, very useful.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Verifier.java
Outdated
Show resolved
Hide resolved
| String name = expression instanceof NamedExpression ne | ||
| ? ne.name() | ||
| : expression.nodeName() + "@" + Integer.toHexString(expression.hashCode()); | ||
| return "__" + name + "_" + af.functionName() + "@" + Integer.toHexString(af.hashCode()); |
There was a problem hiding this comment.
nit: not sure if this makes it easier to read, but an alternative to unify the way the nodes are named:
| String name = expression instanceof NamedExpression ne | |
| ? ne.name() | |
| : expression.nodeName() + "@" + Integer.toHexString(expression.hashCode()); | |
| return "__" + name + "_" + af.functionName() + "@" + Integer.toHexString(af.hashCode()); | |
| Function<Expression, String> nf = e -> (e == af ? af.functionName() : e.nodeName()) + "@" + Integer.toHexString(e.hashCode()); | |
| String name = expression instanceof NamedExpression ne ? ne.name() : nf.apply(expression); | |
| return "__" + name + "_" + nf.apply(af); |
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
Show resolved
Hide resolved
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
Outdated
Show resolved
Hide resolved
There's a subtle bug when dealing with an aliased grouping which sometimes popped up because the validation allowed the grouping column in some queries. |
| // the agg doesn't exist in the Aggregate, create an alias for it and save its attribute | ||
| if (attr == null) { | ||
| var temporaryName = temporaryName(agg, af); | ||
| var temporaryName = temporaryName(af, agg, counter[0]++); |
There was a problem hiding this comment.
Changed the strategy to indicate the inner and outer expression as that can differ across rules - in some an aggregation is an inner expressions while in others is the outer one.
| static String temporaryName(Expression inner, Expression outer, int suffix) { | ||
| String in = toString(inner); | ||
| String out = toString(outer); | ||
| return "$$" + in + "$" + out + "$" + suffix; | ||
| } |
There was a problem hiding this comment.
Opted for $$a$ instead of _ as that was used to replace spaces.
| * Project[[a{r}#5, b{r}#9, $$max(salary)_+_3>$COUNT$2{r}#46 AS d, $$count(salary)_->$MIN$3{r}#47 AS e, $$avg(salary)_+_m | ||
| * >$MAX$1{r}#45 AS g]] | ||
| * \_Eval[[$$$$avg(salary)_+_m>$AVG$0$SUM$0{r}#48 / $$max(salary)_+_3>$COUNT$2{r}#46 AS $$avg(salary)_+_m>$AVG$0, $$avg( | ||
| * salary)_+_m>$AVG$0{r}#44 + $$avg(salary)_+_m>$MAX$1{r}#45 AS a, $$avg(salary)_+_m>$MAX$1{r}#45 + 3[INTEGER] + | ||
| * 3.141592653589793[DOUBLE] + $$max(salary)_+_3>$COUNT$2{r}#46 AS b]] | ||
| * \_Limit[500[INTEGER]] | ||
| * \_Aggregate[[w{r}#28],[SUM(salary{f}#39) AS $$$$avg(salary)_+_m>$AVG$0$SUM$0, MAX(salary{f}#39) AS $$avg(salary)_+_m>$MAX$1 | ||
| * , COUNT(salary{f}#39) AS $$max(salary)_+_3>$COUNT$2, MIN(salary{f}#39) AS $$count(salary)_->$MIN$3]] | ||
| * \_Eval[[languages{f}#37 % 2[INTEGER] AS w]] | ||
| * \_EsRelation[test][_meta_field{f}#40, emp_no{f}#34, first_name{f}#35, ..] | ||
| */ |
There was a problem hiding this comment.
An example of the new naming strategy.
## Summary Sync with elastic/elasticsearch#104958 for support of builtin fn in STATS * validation ✅ * autocomplete ✅ * also fixed `STATS BY <field>` syntax  Sync with elastic/elasticsearch#104913 for new `log` function * validation ✅ - also warning for negative values * autocomplete ✅  Sync with elastic/elasticsearch#105064 for removal of `PROJECT` command * validation ✅ (both new and legacy syntax supported) * autocomplete ✅ (will only suggest new syntax)  Sync with elastic/elasticsearch#105221 for removal of mandatory brackets for `METADATA` command option * validation ✅ (added warning deprecation message when using brackets) * autocomplete ✅  Sync with elastic/elasticsearch#105224 for change of syntax for ENRICH ccq mode * validation ✅ * autocomplete ✅ (not directly promoted, the user has to type `_` to trigger it) * hover ✅ * code actions ✅   Do not merge until those 5 get merged. Additional things in this PR: * Added more tests for `callbacks` not passed scenario * covered more cases like those with `dissect` * Added more tests for signature params number (calling a function with an extra arg should return an error) * Cleaned up some more unused code * Improved messages on too many arguments for functions ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
## Summary Sync with elastic/elasticsearch#104958 for support of builtin fn in STATS * validation ✅ * autocomplete ✅ * also fixed `STATS BY <field>` syntax  Sync with elastic/elasticsearch#104913 for new `log` function * validation ✅ - also warning for negative values * autocomplete ✅  Sync with elastic/elasticsearch#105064 for removal of `PROJECT` command * validation ✅ (both new and legacy syntax supported) * autocomplete ✅ (will only suggest new syntax)  Sync with elastic/elasticsearch#105221 for removal of mandatory brackets for `METADATA` command option * validation ✅ (added warning deprecation message when using brackets) * autocomplete ✅  Sync with elastic/elasticsearch#105224 for change of syntax for ENRICH ccq mode * validation ✅ * autocomplete ✅ (not directly promoted, the user has to type `_` to trigger it) * hover ✅ * code actions ✅   Do not merge until those 5 get merged. Additional things in this PR: * Added more tests for `callbacks` not passed scenario * covered more cases like those with `dissect` * Added more tests for signature params number (calling a function with an extra arg should return an error) * Cleaned up some more unused code * Improved messages on too many arguments for functions ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
## Summary Sync with elastic/elasticsearch#104958 for support of builtin fn in STATS * validation ✅ * autocomplete ✅ * also fixed `STATS BY <field>` syntax  Sync with elastic/elasticsearch#104913 for new `log` function * validation ✅ - also warning for negative values * autocomplete ✅  Sync with elastic/elasticsearch#105064 for removal of `PROJECT` command * validation ✅ (both new and legacy syntax supported) * autocomplete ✅ (will only suggest new syntax)  Sync with elastic/elasticsearch#105221 for removal of mandatory brackets for `METADATA` command option * validation ✅ (added warning deprecation message when using brackets) * autocomplete ✅  Sync with elastic/elasticsearch#105224 for change of syntax for ENRICH ccq mode * validation ✅ * autocomplete ✅ (not directly promoted, the user has to type `_` to trigger it) * hover ✅ * code actions ✅   Do not merge until those 5 get merged. Additional things in this PR: * Added more tests for `callbacks` not passed scenario * covered more cases like those with `dissect` * Added more tests for signature params number (calling a function with an extra arg should return an error) * Cleaned up some more unused code * Improved messages on too many arguments for functions ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Fix #117770 Fix #117784 #117699 made changes to how we plan aggregations which were supposed to only trigger when a query contained a `CATEGORIZE`, but accidentally changed a code path that seems to only be required for interoperability with pre-8.13 nodes. Because of this, we didn't notice failing tests until the periodic bwc tests ran. The code this PR fixes addresses situations where `Aggregate` plan nodes contained _aliases inside the aggregates_. On `main` and `8.x`, this is effectively an illegal state: since #104958, aliases in the aggregates become `Eval` nodes before and after the `Aggregate` node. But here, on 8.x, we'll just fix this code path so that it behaves exactly as before #117699. If this passes the full-bwc test, I plan to forward-port this by removing the obsolete code path on `main`.
Previously only aggregate functions (max/sum/etc..) were allowed inside
the stats command. This PR allows expressions involving one or multiple
aggregates to be used, such as: