ESQL: Strings support for MAX and MIN aggregations #111544
ivancea merged 25 commits into elastic:main
# Conflicts:
#   x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Max.java
#   x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Min.java
#   x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/aggregate/MaxTests.java
#   x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/aggregate/MinTests.java
Pinging @elastic/es-analytical-engine (Team:Analytics)
Pinging @elastic/kibana-esql (ES|QL-ui)
Hi @ivancea, I've created a changelog YAML for you.
@elasticmachine update branch
...gin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/BytesRefArrayState.java
from apps
| eval x = version
| where id > 2
| stats max(version), a = max(version), b = max(x), c = max(case(name == "iiiii", "100.0.0"::version, version));
Is this consistent with _search if you sort by the version field? Version goes through a lot to encode itself in a way where sorting the bytes does a nice semver sort and I don't recall precisely what we did about that in ESQL.
It looks like we preserve that sorting.
Yes, same as _search. It also uses the same logic as GreaterThan/LessThan/MvSort/SORT (the compareTo()).
@Aggregator({ @IntermediateState(name = "max", type = "BYTES_REF"), @IntermediateState(name = "seen", type = "BOOLEAN") })
@GroupingAggregator
class MaxBytesRefAggregator {
I think it's worth a comment saying that we're comparing the raw bytes representation of the BytesRef. That should be a valid and good sort for most things because we try to represent them that way. But it's not always the kind of sort you want and it's worth calling it out in javadoc.
Added some comments to both aggregators, explaining that they use the bytes natural order
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/MaxBytesRefAggregator.java
name = "field",
type = { "boolean", "double", "integer", "long", "date", "ip", "keyword", "text", "long", "version" }
) Expression field
) {
Do you think it's worth adding a NOTE to the docs that the MAX of a keyword and text field is the highest value, sorted by the utf-8 representation? That's the behavior we're committing to here, and I could see a world where folks will need collations. But the utf-8 one is a useful default.
Right now, other functions I checked use this same logic (the BytesRef compareTo), and behave the same as the SORT command.
So, I'm not sure adding this here would make much sense. If we want to explain it, I wonder if it would be better at ESQL level, instead of at function level
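To make the ordering under discussion concrete, here is a minimal sketch of "highest value by UTF-8 representation". It is not the aggregator code from this PR: the class and method names are hypothetical, and it assumes the comparison is an unsigned, byte-by-byte one, modelled here with the JDK's `Arrays.compareUnsigned`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the PR.
final class Utf8OrderSketch {
    /**
     * Returns the greatest string when values are compared by the unsigned
     * bytes of their UTF-8 encoding, or null if the list is empty.
     */
    static String maxByUtf8Bytes(List<String> values) {
        byte[] best = null;
        for (String value : values) {
            byte[] candidate = value.getBytes(StandardCharsets.UTF_8);
            // Assumption: unsigned lexicographic byte order stands in for
            // the BytesRef comparison mentioned in the thread.
            if (best == null || Arrays.compareUnsigned(candidate, best) > 0) {
                best = candidate;
            }
        }
        return best == null ? null : new String(best, StandardCharsets.UTF_8);
    }
}
```

Under this ordering, "a" (0x61) ranks above "Z" (0x5A), and "é" (0xC3 0xA9) ranks above "z" (0x7A). A locale-aware collation could order these differently, which is the trade-off raised in the comment above.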
nik9000 left a comment
LGTM. Though I'd modify the description to change the note about using raw arrays.
}

final boolean hasValue(int groupId) {
boolean hasValue(int groupId) {
I think we wouldn't want to extend this and get seen if we're overriding this method.
Oh God. Fixed! Added instead a single boolean state, just to know whether using a vector or a block in toBlock()
Support Version, Keyword and Text in Max and Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a new max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores its own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, to avoid having to copy and potentially grow them again
- It's using raw arrays. But maybe it should use BigArrays so the memory is accounted for in the circuit breaker?

Part of #110346
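The state handling described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's code: `SingleMaxBytesState` and `GroupedMaxBytesState` are hypothetical names, plain `byte[]` buffers stand in for BytesRef, group ids are assumed to be non-negative, and there is no circuit-breaker accounting.

```java
import java.util.Arrays;

// Hypothetical stand-in for the non-grouping state: one reusable buffer.
final class SingleMaxBytesState {
    private byte[] buffer = new byte[0];
    private int length;
    private boolean seen;

    void add(byte[] value) {
        // Keep the current max unless the new value is strictly greater
        // in unsigned byte order.
        if (seen && Arrays.compareUnsigned(value, 0, value.length, buffer, 0, length) <= 0) {
            return;
        }
        if (buffer.length < value.length) {
            buffer = new byte[value.length]; // grow only; never shrink
        }
        System.arraycopy(value, 0, buffer, 0, value.length);
        length = value.length;
        seen = true;
    }

    /** Returns a copy of the current max, or null if nothing was added. */
    byte[] max() {
        return seen ? Arrays.copyOf(buffer, length) : null;
    }

    int capacity() {
        return buffer.length;
    }
}

// Hypothetical stand-in for the grouping state: one slot per group,
// null until that group sees its first value.
final class GroupedMaxBytesState {
    private SingleMaxBytesState[] groups = new SingleMaxBytesState[0];

    void add(int groupId, byte[] value) {
        if (groupId >= groups.length) {
            groups = Arrays.copyOf(groups, groupId + 1); // new slots are null
        }
        if (groups[groupId] == null) {
            groups[groupId] = new SingleMaxBytesState();
        }
        groups[groupId].add(value);
    }

    boolean hasValue(int groupId) {
        return groupId < groups.length && groups[groupId] != null;
    }

    byte[] max(int groupId) {
        return hasValue(groupId) ? groups[groupId].max() : null;
    }
}
```

Copying into a buffer the state owns means the result does not alias the caller's bytes, and keeping a larger buffer after a shorter max arrives avoids reallocating if a longer value shows up later. The min variant would only flip the comparison.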