Skip to content

ESQL: Strings support for MAX and MIN aggregations#111544

Merged
ivancea merged 25 commits intoelastic:mainfrom
ivancea:max-min-aggs-bytesref
Aug 20, 2024
Merged

ESQL: Strings support for MAX and MIN aggregations#111544
ivancea merged 25 commits intoelastic:mainfrom
ivancea:max-min-aggs-bytesref

Conversation

@ivancea
Copy link
Contributor

@ivancea ivancea commented Aug 2, 2024

Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:

  • Store a BytesRef
  • When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:

  • Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
  • Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:

  • It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
  • It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of #110346

@ivancea ivancea added >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI labels Aug 2, 2024
@ivancea ivancea requested a review from nik9000 August 2, 2024 10:30
@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2024

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

@elasticsearchmachine
Copy link
Collaborator

Hi @ivancea, I've created a changelog YAML for you.

@ivancea
Copy link
Contributor Author

ivancea commented Aug 2, 2024

@elasticmachine update branch

from apps
| eval x = version
| where id > 2
| stats max(version), a = max(version), b = max(x), c = max(case(name == "iiiii", "100.0.0"::version, version));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this consistent with _search if you sort by the version field? Version goes through a lot to encode itself in a way where sorting the bytes does a nice semver sort and I don't recall precisely what we did about that in ESQL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we preserve that sorting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, same as _search. It also uses the same logic as GreaterThan/LesserThan/MvSort/SORT (The compareTo())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


@Aggregator({ @IntermediateState(name = "max", type = "BYTES_REF"), @IntermediateState(name = "seen", type = "BOOLEAN") })
@GroupingAggregator
class MaxBytesRefAggregator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth a comment saying that we're comparing the raw bytes representation of the BytesRef. That should be a valid and good sort for most things because we try to represent them that way. But it's not always the kind of sort you want and it's worth calling it out in javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to both aggregators, explaining that they use the bytes natural order

name = "field",
type = { "boolean", "double", "integer", "long", "date", "ip", "keyword", "text", "long", "version" }
) Expression field
) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth adding a NOTE to the docs that the MAX of a keyword and text field is the highest value, sorted by the utf-8 representation? That's the behavior we're committing to here, and I could see a world where folks will need collations. But the utf-8 one is a useful default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, other functions I checked use this same logic (the BytesRef compareTo), and behave the same as the SORT command.
So, I'm not sure adding this here would make much sense. If we want to explain it, I wonder if it would be better at ESQL level, instead of at function level

@ivancea ivancea requested a review from nik9000 August 12, 2024 13:41
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Though I'd modify the description to change the note about using raw arrays.

}

final boolean hasValue(int groupId) {
boolean hasValue(int groupId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we wouldn't want to extend this and get seen if we're overriding this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh God. Fixed! Added instead a single boolean state, just to know whether using a vector or a block in toBlock()

from apps
| eval x = version
| where id > 2
| stats max(version), a = max(version), b = max(x), c = max(case(name == "iiiii", "100.0.0"::version, version));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ivancea ivancea merged commit e3f378e into elastic:main Aug 20, 2024
@ivancea ivancea deleted the max-min-aggs-bytesref branch August 20, 2024 13:25
lkts pushed a commit to lkts/elasticsearch that referenced this pull request Aug 20, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of elastic#110346
drewdaemon added a commit to elastic/kibana that referenced this pull request Aug 22, 2024
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of elastic#110346
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of elastic#110346
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants