Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented May 14, 2020

This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.

This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in elastic#55873.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 14, 2020
@nik9000
Copy link
Member Author

nik9000 commented May 15, 2020

Interesting! I ran some performance tests of this change and they don't look good. Something is up!

before: https://gist.github.com/nik9000/28dc2a9e3309a33ea39d1c169462535e#file-gistfile1-txt-L110
after: https://gist.github.com/nik9000/0b95042294ee222dba8de6d1f55accb2#file-gistfile1-txt-L110

This is pretty odd because when I was running some smaller tests locally everything looked faster.

@nik9000 nik9000 marked this pull request as draft May 15, 2020 12:17
@nik9000
Copy link
Member Author

nik9000 commented May 15, 2020

I've been testing this in a faster loop with this. After running the last step a few dozen times to get the jvm happy I got this before and this after. Weird.

@nik9000
Copy link
Member Author

nik9000 commented May 15, 2020

OK! I tracked it down. And I found a nice little improvement I could make.

Before: https://gist.github.com/nik9000/28dc2a9e3309a33ea39d1c169462535e#file-gistfile1-txt-L110
After: https://gist.github.com/nik9000/664d3516db725386e395327c3caa8188#file-gistfile1-txt-L100

About 66% performance improvement when the numeric significant_terms agg is under a non-selective agg. In my case it is a range agg. I expect it'll be pretty similar if it is under a date_histogram.

@nik9000 nik9000 marked this pull request as ready for review May 16, 2020 19:07
@nik9000
Copy link
Member Author

nik9000 commented May 16, 2020

@not-napoleon, I'm fairly comfortable with this now! I'll run some performance tests on the latest code, but I think it is an improvement,

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, pretty ok with it but allowing buckets to become mutable makes me a bit nervous. It was previously just contained to Sigterms and still not great, but now it'll "leak" to the whole terms family. We've had some hairy bugs in the past due to buckets being mutable (accidentally get's changed during reduction, or not understand that the bucket isn't "final" until later, etc).

It cleans up a lot of not-so-nice bits, so I don't want to throw the baby out with the bath water, but did want to mention it in case you had an idea. Not sure it'll be possible to "fix" since sigterms needs that functionality as it's currently designed.


abstract Supplier<B> emptyBucketBuilder(long owningBucketOrd);

abstract void updateBucket(B spare, BucketOrdsEnum ordsEnum, long docCount) throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we slap a TODO/warning on this? I put a warning on the old updateScore() method because it violates the bucket immutability that pretty much all other aggs have, and can lead to some really subtle bugs if you're not careful. And the new setup means it could affect any of the terms aggs, not just sigterms, so the surface area of accidental misuse is a lot larger.

Old warning: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/InternalSignificantTerms.java#L107

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the buckets for all of the terms family are mutable. They all use that PriorityQueue/spare/updateTop stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right I think I confused myself, mixing up the reason for that warning. I think it might have come about from Rollup or a pipeline issue because the score is only updated/added at reduction and not ctor, so anyone artificially creating/recreating a bucket will get bad data.

or something, but yeah you're right, term family buckets are already mutable 👍

return result;
}

abstract String describe();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some light javadocs to a few of these abstract methods? Still working through them so perhaps they are self-explanatory in code (or by method name). buildResult / buildBuckets buildSubAggs / buildResult / buildEmptyResult, etc

* Lookup frequencies for terms.
*/
private static class LookupBackgroundFrequencies implements BackgroundFrequencies {
// TODO a reference to the factory is weird - probably should be reference to what we need from it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

boolean collectsFromSingleBucket,
Map<String, Object> metadata) throws IOException {

assert collectsFromSingleBucket;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just throw an exception here? I'm assuming relatively bad things happen if this condition is violated and the user will get garbage results (or a different exception)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Garbage, I think. It enforced by the code below and it mirrors what we do with terms. And I plan to drop it before release anyway, but I can certainly switch it. Because plans don't always work out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, cool. If it's enforced already no big deal 👍

@nik9000 nik9000 merged commit 9aaab6e into elastic:master May 27, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 27, 2020
)

This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in elastic#55873.
nik9000 added a commit that referenced this pull request May 27, 2020
…57221)

This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 27, 2020
nik9000 added a commit that referenced this pull request May 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants