-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Save memory on numeric sig terms when not top #56789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This saves memory when running numeric significant terms which are not at the top level by merging its collection into numeric terms and relying on the optimization that we made in elastic#55873.
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
|
Interesting! I ran some performance tests of this change and they don't look good. Something is up! before: https://gist.github.com/nik9000/28dc2a9e3309a33ea39d1c169462535e#file-gistfile1-txt-L110 This is pretty odd because when I was running some smaller tests locally everything looked faster. |
|
OK! I tracked it down. And I found a nice little improvement I could make. Before: https://gist.github.com/nik9000/28dc2a9e3309a33ea39d1c169462535e#file-gistfile1-txt-L110 About 66% performance improvement when the numeric |
|
@not-napoleon, I'm fairly comfortable with this now! I'll run some performance tests on the latest code, but I think it is an improvement, |
polyfractal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, pretty ok with it but allowing buckets to become mutable makes me a bit nervous. It was previously just contained to Sigterms and still not great, but now it'll "leak" to the whole terms family. We've had some hairy bugs in the past due to buckets being mutable (accidentally get's changed during reduction, or not understand that the bucket isn't "final" until later, etc).
It cleans up a lot of not-so-nice bits, so I don't want to throw the baby out with the bath water, but did want to mention it in case you had an idea. Not sure it'll be possible to "fix" since sigterms needs that functionality as it's currently designed.
|
|
||
| abstract Supplier<B> emptyBucketBuilder(long owningBucketOrd); | ||
|
|
||
| abstract void updateBucket(B spare, BucketOrdsEnum ordsEnum, long docCount) throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we slap a TODO/warning on this? I put a warning on the old updateScore() method because it violates the bucket immutability that pretty much all other aggs have, and can lead to some really subtle bugs if you're not careful. And the new setup means it could affect any of the terms aggs, not just sigterms, so the surface area of accidental misuse is a lot larger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the buckets for all of the terms family are mutable. They all use that PriorityQueue/spare/updateTop stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right I think I confused myself, mixing up the reason for that warning. I think it might have come about from Rollup or a pipeline issue because the score is only updated/added at reduction and not ctor, so anyone artificially creating/recreating a bucket will get bad data.
or something, but yeah you're right, term family buckets are already mutable 👍
| return result; | ||
| } | ||
|
|
||
| abstract String describe(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some light javadocs to a few of these abstract methods? Still working through them so perhaps they are self-explanatory in code (or by method name). buildResult / buildBuckets buildSubAggs / buildResult / buildEmptyResult, etc
| * Lookup frequencies for terms. | ||
| */ | ||
| private static class LookupBackgroundFrequencies implements BackgroundFrequencies { | ||
| // TODO a reference to the factory is weird - probably should be reference to what we need from it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| boolean collectsFromSingleBucket, | ||
| Map<String, Object> metadata) throws IOException { | ||
|
|
||
| assert collectsFromSingleBucket; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just throw an exception here? I'm assuming relatively bad things happen if this condition is violated and the user will get garbage results (or a different exception)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Garbage, I think. It enforced by the code below and it mirrors what we do with terms. And I plan to drop it before release anyway, but I can certainly switch it. Because plans don't always work out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, cool. If it's enforced already no big deal 👍
) This saves memory when running numeric significant terms which are not at the top level by merging its collection into numeric terms and relying on the optimization that we made in elastic#55873.
This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.