Before 7.9.0, many of our more complex aggregations made a simplifying assumption that required them to duplicate many data structures once per bucket that contained them. The most expensive of these weighed in at a couple of kilobytes each. Consider an aggregation like:
POST _search
{
"aggs": {
"date": {
"date_histogram": { "field": "timestamp", "calendar_interval": "day" },
"aggs": {
"ips": {
"terms": { "field": "ip" }
}
}
}
}
}
When run over three years of data, this request spends a couple of megabytes just on bucket accounting. More deeply nested aggregations spend even more on this overhead. 7.9.0 removes all of it, which should allow us to run better in lower memory environments.
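As a rough back-of-the-envelope check of the figure above, the sketch below multiplies the number of daily buckets by an assumed per-bucket cost. The 2 KiB value is only a stand-in for "a couple of kilobytes", and the function name is hypothetical; the real cost varied by aggregation.

```python
# Rough estimate of per-bucket accounting overhead before 7.9.0.
# ASSUMPTION: 2 KiB per bucket stands in for "a couple of kilobytes".
BYTES_PER_BUCKET = 2 * 1024
DAYS_PER_YEAR = 365


def bucket_overhead_bytes(years: int, bytes_per_bucket: int = BYTES_PER_BUCKET) -> int:
    """Overhead when every daily bucket duplicates its sub-aggregator state."""
    daily_buckets = years * DAYS_PER_YEAR  # one date_histogram bucket per day
    return daily_buckets * bytes_per_bucket


overhead = bucket_overhead_bytes(3)
print(f"{overhead / (1024 * 1024):.2f} MiB")  # prints "2.14 MiB"
```

Under these assumptions, 1,095 daily buckets cost a little over 2 MiB before a single term is counted, and each extra level of nesting multiplies the bucket count again.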
As a bonus we wrote quite a few Rally benchmarks for aggs to make sure that these changes didn't slow down aggregations, so we can now think much more scientifically about aggregation performance. The benchmarks suggest that these changes don't affect simple aggregation trees and speed up complex aggregation trees of similar or greater depth than the example above. Your actual performance changes will vary, but this should help! 🤞
EDIT:
Everything above the EDIT mark was added when I tagged this release highlight so it could be more easily understood in context.
#55873 removed the "multi-bucket wrapper" from the numeric terms aggregator and showed that we can get a pretty substantial performance improvement in some common aggregation requests. This will track work to remove the wrapper for other aggregations because:
- I expect we can get a similar or better performance improvement for each one.
- The wrapper makes it very difficult to reason about aggregations.
- This will give us a good excuse to add rally tracks for these aggregations.
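To illustrate what removing the wrapper means, here is a conceptual sketch in Python. It is not Elasticsearch's Java implementation: `TermsCollector`, both functions, and the sample documents are hypothetical, and keying a single structure by (owning bucket, term) is an assumption about the general approach rather than a description of the actual code.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents as (day, ip) pairs.
DOCS = [
    ("2020-01-01", "10.0.0.1"),
    ("2020-01-01", "10.0.0.2"),
    ("2020-01-02", "10.0.0.1"),
    ("2020-01-02", "10.0.0.1"),
]


class TermsCollector:
    """Stand-in for an aggregator that only understands one owning bucket."""

    def __init__(self):
        self.counts = Counter()

    def collect(self, term):
        self.counts[term] += 1


def collect_with_wrapper(docs):
    """Wrapper style: one full TermsCollector instance per parent bucket."""
    per_bucket = defaultdict(TermsCollector)
    for day, ip in docs:
        per_bucket[day].collect(ip)
    result = {day: dict(c.counts) for day, c in per_bucket.items()}
    return result, len(per_bucket)  # instance count grows with buckets


def collect_without_wrapper(docs):
    """Wrapper-free style: one structure keyed by (owning bucket, term)."""
    counts = Counter()
    for day, ip in docs:
        counts[(day, ip)] += 1
    result = defaultdict(dict)
    for (day, ip), n in counts.items():
        result[day][ip] = n
    return dict(result), 1  # always a single collector


wrapped, wrapped_instances = collect_with_wrapper(DOCS)
flat, flat_instances = collect_without_wrapper(DOCS)
print(wrapped == flat, wrapped_instances, flat_instances)
```

Both styles produce the same buckets. The difference is that the wrapper style pays the fixed cost of a whole collector for every parent bucket, while the wrapper-free style pays it once.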
- string terms (Fix casting of scaled_float in sorts #57207 + Fold some of sig_terms into terms #57361 + Merge remaining sig_terms into terms #57397 + Fix an optimization in terms agg #57438 + Save memory when string terms are not on top #57758)
- significant_terms (Save memory on numeric sig terms when not top #56789 + Fix casting of scaled_float in sorts #57207 + Fold some of sig_terms into terms #57361 + Merge remaining sig_terms into terms #57397 + Fix an optimization in terms agg #57438 + Save memory when string terms are not on top #57758)
- rare_terms (Save memory when rare_terms is not on top #57948)
- date_histogram (Save memory when date_histogram is not on top #56921)
- auto_date_histogram (Save memory when auto_date_histogram is not on top #57304)
- histogram (Save memory when histogram agg is not on top #57277)
- parent (Make parent and child aggregator more obvious #57490 + Save memory when parent and child are not on top #57892)
- child (Make parent and child aggregator more obvious #57490 + Save memory when parent and child are not on top #57892)
- geohash_grid (Same memory when geo aggregations are not on top #57483)
- geotile_grid (Same memory when geo aggregations are not on top #57483)
- scripted_metric (Remove deprecated wrapper from scripted_metric #57627)
- significant_text (Give significance lookups their own home #57903 + Save memory when significant_text is not on top #58145)
After this is all done we can:
- Remove significant_terms's "funny" reference back to its factory for caching. We won't need it because there'll only ever be one aggregator, so it can do the caching itself. (Give significance lookups their own home #57903)
- Look into non-BigArrays backed memory usage in aggs. This is more important now that we don't get the 5k "artificial" value added to the breaker per bucket. (Moved to Make sure all significant memory usage in aggs are tracked in BigArrays #59892)
- Replace descendsFromBucketAggregator(parent) with collectsFromSingleBucket. (Remove useless aggregation helper #58571)
- Look into replacing "lego-ed" data structures with purpose built ones. (7.10: Allocate slightly less per bucket #59740)