Add option to add limit to number of groups for khyperloglog_agg function #21510
feilong-liu merged 1 commit into prestodb:master
presto-main/src/main/java/com/facebook/presto/type/khyperloglog/KHyperLogLogStateFactory.java
agrawaldevesh left a comment
I am curious how we would generalize this to cover other expensive aggregation states down the line with minimal code. Not something we ought to fix now, but what are your thoughts? Should these checks go elsewhere?
Can you explain more about these changes? Is this a new style of writing aggregation functions?
This is an easier way to write an aggregation function, corresponding to the ParametricAggregation class. The input, combine, and output functions, together with the annotations, are used to generate the aggregation function. However, it is not flexible. Since we need to specify the limit on the number of groups, I chose to refactor the code to override the functions directly. Both approaches are widely used in our codebase.
All aggregations are compiled by AccumulatorCompiler to generate bytecode; we may need to make changes there if we want to add support for all functions in one place.
In the past, we've rewritten aggregation functions to "flatten" their internal state. For example, for

One potential option, suggested by @ZacBlanco, is to modify the code generation to specify a maximum number of groups and make that configurable per function, so that we don't need this type of code change for each aggregation function that we discover is expensive. It is unfortunate that this PR moves away from the aggregation function framework, as it makes the code harder to read.
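To make the per-function idea above concrete, here is a rough sketch of a registry that maps an aggregation function name to a maximum group count. It is not Presto's code generation and does not touch AccumulatorCompiler; the class name `PerFunctionGroupLimits`, its methods, and the exception type are all hypothetical. The only convention borrowed from this PR is that a limit of 0 means no limit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in the Presto codebase.
final class PerFunctionGroupLimits {
    // Function name -> maximum number of groups; a missing entry or 0 means "no limit".
    private final Map<String, Long> limits;

    PerFunctionGroupLimits(Map<String, Long> limits) {
        this.limits = new HashMap<>(limits);
    }

    long limitFor(String functionName) {
        return limits.getOrDefault(functionName, 0L);
    }

    // A generated accumulator could call this whenever its group count grows.
    void checkGroupCount(String functionName, long groupCount) {
        long limit = limitFor(functionName);
        if (limit > 0 && groupCount > limit) {
            throw new IllegalStateException(
                    functionName + " has " + groupCount + " groups, limit is " + limit);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> configured = new HashMap<>();
        configured.put("khyperloglog_agg", 1000L); // illustrative value only
        PerFunctionGroupLimits registry = new PerFunctionGroupLimits(configured);

        registry.checkGroupCount("khyperloglog_agg", 1000); // at the limit: allowed
        registry.checkGroupCount("some_other_agg", 5_000_000); // no entry: unlimited
        try {
            registry.checkGroupCount("khyperloglog_agg", 1001);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a lookup like this, the check would live in one place and each expensive function would only need a configuration entry, which is the trade-off the comment above is weighing against per-function refactoring.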
The ideal case is to fix the algorithm if possible, like the
…tion khyperloglog_agg creates many HyperLogLog objects during aggregation, and this can increase native memory usage by the GC algorithm due to cross-references between heap memory regions. This PR adds an option to limit the number of groups for this aggregation function. The limit defaults to 0, which means no limit.
Description
Add an option to limit the number of groups for khyperloglog_agg. This limit is set in the feature config, with the property khyperloglog-agg-group-limit.
Motivation and Context
This is to solve the same problem as mentioned in #9553 for the khyperloglog_agg function.
This aggregation function creates multiple HyperLogLog objects for the same group. That can lead to references across heap memory regions, which increase native memory usage as described in the issue.
In this PR, I add a limit on the number of groups this aggregation can have. The limit defaults to 0, which means no limit, i.e. the current behavior.
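The limit semantics described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `GroupLimitedState`, the exception type, and the error message are assumptions, and the real change lives in KHyperLogLogStateFactory. The method name `ensureCapacity` mirrors how a grouped state is typically asked to grow, but its use here is also an assumption.

```java
// Hypothetical sketch: names are illustrative and not taken from the PR.
final class GroupLimitSketch {
    /** Tracks how many groups a grouped state holds and rejects growth past the limit. */
    static final class GroupLimitedState {
        private final long groupLimit; // 0 means "no limit" (the default behavior)
        private long groupCount;

        GroupLimitedState(long groupLimit) {
            if (groupLimit < 0) {
                throw new IllegalArgumentException("groupLimit must be >= 0");
            }
            this.groupLimit = groupLimit;
        }

        /** Called when the state must be able to hold `size` groups. */
        void ensureCapacity(long size) {
            if (groupLimit > 0 && size > groupLimit) {
                throw new IllegalStateException(
                        "group count " + size + " exceeds limit " + groupLimit);
            }
            groupCount = Math.max(groupCount, size);
        }

        long getGroupCount() {
            return groupCount;
        }
    }

    public static void main(String[] args) {
        // Limit 0: any number of groups is accepted.
        GroupLimitedState unlimited = new GroupLimitedState(0);
        unlimited.ensureCapacity(1_000_000);

        // Limit 2: a third group is rejected.
        GroupLimitedState limited = new GroupLimitedState(2);
        limited.ensureCapacity(2);
        try {
            limited.ensureCapacity(3);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing the query once the limit is exceeded is one possible policy; the source does not say how the PR reacts when the limit is hit.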
Impact
Provides a way to limit native memory usage for this aggregation.
Test Plan
Existing unit tests
Also run verifier suites to make sure that the aggregation function still returns the same results
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.