Skip to content

Add option to add limit to number of groups for khyperloglog_agg function#21510

Merged
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:fix_khll_memory
Dec 18, 2023
Merged

Add option to add limit to number of groups for khyperloglog_agg function#21510
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:fix_khll_memory

Conversation

@feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Dec 11, 2023

Description

Add an option to limit the number of groups for the khyperloglog_agg. This limit is set in feature config, with property khyperloglog-agg-group-limit.

Motivation and Context

This is to solve the same problem as mentioned in #9553 for the khyperloglog_agg function.
In this aggregation function, it will create multiple HyperLogLog objects for the same group. It can lead to cross heap memory region reference, which increase the native memory usage as described in the issue.
In this PR, I add a limit on the number of groups this aggregation can have. This limit defaults to 0 which means no limit, i.e. the current behavior.

Impact

Have a way to limit native memory usage for this aggregation.

Test Plan

Existing unit tests
Also run verifier suites to make sure that the aggregation function still returns the same results

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add an feature config property `khyperloglog-agg-group-limit` to set the maximum number of groups `khyperloglog_agg` can have. It will fail the query when the limit is exceeded. The default is 0 which means no limit.

@feilong-liu feilong-liu requested a review from a team as a code owner December 11, 2023 19:52
@feilong-liu feilong-liu marked this pull request as draft December 11, 2023 19:52
@feilong-liu feilong-liu force-pushed the fix_khll_memory branch 2 times, most recently from 1659514 to 3de2d1d Compare December 11, 2023 20:36
@feilong-liu feilong-liu marked this pull request as ready for review December 11, 2023 20:37
@mlyublena mlyublena requested a review from jonhehir December 11, 2023 22:27
Copy link
Contributor

@mlyublena mlyublena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@agrawaldevesh agrawaldevesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious how would we generalize this to cover say other expensive agg states down the line with minimal code ? Not something we ought to fix now, but what are your thoughts ? Should these checks go elsewhere then ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more about these changes, is this some new style of writing agg functions ?

Copy link
Contributor Author

@feilong-liu feilong-liu Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an easier way to write aggregation function, corresponding to the ParametricAggregation class. The input, combine and output function together with the annotations can be used to generate the aggregation function. However, it's also not flexible. As we need to specify the limit on the number of groups, I choose to refactor the code to override functions directly. Both ways are widely used in our codebase.

@feilong-liu
Copy link
Contributor Author

I am curious how would we generalize this to cover say other expensive agg states down the line with minimal code ? Not something we ought to fix now, but what are your thoughts ? Should these checks go elsewhere then ?

All aggregations are compiled by AccumulatorCompiler to generate bytecode, we may need to change here if want to add support for all functions in one place.

@tdcmeehan
Copy link
Contributor

I am curious how would we generalize this to cover say other expensive agg states down the line with minimal code ? Not something we ought to fix now, but what are your thoughts ? Should these checks go elsewhere then ?

In the past, we've rewritten aggregation functions to "flatten" their internal state. For example, for array_agg, we put all the values in one giant array for all groups, and do some record keeping to track which sections of the array go to which group. This is expensive and not generalizable, though.

One potential option suggested by @ZacBlanco is to modify the code generation to specify a max grouping and make that configurable per-function, so that we don't need to do this type of code change for each aggregation function that we discover is expensive. It is unfortunate in this PR that we revert from using the aggregation function framework, as it makes the code harder to read.

@feilong-liu
Copy link
Contributor Author

I am curious how would we generalize this to cover say other expensive agg states down the line with minimal code ? Not something we ought to fix now, but what are your thoughts ? Should these checks go elsewhere then ?

In the past, we've rewritten aggregation functions to "flatten" their internal state. For example, for array_agg, we put all the values in one giant array for all groups, and do some record keeping to track which sections of the array go to which group. This is expensive and not generalizable, though.

One potential option suggested by @ZacBlanco is to modify the code generation to specify a max grouping and make that configurable per-function, so that we don't need to do this type of code change for each aggregation function that we discover is expensive. It is unfortunate in this PR that we revert from using the aggregation function framework, as it makes the code harder to read.

The ideal case is to fix the algorithm if possible, like the arrray_agg, and apply limit if cannot avoid. Fortunately, this will not be a problem for Prestissimo. Change accumulator interface is a more general solution, but with more complex changes. Since this is the only aggregation we found expensive and cannot be simplified with algorithm change, I incline to keep the changes minimal and generalize it if we find more cases

Copy link
Contributor

@ajaygeorge ajaygeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…tion

khyperloglog_agg create many hyperloglog objects during aggregation, and this
can lead to native memory usage for GC algrithm due to cross reference between
heap memory regions. This PR adds an option to add limit on the number of groups
for this aggregation function. Defaults to 0 which means no limit.
@feilong-liu feilong-liu merged commit b113d6e into prestodb:master Dec 18, 2023
@feilong-liu feilong-liu deleted the fix_khll_memory branch December 18, 2023 22:03
@wanglinsong wanglinsong mentioned this pull request Feb 12, 2024
64 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants