Add option to add limit to number of groups for khyperloglog_agg function by feilong-liu · Pull Request #21510 · prestodb/presto

feilong-liu · 2023-12-11T19:52:39Z

Description

Add an option to limit the number of groups for the `khyperloglog_agg` function. The limit is set in the feature config through the property `khyperloglog-agg-group-limit`.

Motivation and Context

This solves the same problem as described in #9553, this time for the khyperloglog_agg function.
This aggregation function creates multiple HyperLogLog objects for the same group. That can lead to cross-region references in heap memory, which increase native memory usage as described in the issue.
In this PR, I add a limit on the number of groups this aggregation can have. The limit defaults to 0, which means no limit, i.e. the current behavior.
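
As a rough illustration of the behavior described above, the following sketch shows a grouped state that rejects growth past a configured number of groups and treats 0 as "no limit". The class `GroupLimitedState`, its methods, and the use of `IllegalStateException` are hypothetical stand-ins chosen for this example; they are not the code in this PR, which lives in `KHyperLogLogStateFactory` and reports failures through Presto's own error handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a grouped aggregation state; not Presto's actual class.
final class GroupLimitedState {
    // 0 means "no limit", mirroring the default of khyperloglog-agg-group-limit.
    private final long groupLimit;
    // One slot per group; a real state would hold one sketch object per slot.
    private final List<Object> sketches = new ArrayList<>();

    GroupLimitedState(long groupLimit) {
        if (groupLimit < 0) {
            throw new IllegalArgumentException("groupLimit must be >= 0");
        }
        this.groupLimit = groupLimit;
    }

    // Called when the aggregation has seen groupCount groups so far.
    void ensureCapacity(long groupCount) {
        if (groupLimit > 0 && groupCount > groupLimit) {
            // Stand-in for failing the query when the limit is exceeded.
            throw new IllegalStateException(
                    "Number of groups " + groupCount + " exceeds limit " + groupLimit);
        }
        while (sketches.size() < groupCount) {
            sketches.add(null);
        }
    }

    int capacity() {
        return sketches.size();
    }
}
```

With a limit of 0 the check is skipped entirely, so existing queries behave as before; with a positive limit the first call that asks for more groups than allowed fails.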

Impact

Provides a way to limit native memory usage for this aggregation.

Test Plan

Existing unit tests
Also run verifier suites to make sure that the aggregation function still returns the same results

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a feature config property `khyperloglog-agg-group-limit` to set the maximum number of groups `khyperloglog_agg` can have. The query fails when the limit is exceeded. The default is 0, which means no limit.

mlyublena

LGTM

presto-main/src/main/java/com/facebook/presto/type/khyperloglog/KHyperLogLogStateFactory.java

agrawaldevesh

I am curious how we would generalize this to cover, say, other expensive agg states down the line with minimal code. Not something we ought to fix now, but what are your thoughts? Should these checks go elsewhere then?

agrawaldevesh · 2023-12-15T02:06:11Z

...ain/src/main/java/com/facebook/presto/type/khyperloglog/KHyperLogLogAggregationFunction.java

Can you explain more about these changes? Is this some new style of writing agg functions?

The annotation-based style is an easier way to write an aggregation function, and it corresponds to the ParametricAggregation class. The input, combine, and output functions, together with their annotations, are used to generate the aggregation function. However, it is not flexible. Because we need to specify the limit on the number of groups, I chose to refactor the code to override the functions directly. Both styles are widely used in our codebase.

feilong-liu · 2023-12-15T05:50:41Z

> I am curious how we would generalize this to cover, say, other expensive agg states down the line with minimal code. Not something we ought to fix now, but what are your thoughts? Should these checks go elsewhere then?

All aggregations are compiled by AccumulatorCompiler to generate bytecode, so we may need to make the change there if we want to add support for all functions in one place.

tdcmeehan · 2023-12-15T14:55:42Z

> I am curious how we would generalize this to cover, say, other expensive agg states down the line with minimal code. Not something we ought to fix now, but what are your thoughts? Should these checks go elsewhere then?

In the past, we've rewritten aggregation functions to "flatten" their internal state. For example, for array_agg, we put all the values in one giant array for all groups, and do some record keeping to track which sections of the array go to which group. This is expensive and not generalizable, though.
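
To make the "flattening" idea concrete, here is a minimal sketch under stated assumptions: all values for all groups go into one shared list, and a few index arrays track which entries belong to which group. The class `FlattenedArrayAggState` and its methods are hypothetical and written only for this illustration; this is not Presto's actual array_agg implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical flattened state: one shared value store instead of one object per group.
final class FlattenedArrayAggState {
    private static final int NONE = -1;

    // Values of every group, in arrival order.
    private final List<Long> values = new ArrayList<>();
    // next.get(i) is the index of the next value in the same group, or NONE.
    private final List<Integer> next = new ArrayList<>();
    // head[g] and tail[g] are the first and last value indices for group g.
    private int[] head = new int[0];
    private int[] tail = new int[0];

    void ensureCapacity(int groupCount) {
        if (groupCount > head.length) {
            int old = head.length;
            head = Arrays.copyOf(head, groupCount);
            tail = Arrays.copyOf(tail, groupCount);
            Arrays.fill(head, old, groupCount, NONE);
            Arrays.fill(tail, old, groupCount, NONE);
        }
    }

    void add(int groupId, long value) {
        ensureCapacity(groupId + 1);
        int index = values.size();
        values.add(value);
        next.add(NONE);
        if (head[groupId] == NONE) {
            head[groupId] = index;
        } else {
            // Link the previous last entry of this group to the new one.
            next.set(tail[groupId], index);
        }
        tail[groupId] = index;
    }

    // Walks the per-group chain to rebuild that group's array.
    List<Long> valuesFor(int groupId) {
        List<Long> result = new ArrayList<>();
        if (groupId >= head.length) {
            return result;
        }
        for (int i = head[groupId]; i != NONE; i = next.get(i)) {
            result.add(values.get(i));
        }
        return result;
    }
}
```

The record keeping (head, tail, and next indices) is the extra cost mentioned above, and it has to be redesigned for each kind of state, which is why the approach does not generalize easily.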

One potential option suggested by @ZacBlanco is to modify the code generation to specify a max grouping and make that configurable per-function, so that we don't need to do this type of code change for each aggregation function that we discover is expensive. It is unfortunate in this PR that we revert from using the aggregation function framework, as it makes the code harder to read.
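
One way to picture the per-function option is a single shared check that looks up a limit by function name, so that individual aggregations do not need their own code changes. The sketch below is purely illustrative: `PerFunctionGroupLimits`, its methods, and the exception type are assumptions made for this example and do not exist in Presto, and it does not show how generated accumulator code would call such a check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of per-function group limits; 0 or absent means "no limit".
final class PerFunctionGroupLimits {
    private final Map<String, Long> limits = new HashMap<>();

    void setLimit(String functionName, long limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        limits.put(functionName, limit);
    }

    // A shared check that generated accumulator code could call in one place.
    void check(String functionName, long groupCount) {
        long limit = limits.getOrDefault(functionName, 0L);
        if (limit > 0 && groupCount > limit) {
            throw new IllegalStateException(
                    functionName + ": number of groups " + groupCount + " exceeds limit " + limit);
        }
    }
}
```

Functions without a configured limit pass the check unconditionally, which keeps the default behavior unchanged for every aggregation that has not been flagged as expensive.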

feilong-liu · 2023-12-17T06:50:38Z

> I am curious how we would generalize this to cover, say, other expensive agg states down the line with minimal code. Not something we ought to fix now, but what are your thoughts? Should these checks go elsewhere then?
>
> In the past, we've rewritten aggregation functions to "flatten" their internal state. For example, for array_agg, we put all the values in one giant array for all groups, and do some record keeping to track which sections of the array go to which group. This is expensive and not generalizable, though.
>
> One potential option suggested by @ZacBlanco is to modify the code generation to specify a max grouping and make that configurable per-function, so that we don't need to do this type of code change for each aggregation function that we discover is expensive. It is unfortunate in this PR that we revert from using the aggregation function framework, as it makes the code harder to read.

The ideal case is to fix the algorithm where possible, as was done for array_agg, and to apply a limit where that cannot be avoided. Fortunately, this will not be a problem for Prestissimo. Changing the accumulator interface is a more general solution, but it requires more complex changes. Since this is the only aggregation we have found to be expensive that cannot be simplified with an algorithm change, I am inclined to keep the changes minimal and generalize if we find more cases.

ajaygeorge

LGTM

…tion khyperloglog_agg creates many HyperLogLog objects during aggregation, and this can lead to native memory usage by the GC algorithm due to cross references between heap memory regions. This PR adds an option to limit the number of groups for this aggregation function. Defaults to 0, which means no limit.

feilong-liu requested a review from a team as a code owner December 11, 2023 19:52

feilong-liu requested a review from presto-oss December 11, 2023 19:52

feilong-liu marked this pull request as draft December 11, 2023 19:52

feilong-liu force-pushed the fix_khll_memory branch 2 times, most recently from 1659514 to 3de2d1d Compare December 11, 2023 20:36

feilong-liu marked this pull request as ready for review December 11, 2023 20:37

feilong-liu requested review from MnO2, ajaygeorge and mlyublena December 11, 2023 20:38

mlyublena requested a review from jonhehir December 11, 2023 22:27

mlyublena approved these changes Dec 14, 2023

kaikalur approved these changes Dec 14, 2023

agrawaldevesh reviewed Dec 15, 2023

feilong-liu force-pushed the fix_khll_memory branch from 3de2d1d to 15f3a6f Compare December 15, 2023 05:42

ajaygeorge approved these changes Dec 18, 2023

feilong-liu force-pushed the fix_khll_memory branch from 15f3a6f to bbb0af8 Compare December 18, 2023 21:09

feilong-liu merged commit b113d6e into prestodb:master Dec 18, 2023

feilong-liu deleted the fix_khll_memory branch December 18, 2023 22:03

feilong-liu mentioned this pull request Jan 30, 2024

Add limit to merge function for KHyperLogLog #21824

Merged

wanglinsong mentioned this pull request Feb 12, 2024

Add release notes for 0.286 #21906

Merged
