feat: Add session properties for aggregation compaction by duxiao1212 · Pull Request #26874 · prestodb/presto

duxiao1212 · 2025-12-30T02:13:38Z

Summary: Per title

Differential Revision: D89909129

sourcery-ai · 2025-12-30T02:13:48Z

Reviewer's Guide

Adds native session and worker properties to control aggregation string compaction thresholds and ratios, wires them to Velox QueryConfig, and updates tests to validate the new mappings.

Class diagram for new aggregation compaction session properties

classDiagram
    class SessionProperties {
        +static const char* kUseVeloxGeospatialJoin
        +static const char* kAggregationCompactionBytesThreshold
        +static const char* kAggregationCompactionUnusedMemoryRatio
        +SessionProperties()
        +bool hasVeloxConfig(key)
        +std::string toVeloxConfig()
    }

    class QueryConfig {
        +static const char* kAggregationCompactionBytesThreshold
        +static const char* kAggregationCompactionUnusedMemoryRatio
        +uint64_t aggregationCompactionBytesThreshold()
        +double aggregationCompactionUnusedMemoryRatio()
    }

    SessionProperties ..> QueryConfig : uses

    class NativeWorkerSessionPropertyProvider {
        +static final String NATIVE_INDEX_LOOKUP_JOIN_SPLIT_OUTPUT
        +static final String NATIVE_UNNEST_SPLIT_OUTPUT
        +static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN
        +static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD
        +static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO
        -List~PropertyMetadata~ sessionProperties
        +NativeWorkerSessionPropertyProvider(FeaturesConfig featuresConfig)
    }

    class FeaturesConfig {
        +boolean isNativeExecutionEnabled()
    }

    NativeWorkerSessionPropertyProvider ..> FeaturesConfig : reads
    NativeWorkerSessionPropertyProvider ..> SessionProperties : aligns_property_names

File-Level Changes

Change: Introduce native session properties to control aggregation string compaction behavior in Presto CPP and wire them to Velox QueryConfig.
Details:
- Declare new session property keys for aggregation compaction byte threshold and unused memory ratio in the SessionProperties header
- Register the new aggregation compaction properties in the SessionProperties constructor, including descriptions, types, config keys, and default values from QueryConfig
- Ensure the new session properties are included in the mapping used to translate session properties to Velox config keys in tests
Files:
- `presto-native-execution/presto_cpp/main/SessionProperties.h`
- `presto-native-execution/presto_cpp/main/SessionProperties.cpp`
- `presto-native-execution/presto_cpp/main/tests/SessionPropertiesTest.cpp`

Change: Expose corresponding native worker session properties on the Java side so they can be set in Presto configurations.
Details:
- Add Java string constants for the new native aggregation compaction session property names
- Register long and double session properties for aggregation compaction threshold and unused memory ratio with appropriate defaults and descriptions in NativeWorkerSessionPropertyProvider
Files:
- `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java`
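
To make the session-property-to-Velox wiring described above concrete, here is a minimal sketch of a name mapping. It is not the actual `SessionProperties` class: the class name `SessionPropertyMapSketch` is hypothetical, and the Velox key strings are assumed for illustration (the real keys are the `QueryConfig` constants named in the class diagram).

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, simplified registry; not the real SessionProperties class.
// The Velox key strings on the right-hand side are ASSUMED for illustration.
class SessionPropertyMapSketch {
 public:
  SessionPropertyMapSketch() {
    toVelox_["native_aggregation_compaction_bytes_threshold"] =
        "aggregation_compaction_bytes_threshold";
    toVelox_["native_aggregation_compaction_unused_memory_ratio"] =
        "aggregation_compaction_unused_memory_ratio";
  }

  // True when the session property has a Velox config key registered.
  bool hasVeloxConfig(const std::string& name) const {
    return toVelox_.count(name) > 0;
  }

  // Returns the Velox key for a mapped property, or the name unchanged
  // when no mapping is registered.
  std::string toVeloxConfig(const std::string& name) const {
    auto it = toVelox_.find(name);
    return it == toVelox_.end() ? name : it->second;
  }

 private:
  std::unordered_map<std::string, std::string> toVelox_;
};
```

A test in the spirit of `SessionPropertiesTest.cpp` would then check that each new session property name resolves to its expected Velox key.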

xiaoxmeng

@duxiao1212 thanks!

presto-docs/src/main/sphinx/presto_cpp/properties-session.rst

sourcery-ai

Hey - I've found 5 issues, and left some high level feedback:

For NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO, consider adding explicit validation or clamping when reading the session property to enforce the documented [0, 1) range, so misconfigurations (e.g., negative or >=1 values) fail fast rather than silently producing unexpected behavior.
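
As a rough illustration of the fail-fast option, the sketch below parses a ratio value and rejects anything outside [0, 1). The function name `parseUnusedMemoryRatio` is hypothetical and is not part of Presto or Velox; where such a check would actually live in the property pipeline is left open.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Presto or Velox: parses a session
// property value and rejects anything outside the documented [0, 1) range.
double parseUnusedMemoryRatio(const std::string& value) {
  std::size_t consumed = 0;
  double ratio = 0.0;
  try {
    ratio = std::stod(value, &consumed);
  } catch (const std::exception&) {
    throw std::invalid_argument("could not parse ratio: " + value);
  }
  if (consumed != value.size()) {
    throw std::invalid_argument("ratio has trailing characters: " + value);
  }
  // NaN compares false against everything, so the range checks alone
  // would let it through; test for it explicitly.
  if (std::isnan(ratio) || ratio < 0.0 || ratio >= 1.0) {
    throw std::invalid_argument("ratio must be in [0, 1): " + value);
  }
  return ratio;
}
```

Rejecting rather than clamping keeps a misconfiguration visible to the user; clamping would be the quieter alternative the comment also mentions.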

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- For `NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO`, consider adding explicit validation or clamping when reading the session property to enforce the documented [0, 1) range, so misconfigurations (e.g., negative or >=1 values) fail fast rather than silently producing unexpected behavior.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:447-448` </location>
<code_context>
+                                "aggregate with StringView type during global aggregation.",
+                        0L,
+                        !nativeExecution),
+                doubleProperty(
+                        NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO,
+                        "Ratio of unused (evicted) bytes to total bytes that triggers compaction. " +
+                                "The value is in the range of [0, 1). NOTE: Currently only applies to approx_most_frequent " +
</code_context>

<issue_to_address>
**issue (bug_risk):** Consider enforcing the documented [0, 1) range for the unused-memory ratio session property.

`NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO` is documented as `[0, 1)`, but there’s no guard against negative values, `1.0`, or >1 being set. If the native code relies on this invariant, out-of-range values could cause incorrect or undefined behavior.

Please either enforce the range (reject with a clear error) or clamp the value into `[0, 1)` when reading the property so the implementation matches the documented contract.
</issue_to_address>

### Comment 2
<location> `presto-native-execution/presto_cpp/main/SessionProperties.cpp:590-599` </location>
<code_context>
+      QueryConfig::kAggregationCompactionBytesThreshold,
+      std::to_string(c.aggregationCompactionBytesThreshold()));
+
+  addSessionProperty(
+      kAggregationCompactionUnusedMemoryRatio,
+      "Ratio of unused (evicted) bytes to total bytes that triggers compaction. "
+      "The value is in the range of [0, 1). Default is 0.25. NOTE: Currently only applies "
+      "to approx_most_frequent aggregate with StringView type during global "
+      "aggregation. May extend to other aggregates.",
+      DOUBLE(),
+      false,
+      QueryConfig::kAggregationCompactionUnusedMemoryRatio,
+      std::to_string(c.aggregationCompactionUnusedMemoryRatio()));
 }

</code_context>

<issue_to_address>
**issue (bug_risk):** Align C++ session property behavior with the documented [0, 1) range for the unused-memory ratio.

The property docs specify `[0, 1)`, but the code doesn’t validate this. If callers pass values < 0, `>= 1`, or NaN from session config, aggregation/compaction behavior could diverge from what’s expected. Please enforce this range at application time (e.g., clamp or reject and fall back to the default) so runtime behavior matches the documented contract.
</issue_to_address>

### Comment 3
<location> `presto-docs/src/main/sphinx/presto_cpp/properties-session.rst:554-558` </location>
<code_context>
+* **Type:** ``bigint``
+* **Default value:** ``0``
+
+Native Execution only. Memory threshold in bytes for triggering string compaction
+during global aggregation. When total string storage exceeds this limit with high
+unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by
+default (0). Currently only applies to approx_most_frequent aggregate with
+StringView type during global aggregation.
</code_context>

<issue_to_address>
**nitpick (typo):** Consider rephrasing the 'with high unused memory ratio' clause for better grammar and clarity.

The phrase "When total string storage exceeds this limit with high unused memory ratio" is a bit awkward. Consider "When total string storage exceeds this limit and the unused memory ratio is high" or "...with a high unused memory ratio" for clearer grammar.

```suggestion
Native Execution only. Memory threshold in bytes for triggering string compaction
during global aggregation. When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings.
Disabled by default (0). Currently only applies to approx_most_frequent aggregate
with StringView type during global aggregation.
```
</issue_to_address>

### Comment 4
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:91` </location>
<code_context>
     public static final String NATIVE_INDEX_LOOKUP_JOIN_SPLIT_OUTPUT = "native_index_lookup_join_split_output";
     public static final String NATIVE_UNNEST_SPLIT_OUTPUT = "native_unnest_split_output";
     public static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN = "native_use_velox_geospatial_join";
+    public static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD = "native_aggregation_compaction_bytes_threshold";
+    public static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO = "native_aggregation_compaction_unused_memory_ratio";

</code_context>

<issue_to_address>
**issue (review_instructions):** This new session property is added here but there is no corresponding documentation change in the docs files shown in the diff.

The instruction requires adding documentation for any new session property defined in this file. I don’t see a corresponding addition in `presto-docs/src/main/sphinx/presto_cpp/properties-session.rst` (or other docs) in this diff. Please add or update the appropriate documentation section to describe `native_aggregation_compaction_bytes_threshold` and its behavior.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/NativeWorkerSessionPropertyProvider.java`

**Instructions:**
Make sure that accompanying documentation is added for any session property which is added to this file.

</details>
</issue_to_address>

### Comment 5
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:92` </location>
<code_context>
     public static final String NATIVE_UNNEST_SPLIT_OUTPUT = "native_unnest_split_output";
     public static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN = "native_use_velox_geospatial_join";
+    public static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD = "native_aggregation_compaction_bytes_threshold";
+    public static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO = "native_aggregation_compaction_unused_memory_ratio";

     private final List<PropertyMetadata<?>> sessionProperties;
</code_context>

<issue_to_address>
**issue (review_instructions):** This new session property is also missing corresponding documentation changes in the docs diff.

Similarly, `native_aggregation_compaction_unused_memory_ratio` is introduced here without an accompanying documentation update in the shown docs files. To follow the review instruction, please add or extend the relevant section in the session properties documentation to cover this setting and how it should be used.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/NativeWorkerSessionPropertyProvider.java`

**Instructions:**
Make sure that accompanying documentation is added for any session property which is added to this file.

</details>
</issue_to_address>

sourcery-ai · 2025-12-30T02:21:36Z

presto-docs/src/main/sphinx/presto_cpp/properties-session.rst

+Native Execution only. Memory threshold in bytes for triggering string compaction
+during global aggregation. When total string storage exceeds this limit with high
+unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by
+default (0). Currently only applies to approx_most_frequent aggregate with
+StringView type during global aggregation.


nitpick (typo): Consider rephrasing the 'with high unused memory ratio' clause for better grammar and clarity.

The phrase "When total string storage exceeds this limit with high unused memory ratio" is a bit awkward. Consider "When total string storage exceeds this limit and the unused memory ratio is high" or "...with a high unused memory ratio" for clearer grammar.

Suggested change

Original:
Native Execution only. Memory threshold in bytes for triggering string compaction
during global aggregation. When total string storage exceeds this limit with high
unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by
default (0). Currently only applies to approx_most_frequent aggregate with
StringView type during global aggregation.

Suggested:
Native Execution only. Memory threshold in bytes for triggering string compaction
during global aggregation. When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings.
Disabled by default (0). Currently only applies to approx_most_frequent aggregate
with StringView type during global aggregation.

Summary: This diff adds two new session properties to configure string compaction for the approx_most_frequent aggregate function during global aggregation:

1. native_aggregation_compaction_bytes_threshold (BIGINT, default: 0)
   Memory threshold in bytes for triggering string compaction. When total string storage exceeds this limit with high unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by default (0).
2. native_aggregation_compaction_unused_memory_ratio (DOUBLE, default: 0.25)
   Ratio of unused (evicted) bytes to total bytes that triggers compaction. Value is between 0.0 and 1.0.

Reviewed By: xiaoxmeng
Differential Revision: D89909129

aditi-pandit

Thanks @duxiao1212 for this code.

Have a bunch of comments about the documentation added for these properties.

@amitkdutta

presto-docs/src/main/sphinx/presto_cpp/properties-session.rst

aditi-pandit · 2026-01-02T03:52:07Z

presto-docs/src/main/sphinx/presto_cpp/properties-session.rst

+* **Type:** ``bigint``
+* **Default value:** ``0``
+
+Native Execution only. Memory threshold in bytes for triggering string compaction


Do you envision compaction applying to non-string data types? If this is only for strings, then we could clarify the naming to be specific about string compaction.

Also, it's hard to follow from an end-user perspective:
"When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings."

It might be useful to describe how total string storage is calculated so that it's easier to understand how to set this property. If we can compute that from some Velox metrics (available through Prometheus), then it would be great to share the computation.

Do you have a Velox blog article or documentation for this work? It would be great to link that here.

duxiao1212 · Jan 5, 2026

@aditi-pandit, thanks for the great questions.

Is this only for strings?
Yes, currently this is specific to string compaction for the approx_most_frequent aggregate with StringView type during global aggregation. It may extend to other aggregates in the future, but for now, it's limited to this use case.

How does the compaction mechanism work?
This is not a general Velox string compaction mechanism. It's a per-aggregation-function compaction that operates on the accumulator object during global aggregation (addSingleGroupRawInput).

For approx_most_frequent, the accumulator uses:
- A Strings struct that stores non-inlined strings in memory blocks managed by HashStringAllocator
- An ApproxMostFrequentStreamSummary data structure that maintains a fixed-capacity summary of top-k frequent values

When values are evicted from the summary (due to the stream summary algorithm), the strings remain in memory as "dead strings." The compaction mechanism:
- Tracks activeBytes_ (bytes used by strings currently in the summary) and evictedBytes_ (bytes used by evicted/dead strings)
- Triggers compaction when activeBytes_ + evictedBytes_ > compactionBytesThreshold AND evictedBytes_ > compactionBytesThreshold * compactionUnusedMemoryRatio
- Copies only the active strings to a new Strings storage and frees the old storage
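
The bookkeeping and trigger condition above can be sketched as follows. This is a simplified model, not the Velox accumulator: `StringStoreSketch` and its members are hypothetical, an `unordered_map` stands in for the HashStringAllocator-backed storage, and treating a threshold of 0 as "disabled" is an assumption taken from the documented default.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical model of the accumulator's string storage. Evicted strings
// stay in blocks_ as dead strings until the trigger condition fires.
class StringStoreSketch {
 public:
  StringStoreSketch(uint64_t bytesThreshold, double unusedRatio)
      : bytesThreshold_(bytesThreshold), unusedRatio_(unusedRatio) {}

  // Stores a string referenced by the summary and returns its handle.
  int add(const std::string& value) {
    const int id = nextId_++;
    blocks_.emplace(id, value);
    live_.insert(id);
    activeBytes_ += value.size();
    return id;
  }

  // Marks a string as evicted; its bytes move from active to evicted.
  void evict(int id) {
    if (live_.erase(id) == 0) {
      return;  // unknown handle or already evicted
    }
    const uint64_t size = blocks_.at(id).size();
    activeBytes_ -= size;
    evictedBytes_ += size;
    if (shouldCompact()) {
      compact();
    }
  }

  // The condition quoted above. ASSUMPTION: threshold 0 means disabled.
  bool shouldCompact() const {
    if (bytesThreshold_ == 0) {
      return false;
    }
    return activeBytes_ + evictedBytes_ > bytesThreshold_ &&
        static_cast<double>(evictedBytes_) >
        static_cast<double>(bytesThreshold_) * unusedRatio_;
  }

  uint64_t activeBytes() const { return activeBytes_; }
  uint64_t evictedBytes() const { return evictedBytes_; }
  std::size_t storedStrings() const { return blocks_.size(); }

 private:
  // Copies only live strings into fresh storage and drops the old storage.
  void compact() {
    std::unordered_map<int, std::string> fresh;
    for (int id : live_) {
      fresh.emplace(id, std::move(blocks_.at(id)));
    }
    blocks_ = std::move(fresh);
    evictedBytes_ = 0;
  }

  uint64_t bytesThreshold_;
  double unusedRatio_;
  uint64_t activeBytes_ = 0;
  uint64_t evictedBytes_ = 0;
  int nextId_ = 0;
  std::unordered_map<int, std::string> blocks_;
  std::unordered_set<int> live_;
};
```

Note that, as written in the reply, the second clause compares evicted bytes against the threshold scaled by the ratio, not against the current total; the sketch follows that wording.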

How to set this property?
The total string storage (activeBytes_ + evictedBytes_) depends on:
- The capacity of the stream summary (user-specified parameter)
- The size of the string values being aggregated
- The churn rate of values being evicted

Unfortunately, there's no direct Velox/Prometheus metric for this today. Users may need to estimate based on their data characteristics (e.g., average string size × summary capacity × expected churn).
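
The rule of thumb above can be written out as a small calculation. This is only a back-of-the-envelope sketch: the function name is hypothetical, and interpreting "expected churn" as a multiplier of at least 1 (roughly how many times the summary's contents turn over) is an assumption, not something the reply defines.

```cpp
#include <cstdint>

// Hypothetical estimate of total string storage (active + evicted bytes).
// ASSUMPTION: churnFactor >= 1 approximates how many times the summary's
// contents are replaced over the course of the aggregation.
uint64_t estimateStringStorageBytes(
    uint64_t avgStringBytes,
    uint64_t summaryCapacity,
    double churnFactor) {
  const double estimate =
      static_cast<double>(avgStringBytes) *
      static_cast<double>(summaryCapacity) * churnFactor;
  return static_cast<uint64_t>(estimate);
}
```

For example, 64-byte strings, a summary capacity of 10,000, and a churn factor of 2 give an estimate of 1,280,000 bytes, which could serve as a starting point when choosing a value for `native_aggregation_compaction_bytes_threshold`.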

steveburnett · Jan 8, 2026

> Users may need to estimate based on their data characteristics (e.g., average string size × summary capacity × expected churn).

This would have been useful information to include in the documentation for the reader.

aditi-pandit · 2026-01-02T03:53:40Z

presto-docs/src/main/sphinx/presto_cpp/properties-session.rst

+* **Maximum value:** ``1``
+* **Default value:** ``0.25``
+
+Native Execution only. Ratio of unused (evicted) bytes to total bytes that triggers


Here as well, it's difficult to understand how "unused (evicted) bytes" is computed. If we can compute that from some Velox metrics (available through Prometheus), then it would be great to share the computation.

hantangwangd · 2026-04-01T16:30:51Z

Hi @duxiao1212, thanks for this PR! As part of the release process — do you think this change warrants a release note? If so, would you like to add one? Happy to help if you'd prefer.

duxiao1212 requested review from a team, elharo and steveburnett as code owners December 30, 2025 02:13

prestodb-ci added the from:Meta PR from Meta label Dec 30, 2025

facebook-github-bot added fb-exported meta-exported labels Dec 30, 2025

xiaoxmeng previously approved these changes Dec 30, 2025

sourcery-ai bot reviewed Dec 30, 2025

duxiao1212 dismissed xiaoxmeng’s stale review via c63d822 December 30, 2025 02:36

duxiao1212 force-pushed the export-D89909129 branch from 9c4b64a to c63d822 Compare December 30, 2025 02:36

duxiao1212 force-pushed the export-D89909129 branch from c63d822 to 7414d90 Compare December 30, 2025 02:47

duxiao1212 force-pushed the export-D89909129 branch from 7414d90 to aa0766c Compare January 1, 2026 20:41

duxiao1212 force-pushed the export-D89909129 branch from aa0766c to 6c31e29 Compare January 1, 2026 20:41

aditi-pandit reviewed Jan 2, 2026

duxiao1212 force-pushed the export-D89909129 branch from 6c31e29 to 37ce2c8 Compare January 5, 2026 14:20

duxiao1212 requested a review from NikhilCollooru January 6, 2026 16:30

NikhilCollooru approved these changes Jan 6, 2026

duxiao1212 merged commit 705585c into prestodb:master Jan 6, 2026
82 of 84 checks passed

tdcmeehan pushed a commit to rdtr/presto that referenced this pull request Jan 14, 2026

feat: Add session properties for aggregation compaction (prestodb#26874)

be20599

Summary: Per title Differential Revision: D89909129

This was referenced Mar 31, 2026

docs: Add release notes for 0.297 unix280/presto#51

Closed

docs: Add release notes for 0.297 unix280/presto#52

Open

prestodb-ci mentioned this pull request Apr 1, 2026

docs: Add release notes for 0.297 #27484

Open

15 tasks

duxiao1212 Jan 5, 2026

steveburnett Jan 8, 2026

8 participants
