feat: Add session properties for aggregation compaction #26874

Merged
duxiao1212 merged 1 commit into prestodb:master from duxiao1212:export-D89909129
Jan 6, 2026

@duxiao1212
Contributor

Summary: Per title

Differential Revision: D89909129

@duxiao1212 duxiao1212 requested review from a team, elharo and steveburnett as code owners December 30, 2025 02:13
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Dec 30, 2025
@sourcery-ai
Contributor

sourcery-ai bot commented Dec 30, 2025

Reviewer's Guide

Adds native session and worker properties to control aggregation string compaction thresholds and ratios, wires them to Velox QueryConfig, and updates tests to validate the new mappings.

Class diagram for new aggregation compaction session properties

classDiagram
    class SessionProperties {
        +static const char* kUseVeloxGeospatialJoin
        +static const char* kAggregationCompactionBytesThreshold
        +static const char* kAggregationCompactionUnusedMemoryRatio
        +SessionProperties()
        +bool hasVeloxConfig(key)
        +std::string toVeloxConfig()
    }

    class QueryConfig {
        +static const char* kAggregationCompactionBytesThreshold
        +static const char* kAggregationCompactionUnusedMemoryRatio
        +uint64_t aggregationCompactionBytesThreshold()
        +double aggregationCompactionUnusedMemoryRatio()
    }

    SessionProperties ..> QueryConfig : uses

    class NativeWorkerSessionPropertyProvider {
        +static final String NATIVE_INDEX_LOOKUP_JOIN_SPLIT_OUTPUT
        +static final String NATIVE_UNNEST_SPLIT_OUTPUT
        +static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN
        +static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD
        +static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO
        -List~PropertyMetadata~ sessionProperties
        +NativeWorkerSessionPropertyProvider(FeaturesConfig featuresConfig)
    }

    class FeaturesConfig {
        +boolean isNativeExecutionEnabled()
    }

    NativeWorkerSessionPropertyProvider ..> FeaturesConfig : reads
    NativeWorkerSessionPropertyProvider ..> SessionProperties : aligns_property_names

File-Level Changes

Change: Introduce native session properties to control aggregation string compaction behavior in Presto CPP and wire them to Velox QueryConfig.
Details:
  • Declare new session property keys for aggregation compaction byte threshold and unused memory ratio in the SessionProperties header
  • Register the new aggregation compaction properties in the SessionProperties constructor, including descriptions, types, config keys, and default values from QueryConfig
  • Ensure the new session properties are included in the mapping used to translate session properties to Velox config keys in tests
Files:
  • presto-native-execution/presto_cpp/main/SessionProperties.h
  • presto-native-execution/presto_cpp/main/SessionProperties.cpp
  • presto-native-execution/presto_cpp/main/tests/SessionPropertiesTest.cpp

Change: Expose corresponding native worker session properties on the Java side so they can be set in Presto configurations.
Details:
  • Add Java string constants for the new native aggregation compaction session property names
  • Register long and double session properties for aggregation compaction threshold and unused memory ratio with appropriate defaults and descriptions in NativeWorkerSessionPropertyProvider
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java

xiaoxmeng previously approved these changes Dec 30, 2025
Contributor

@xiaoxmeng xiaoxmeng left a comment

@duxiao1212 thanks!

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 5 issues, and left some high level feedback:

  • For NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO, consider adding explicit validation or clamping when reading the session property to enforce the documented [0, 1) range, so misconfigurations (e.g., negative or >=1 values) fail fast rather than silently producing unexpected behavior.
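
A minimal sketch of the fail-fast range check suggested above, assuming the ratio arrives as a string session value. `parseUnusedMemoryRatio` is a hypothetical helper, not an existing Presto or Velox function; it only illustrates rejecting values outside the documented [0, 1) range.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Presto or Velox): parses a session property
// value and accepts it only if it lies in the documented [0, 1) range.
// Returns std::nullopt for malformed, NaN, negative, or >= 1 input so the
// caller can reject the setting or fall back to the default.
std::optional<double> parseUnusedMemoryRatio(const std::string& text) {
  std::size_t consumed = 0;
  double value = 0.0;
  try {
    value = std::stod(text, &consumed);
  } catch (const std::exception&) {
    return std::nullopt;  // not a number, or out of range for double
  }
  if (consumed != text.size()) {
    return std::nullopt;  // trailing garbage such as "0.5x"
  }
  if (std::isnan(value) || value < 0.0 || value >= 1.0) {
    return std::nullopt;  // outside [0, 1)
  }
  return value;
}
```

Whether an invalid value should raise an error or fall back to the default is a design decision for the PR author; the sketch leaves that choice to the caller.
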
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- For `NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO`, consider adding explicit validation or clamping when reading the session property to enforce the documented [0, 1) range, so misconfigurations (e.g., negative or >=1 values) fail fast rather than silently producing unexpected behavior.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:447-448` </location>
<code_context>
+                                "aggregate with StringView type during global aggregation.",
+                        0L,
+                        !nativeExecution),
+                doubleProperty(
+                        NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO,
+                        "Ratio of unused (evicted) bytes to total bytes that triggers compaction. " +
+                                "The value is in the range of [0, 1). NOTE: Currently only applies to approx_most_frequent " +
</code_context>

<issue_to_address>
**issue (bug_risk):** Consider enforcing the documented [0, 1) range for the unused-memory ratio session property.

`NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO` is documented as `[0, 1)`, but there’s no guard against negative values, `1.0`, or >1 being set. If the native code relies on this invariant, out-of-range values could cause incorrect or undefined behavior.

Please either enforce the range (reject with a clear error) or clamp the value into `[0, 1)` when reading the property so the implementation matches the documented contract.
</issue_to_address>

### Comment 2
<location> `presto-native-execution/presto_cpp/main/SessionProperties.cpp:590-599` </location>
<code_context>
+      QueryConfig::kAggregationCompactionBytesThreshold,
+      std::to_string(c.aggregationCompactionBytesThreshold()));
+
+  addSessionProperty(
+      kAggregationCompactionUnusedMemoryRatio,
+      "Ratio of unused (evicted) bytes to total bytes that triggers compaction. "
+      "The value is in the range of [0, 1). Default is 0.25. NOTE: Currently only applies "
+      "to approx_most_frequent aggregate with StringView type during global "
+      "aggregation. May extend to other aggregates.",
+      DOUBLE(),
+      false,
+      QueryConfig::kAggregationCompactionUnusedMemoryRatio,
+      std::to_string(c.aggregationCompactionUnusedMemoryRatio()));
 }

</code_context>

<issue_to_address>
**issue (bug_risk):** Align C++ session property behavior with the documented [0, 1) range for the unused-memory ratio.

The property docs specify `[0, 1)`, but the code doesn’t validate this. If callers pass values < 0, `>= 1`, or NaN from session config, aggregation/compaction behavior could diverge from what’s expected. Please enforce this range at application time (e.g., clamp or reject and fall back to the default) so runtime behavior matches the documented contract.
</issue_to_address>

### Comment 3
<location> `presto-docs/src/main/sphinx/presto_cpp/properties-session.rst:554-558` </location>
<code_context>
+* **Type:** ``bigint``
+* **Default value:** ``0``
+
+Native Execution only. Memory threshold in bytes for triggering string compaction
+during global aggregation. When total string storage exceeds this limit with high
+unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by
+default (0). Currently only applies to approx_most_frequent aggregate with
+StringView type during global aggregation.
</code_context>

<issue_to_address>
**nitpick (typo):** Consider rephrasing the 'with high unused memory ratio' clause for better grammar and clarity.

The phrase "When total string storage exceeds this limit with high unused memory ratio" is a bit awkward. Consider "When total string storage exceeds this limit and the unused memory ratio is high" or "...with a high unused memory ratio" for clearer grammar.

```suggestion
Native Execution only. Memory threshold in bytes for triggering string compaction
during global aggregation. When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings.
Disabled by default (0). Currently only applies to approx_most_frequent aggregate
with StringView type during global aggregation.
```
</issue_to_address>

### Comment 4
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:91` </location>
<code_context>
     public static final String NATIVE_INDEX_LOOKUP_JOIN_SPLIT_OUTPUT = "native_index_lookup_join_split_output";
     public static final String NATIVE_UNNEST_SPLIT_OUTPUT = "native_unnest_split_output";
     public static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN = "native_use_velox_geospatial_join";
+    public static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD = "native_aggregation_compaction_bytes_threshold";
+    public static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO = "native_aggregation_compaction_unused_memory_ratio";

</code_context>

<issue_to_address>
**issue (review_instructions):** This new session property is added here but there is no corresponding documentation change in the docs files shown in the diff.

The instruction requires adding documentation for any new session property defined in this file. I don’t see a corresponding addition in `presto-docs/src/main/sphinx/presto_cpp/properties-session.rst` (or other docs) in this diff. Please add or update the appropriate documentation section to describe `native_aggregation_compaction_bytes_threshold` and its behavior.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/NativeWorkerSessionPropertyProvider.java`

**Instructions:**
Make sure that accompanying documentation is added for any session property which is added to this file.

</details>
</issue_to_address>

### Comment 5
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:92` </location>
<code_context>
     public static final String NATIVE_UNNEST_SPLIT_OUTPUT = "native_unnest_split_output";
     public static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN = "native_use_velox_geospatial_join";
+    public static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD = "native_aggregation_compaction_bytes_threshold";
+    public static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO = "native_aggregation_compaction_unused_memory_ratio";

     private final List<PropertyMetadata<?>> sessionProperties;
</code_context>

<issue_to_address>
**issue (review_instructions):** This new session property is also missing corresponding documentation changes in the docs diff.

Similarly, `native_aggregation_compaction_unused_memory_ratio` is introduced here without an accompanying documentation update in the shown docs files. To follow the review instruction, please add or extend the relevant section in the session properties documentation to cover this setting and how it should be used.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/NativeWorkerSessionPropertyProvider.java`

**Instructions:**
Make sure that accompanying documentation is added for any session property which is added to this file.

</details>
</issue_to_address>


duxiao1212 added a commit to duxiao1212/presto that referenced this pull request Dec 30, 2025
Summary:

This diff adds two new session properties to configure string compaction for the approx_most_frequent aggregate function during global aggregation:

1. native_aggregation_compaction_bytes_threshold (BIGINT, default: 0)
  • Memory threshold in bytes for triggering string compaction.
  • When total string storage exceeds this limit and the unused memory ratio is high, compaction is triggered to reclaim dead strings.
  • Disabled by default (0).

2. native_aggregation_compaction_unused_memory_ratio (DOUBLE, default: 0.25)
  • Ratio of unused (evicted) bytes to total bytes that triggers compaction.
  • Value is between 0.0 and 1.0.

Reviewed By: xiaoxmeng

Differential Revision: D89909129
duxiao1212 added a commit to duxiao1212/presto that referenced this pull request Jan 1, 2026
duxiao1212 added a commit to duxiao1212/presto that referenced this pull request Jan 1, 2026
Contributor

@aditi-pandit aditi-pandit left a comment

Thanks @duxiao1212 for this code.

I have a bunch of comments about the documentation added for these properties.

@amitkdutta

* **Type:** ``bigint``
* **Default value:** ``0``

Native Execution only. Memory threshold in bytes for triggering string compaction
Contributor

Do you envision compaction applying to non-string data types? If this is only for strings, then we could clarify the naming to be specific about string compaction.

Also, it's hard to follow from an end-user perspective:
"When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings."

It might be useful to describe how total string storage is calculated so that it's easier to understand how to set this property. If we can compute that from some Velox metrics (available through Prometheus), then it would be great to share the computation.

Do you have a Velox blog article or documentation for this work? It would be great to link that here.

Contributor Author

@aditi-pandit, thanks for the great questions.

  1. Is this only for strings?
    Yes, currently this is specific to string compaction for the approx_most_frequent aggregate with StringView type during global aggregation. It may extend to other aggregates in the future, but for now, it's limited to this use case.

  2. How does the compaction mechanism work?
    This is not a general Velox string compaction mechanism. It's a per-aggregation-function compaction that operates on the accumulator object during global aggregation (addSingleGroupRawInput).

For approx_most_frequent, the accumulator uses:

  • A Strings struct that stores non-inlined strings in memory blocks managed by HashStringAllocator
  • An ApproxMostFrequentStreamSummary data structure that maintains a fixed-capacity summary of top-k frequent values

When values are evicted from the summary (due to the stream summary algorithm), the strings remain in memory as "dead strings." The compaction mechanism:

  • Tracks activeBytes_ (bytes used by strings currently in the summary) and evictedBytes_ (bytes used by evicted/dead strings)
  • Triggers compaction when activeBytes_ + evictedBytes_ > compactionBytesThreshold AND evictedBytes_ > compactionBytesThreshold * compactionUnusedMemoryRatio
  • Copies only the active strings to a new Strings storage and frees the old storage
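
The byte accounting, trigger condition, and compaction step described above can be sketched roughly as follows. This is a toy model, not the Velox implementation: `ToyStringStore` and its members are hypothetical names, a `std::vector` stands in for the HashStringAllocator-backed `Strings` storage, and treating a zero threshold as "disabled" is an assumption based on the documented default.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for the accumulator's string storage. It only mirrors
// the activeBytes_ / evictedBytes_ accounting described in the comment above.
struct ToyStringStore {
  struct Entry {
    std::string value;
    bool active;  // false once evicted from the summary (a "dead string")
  };

  std::vector<Entry> entries;
  uint64_t activeBytes = 0;
  uint64_t evictedBytes = 0;

  void add(std::string value) {
    activeBytes += value.size();
    entries.push_back({std::move(value), true});
  }

  // Marks an entry as evicted; its bytes stay allocated until compaction.
  void evict(std::size_t index) {
    Entry& entry = entries[index];
    if (!entry.active) {
      return;
    }
    entry.active = false;
    activeBytes -= entry.value.size();
    evictedBytes += entry.value.size();
  }

  // Trigger condition as stated above. The zero-threshold early return is an
  // assumption reflecting "disabled by default (0)".
  bool shouldCompact(uint64_t bytesThreshold, double unusedMemoryRatio) const {
    if (bytesThreshold == 0) {
      return false;
    }
    return activeBytes + evictedBytes > bytesThreshold &&
        static_cast<double>(evictedBytes) >
        static_cast<double>(bytesThreshold) * unusedMemoryRatio;
  }

  // Copies only the active strings into fresh storage and drops the old one.
  void compact() {
    std::vector<Entry> kept;
    for (Entry& entry : entries) {
      if (entry.active) {
        kept.push_back(std::move(entry));
      }
    }
    entries = std::move(kept);
    evictedBytes = 0;
  }
};
```

With a 1000-byte threshold and the default ratio of 0.25, two 600-byte strings exceed the threshold, but the model triggers compaction only after one of them is evicted, because only then do evicted bytes exceed 250.
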

  3. How to set this property?
    The total string storage (activeBytes_ + evictedBytes_) depends on:
      • The capacity of the stream summary (user-specified parameter)
      • The size of the string values being aggregated
      • The churn rate of values being evicted

Unfortunately, there's no direct Velox/Prometheus metric for this today. Users may need to estimate based on their data characteristics (e.g., average string size × summary capacity × expected churn).
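
As a worked example of the rule of thumb above, here is a small sketch of the estimate. `estimateStringStorageBytes` and `churnFactor` are hypothetical names introduced for illustration; the churn factor is assumed to be a multiplier where 1.0 means no evictions and larger values mean more dead strings linger per live one.

```cpp
#include <cstdint>

// Rough estimate of total string storage (active + evicted bytes) for one
// approx_most_frequent accumulator:
//   average string size x summary capacity x expected churn.
// churnFactor is a hypothetical multiplier chosen by the user from knowledge
// of their data; it is not a value reported by Presto or Velox.
constexpr uint64_t estimateStringStorageBytes(
    uint64_t averageStringBytes,
    uint64_t summaryCapacity,
    double churnFactor) {
  return static_cast<uint64_t>(
      static_cast<double>(averageStringBytes * summaryCapacity) * churnFactor);
}
```

For instance, 64-byte strings with a summary capacity of 10,000 and a churn factor of 3.0 give an estimate of 1,920,000 bytes. How far below such an estimate to set native_aggregation_compaction_bytes_threshold is a judgment call that this thread does not settle.
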

Contributor

> Users may need to estimate based on their data characteristics (e.g., average string size × summary capacity × expected churn).

This would have been useful information to include in the documentation for the reader.

* **Maximum value:** ``1``
* **Default value:** ``0.25``

Native Execution only. Ratio of unused (evicted) bytes to total bytes that triggers
Contributor

Here as well, it's difficult to understand how "unused (evicted) bytes" is computed. If we can compute that from some Velox metrics (available through Prometheus), then it would be great to share the computation.

@duxiao1212 duxiao1212 merged commit 705585c into prestodb:master Jan 6, 2026
82 of 84 checks passed
tdcmeehan pushed a commit to rdtr/presto that referenced this pull request Jan 14, 2026
Summary: Per title

Differential Revision: D89909129
@hantangwangd
Member

Hi @duxiao1212, thanks for this PR! As part of the release process — do you think this change warrants a release note? If so, would you like to add one? Happy to help if you'd prefer.
