feat: Add session properties for aggregation compaction#26874
feat: Add session properties for aggregation compaction#26874duxiao1212 merged 1 commit intoprestodb:masterfrom
Conversation
Reviewer's GuideAdds native session and worker properties to control aggregation string compaction thresholds and ratios, wires them to Velox QueryConfig, and updates tests to validate the new mappings. Class diagram for new aggregation compaction session propertiesclassDiagram
class SessionProperties {
+static const char* kUseVeloxGeospatialJoin
+static const char* kAggregationCompactionBytesThreshold
+static const char* kAggregationCompactionUnusedMemoryRatio
+SessionProperties()
+bool hasVeloxConfig(key)
+std::string toVeloxConfig()
}
class QueryConfig {
+static const char* kAggregationCompactionBytesThreshold
+static const char* kAggregationCompactionUnusedMemoryRatio
+uint64_t aggregationCompactionBytesThreshold()
+double aggregationCompactionUnusedMemoryRatio()
}
SessionProperties ..> QueryConfig : uses
class NativeWorkerSessionPropertyProvider {
+static final String NATIVE_INDEX_LOOKUP_JOIN_SPLIT_OUTPUT
+static final String NATIVE_UNNEST_SPLIT_OUTPUT
+static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN
+static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD
+static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO
-List~PropertyMetadata~ sessionProperties
+NativeWorkerSessionPropertyProvider(FeaturesConfig featuresConfig)
}
class FeaturesConfig {
+boolean isNativeExecutionEnabled()
}
NativeWorkerSessionPropertyProvider ..> FeaturesConfig : reads
NativeWorkerSessionPropertyProvider ..> SessionProperties : aligns_property_names
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 5 issues, and left some high level feedback:
- For
NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO, consider adding explicit validation or clamping when reading the session property to enforce the documented [0, 1) range, so misconfigurations (e.g., negative or >=1 values) fail fast rather than silently producing unexpected behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- For `NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO`, consider adding explicit validation or clamping when reading the session property to enforce the documented [0, 1) range, so misconfigurations (e.g., negative or >=1 values) fail fast rather than silently producing unexpected behavior.
## Individual Comments
### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:447-448` </location>
<code_context>
+ "aggregate with StringView type during global aggregation.",
+ 0L,
+ !nativeExecution),
+ doubleProperty(
+ NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO,
+ "Ratio of unused (evicted) bytes to total bytes that triggers compaction. " +
+ "The value is in the range of [0, 1). NOTE: Currently only applies to approx_most_frequent " +
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider enforcing the documented [0, 1) range for the unused-memory ratio session property.
`NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO` is documented as `[0, 1)`, but there’s no guard against negative values, `1.0`, or >1 being set. If the native code relies on this invariant, out-of-range values could cause incorrect or undefined behavior.
Please either enforce the range (reject with a clear error) or clamp the value into `[0, 1)` when reading the property so the implementation matches the documented contract.
</issue_to_address>
### Comment 2
<location> `presto-native-execution/presto_cpp/main/SessionProperties.cpp:590-599` </location>
<code_context>
+ QueryConfig::kAggregationCompactionBytesThreshold,
+ std::to_string(c.aggregationCompactionBytesThreshold()));
+
+ addSessionProperty(
+ kAggregationCompactionUnusedMemoryRatio,
+ "Ratio of unused (evicted) bytes to total bytes that triggers compaction. "
+ "The value is in the range of [0, 1). Default is 0.25. NOTE: Currently only applies "
+ "to approx_most_frequent aggregate with StringView type during global "
+ "aggregation. May extend to other aggregates.",
+ DOUBLE(),
+ false,
+ QueryConfig::kAggregationCompactionUnusedMemoryRatio,
+ std::to_string(c.aggregationCompactionUnusedMemoryRatio()));
}
</code_context>
<issue_to_address>
**issue (bug_risk):** Align C++ session property behavior with the documented [0, 1) range for the unused-memory ratio.
The property docs specify `[0, 1)`, but the code doesn’t validate this. If callers pass values < 0, `>= 1`, or NaN from session config, aggregation/compaction behavior could diverge from what’s expected. Please enforce this range at application time (e.g., clamp or reject and fall back to the default) so runtime behavior matches the documented contract.
</issue_to_address>
### Comment 3
<location> `presto-docs/src/main/sphinx/presto_cpp/properties-session.rst:554-558` </location>
<code_context>
+* **Type:** ``bigint``
+* **Default value:** ``0``
+
+Native Execution only. Memory threshold in bytes for triggering string compaction
+during global aggregation. When total string storage exceeds this limit with high
+unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by
+default (0). Currently only applies to approx_most_frequent aggregate with
+StringView type during global aggregation.
</code_context>
<issue_to_address>
**nitpick (typo):** Consider rephrasing the 'with high unused memory ratio' clause for better grammar and clarity.
The phrase "When total string storage exceeds this limit with high unused memory ratio" is a bit awkward. Consider "When total string storage exceeds this limit and the unused memory ratio is high" or "...with a high unused memory ratio" for clearer grammar.
```suggestion
Native Execution only. Memory threshold in bytes for triggering string compaction
during global aggregation. When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings.
Disabled by default (0). Currently only applies to approx_most_frequent aggregate
with StringView type during global aggregation.
```
</issue_to_address>
### Comment 4
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:91` </location>
<code_context>
public static final String NATIVE_INDEX_LOOKUP_JOIN_SPLIT_OUTPUT = "native_index_lookup_join_split_output";
public static final String NATIVE_UNNEST_SPLIT_OUTPUT = "native_unnest_split_output";
public static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN = "native_use_velox_geospatial_join";
+ public static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD = "native_aggregation_compaction_bytes_threshold";
+ public static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO = "native_aggregation_compaction_unused_memory_ratio";
</code_context>
<issue_to_address>
**issue (review_instructions):** This new session property is added here but there is no corresponding documentation change in the docs files shown in the diff.
The instruction requires adding documentation for any new session property defined in this file. I don’t see a corresponding addition in `presto-docs/src/main/sphinx/presto_cpp/properties-session.rst` (or other docs) in this diff. Please add or update the appropriate documentation section to describe `native_aggregation_compaction_bytes_threshold` and its behavior.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/NativeWorkerSessionPropertyProvider.java`
**Instructions:**
Make sure that accompanying documentation is added for any session property which is added to this file.
</details>
</issue_to_address>
### Comment 5
<location> `presto-main-base/src/main/java/com/facebook/presto/sessionpropertyproviders/NativeWorkerSessionPropertyProvider.java:92` </location>
<code_context>
public static final String NATIVE_UNNEST_SPLIT_OUTPUT = "native_unnest_split_output";
public static final String NATIVE_USE_VELOX_GEOSPATIAL_JOIN = "native_use_velox_geospatial_join";
+ public static final String NATIVE_AGGREGATION_COMPACTION_BYTES_THRESHOLD = "native_aggregation_compaction_bytes_threshold";
+ public static final String NATIVE_AGGREGATION_COMPACTION_UNUSED_MEMORY_RATIO = "native_aggregation_compaction_unused_memory_ratio";
private final List<PropertyMetadata<?>> sessionProperties;
</code_context>
<issue_to_address>
**issue (review_instructions):** This new session property is also missing corresponding documentation changes in the docs diff.
Similarly, `native_aggregation_compaction_unused_memory_ratio` is introduced here without an accompanying documentation update in the shown docs files. To follow the review instruction, please add or extend the relevant section in the session properties documentation to cover this setting and how it should be used.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/NativeWorkerSessionPropertyProvider.java`
**Instructions:**
Make sure that accompanying documentation is added for any session property which is added to this file.
</details>
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| Native Execution only. Memory threshold in bytes for triggering string compaction | ||
| during global aggregation. When total string storage exceeds this limit with high | ||
| unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by | ||
| default (0). Currently only applies to approx_most_frequent aggregate with | ||
| StringView type during global aggregation. |
There was a problem hiding this comment.
nitpick (typo): Consider rephrasing the 'with high unused memory ratio' clause for better grammar and clarity.
The phrase "When total string storage exceeds this limit with high unused memory ratio" is a bit awkward. Consider "When total string storage exceeds this limit and the unused memory ratio is high" or "...with a high unused memory ratio" for clearer grammar.
| Native Execution only. Memory threshold in bytes for triggering string compaction | |
| during global aggregation. When total string storage exceeds this limit with high | |
| unused memory ratio, compaction is triggered to reclaim dead strings. Disabled by | |
| default (0). Currently only applies to approx_most_frequent aggregate with | |
| StringView type during global aggregation. | |
| Native Execution only. Memory threshold in bytes for triggering string compaction | |
| during global aggregation. When total string storage exceeds this limit and the | |
| unused memory ratio is high, compaction is triggered to reclaim dead strings. | |
| Disabled by default (0). Currently only applies to approx_most_frequent aggregate | |
| with StringView type during global aggregation. |
9c4b64a to
c63d822
Compare
Summary: Summary This diff adds two new session properties to configure string compaction for the approx_most_frequent aggregate function during global aggregation: 1. native_aggregation_compaction_bytes_threshold (BIGINT, default: 0) Memory threshold in bytes for triggering string compaction When total string storage exceeds this limit with high unused memory ratio, compaction is triggered to reclaim dead strings Disabled by default (0) 2. native_aggregation_compaction_unused_memory_ratio (DOUBLE, default: 0.25) Ratio of unused (evicted) bytes to total bytes that triggers compaction Value is between 0.0 and 1.0 Reviewed By: xiaoxmeng Differential Revision: D89909129
c63d822 to
7414d90
Compare
Summary: Summary This diff adds two new session properties to configure string compaction for the approx_most_frequent aggregate function during global aggregation: 1. native_aggregation_compaction_bytes_threshold (BIGINT, default: 0) Memory threshold in bytes for triggering string compaction When total string storage exceeds this limit with high unused memory ratio, compaction is triggered to reclaim dead strings Disabled by default (0) 2. native_aggregation_compaction_unused_memory_ratio (DOUBLE, default: 0.25) Ratio of unused (evicted) bytes to total bytes that triggers compaction Value is between 0.0 and 1.0 Reviewed By: xiaoxmeng Differential Revision: D89909129
7414d90 to
aa0766c
Compare
Summary: Summary This diff adds two new session properties to configure string compaction for the approx_most_frequent aggregate function during global aggregation: 1. native_aggregation_compaction_bytes_threshold (BIGINT, default: 0) Memory threshold in bytes for triggering string compaction When total string storage exceeds this limit with high unused memory ratio, compaction is triggered to reclaim dead strings Disabled by default (0) 2. native_aggregation_compaction_unused_memory_ratio (DOUBLE, default: 0.25) Ratio of unused (evicted) bytes to total bytes that triggers compaction Value is between 0.0 and 1.0 Reviewed By: xiaoxmeng Differential Revision: D89909129
aa0766c to
6c31e29
Compare
There was a problem hiding this comment.
Thanks @duxiao1212 for this code.
Have a bunch of comments about the documentation added for these properties.
| * **Type:** ``bigint`` | ||
| * **Default value:** ``0`` | ||
|
|
||
| Native Execution only. Memory threshold in bytes for triggering string compaction |
There was a problem hiding this comment.
Do you envision compaction to apply for non-string data types ? If this is only for strings then we could clarify the naming to be specific about string compaction.
Also its hard to follow from an end-user perspective
"When total string storage exceeds this limit and the
unused memory ratio is high, compaction is triggered to reclaim dead strings."
It might be useful to describe how total string storage is calculated so that its easier to understand how to set this property. If we can compute that from some Velox metrics (available through Prometheus), then it would be great to share the computation.
Do you have some Velox blog article or documentation for this work ? Would be great to link that here.
There was a problem hiding this comment.
@aditi-pandit , thanks for the great questions,
-
Is this only for strings?
Yes, currently this is specific to string compaction for the approx_most_frequent aggregate with StringView type during global aggregation. It may extend to other aggregates in the future, but for now, it's limited to this use case. -
How does the compaction mechanism work?
This is not a general Velox string compaction mechanism. It's a per-aggregation-function compaction that operates on the accumulator object during global aggregation (addSingleGroupRawInput).
For approx_most_frequent, the accumulator uses:
A Strings struct that stores non-inlined strings in memory blocks managed by HashStringAllocator
An ApproxMostFrequentStreamSummary data structure that maintains a fixed-capacity summary of top-k frequent values
When values are evicted from the summary (due to the stream summary algorithm), the strings remain in memory as "dead strings." The compaction mechanism:
Tracks activeBytes_ (bytes used by strings currently in the summary) and evictedBytes_ (bytes used by evicted/dead strings)
When activeBytes_ + evictedBytes_ > compactionBytesThreshold AND evictedBytes_ > compactionBytesThreshold * compactionUnusedMemoryRatio, compaction is triggered
Compaction copies only the active strings to a new Strings storage and frees the old storage
- How to set this property?
The total string storage (activeBytes_ + evictedBytes_) depends on:
The capacity of the stream summary (user-specified parameter)
The size of the string values being aggregated
The churn rate of values being evicted
Unfortunately, there's no direct Velox/Prometheus metric for this today. Users may need to estimate based on their data characteristics (e.g., average string size × summary capacity × expected churn).
There was a problem hiding this comment.
Users may need to estimate based on their data characteristics (e.g., average string size × summary capacity × expected churn).
This would have been useful information to include in the documentation for the reader.
| * **Maximum value:** ``1`` | ||
| * **Default value:** ``0.25`` | ||
|
|
||
| Native Execution only. Ratio of unused (evicted) bytes to total bytes that triggers |
There was a problem hiding this comment.
Here as well its difficult to understand how "unused (evicted) bytes" is computed. If we can compute that from some Velox metrics (available through Prometheus), then it would be great to share the computation.
Summary: Summary This diff adds two new session properties to configure string compaction for the approx_most_frequent aggregate function during global aggregation: 1. native_aggregation_compaction_bytes_threshold (BIGINT, default: 0) Memory threshold in bytes for triggering string compaction When total string storage exceeds this limit with high unused memory ratio, compaction is triggered to reclaim dead strings Disabled by default (0) 2. native_aggregation_compaction_unused_memory_ratio (DOUBLE, default: 0.25) Ratio of unused (evicted) bytes to total bytes that triggers compaction Value is between 0.0 and 1.0 Reviewed By: xiaoxmeng Differential Revision: D89909129
6c31e29 to
37ce2c8
Compare
Summary: Per title Differential Revision: D89909129
|
Hi @duxiao1212, thanks for this PR! As part of the release process — do you think this change warrants a release note? If so, would you like to add one? Happy to help if you'd prefer. |
Summary: Per title
Differential Revision: D89909129