Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup of #47856 . It makes the memory tracking more accurate in several places:

  1. In ShuffleExternalSorter/UnsafeExternalSorter, the memory is used by both the sorter itself, and its underlying in-memort sorter (for sorting shuffle partition ids). We need to add them up to calcuate the current memory usage.
  2. In ExternalAppendOnlyUnsafeRowArray, the records are inserted to an in-memory buffer first. If the buffer gets too large (currently based on num records), we switch to UnsafeExternalSorter. The in-memory buffer also needs a memory based threshold

Why are the changes needed?

More accurate memory tracking results to better spill decisions

Does this PR introduce any user-facing change?

No, the feature is not released yet.

How was this patch tested?

existing tests

Was this patch authored or co-authored using generative AI tooling?

no

@cloud-fan
Copy link
Contributor Author

cc @cxzl25 @attilapiros @mridulm

private val inMemoryThreshold = conf.windowExecBufferInMemoryThreshold
private val spillThreshold = conf.windowExecBufferSpillThreshold
private val spillSizeThreshold = conf.windowExecBufferSpillSizeThreshold
private val sizeInBytesSpillThreshold = conf.windowExecBufferSpillSizeThreshold
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I reuse the spill threshold config for the ExternalAppendOnlyUnsafeRowArray in-memory buffer threshold as well. We can add new configs if we want (but it will be a lot of configs).

if (numRows < numRowsInMemoryBufferThreshold) {
// Once spills, we will switch to UnsafeExternalSorter permanently.
if (spillableArray == null && numRows < numRowsInMemoryBufferThreshold &&
inMemoryBufferSizeInBytes < sizeInBytesSpillThreshold) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should sizeInBytesInMemoryBufferThreshold be used ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I made a typo here...

* of using an [[ArrayBuffer]] or [[Array]].
*/
private[sql] class ExternalAppendOnlyUnsafeRowArray(
class ExternalAppendOnlyUnsafeRowArray(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does private[sql] need to be removed?

The previous PR added parameters in ExternalAppendOnlyUnsafeRowArray, but the documentation for ExternalAppendOnlyUnsafeRowArray was not updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other data structures (ShuffleExternalSorter, UnsafeExternalSorter, etc.) are all public classes under a private package. I just make them consistent here, not a big deal and I'm fine to revert as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the classdoc

Copy link
Contributor

@cxzl25 cxzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Ngone51 Ngone51 closed this in 336ca8c Sep 4, 2025
@Ngone51
Copy link
Member

Ngone51 commented Sep 4, 2025

Merged to master, thanks!

huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…memory based spill threshold

### What changes were proposed in this pull request?

This is a followup of apache#47856 . It makes the memory tracking more accurate in several places:
1. In `ShuffleExternalSorter`/`UnsafeExternalSorter`, the memory is used by both the sorter itself, and its underlying in-memort sorter (for sorting shuffle partition ids). We need to add them up to calcuate the current memory usage.
2. In `ExternalAppendOnlyUnsafeRowArray`, the records are inserted to an in-memory buffer first. If the buffer gets too large (currently based on num records), we switch to `UnsafeExternalSorter`. The in-memory buffer also needs a memory based threshold

### Why are the changes needed?

More accurate memory tracking results to better spill decisions

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#52190 from cloud-fan/spill.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Yi Wu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants