-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-49386][CORE][SQL][FOLLOWUP] More accurate memory tracking for memory based spill threshold #52190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| private val inMemoryThreshold = conf.windowExecBufferInMemoryThreshold | ||
| private val spillThreshold = conf.windowExecBufferSpillThreshold | ||
| private val spillSizeThreshold = conf.windowExecBufferSpillSizeThreshold | ||
| private val sizeInBytesSpillThreshold = conf.windowExecBufferSpillSizeThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I reuse the spill threshold config for the ExternalAppendOnlyUnsafeRowArray in-memory buffer threshold as well. We can add new configs if we want (but it will be a lot of configs).
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java
Show resolved
Hide resolved
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java
Outdated
Show resolved
Hide resolved
| if (numRows < numRowsInMemoryBufferThreshold) { | ||
| // Once spills, we will switch to UnsafeExternalSorter permanently. | ||
| if (spillableArray == null && numRows < numRowsInMemoryBufferThreshold && | ||
| inMemoryBufferSizeInBytes < sizeInBytesSpillThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should sizeInBytesInMemoryBufferThreshold be used ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I made a typo here...
| * of using an [[ArrayBuffer]] or [[Array]]. | ||
| */ | ||
| private[sql] class ExternalAppendOnlyUnsafeRowArray( | ||
| class ExternalAppendOnlyUnsafeRowArray( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does private[sql] need to be removed?
The previous PR added parameters in ExternalAppendOnlyUnsafeRowArray, but the documentation for ExternalAppendOnlyUnsafeRowArray was not updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other data structures (ShuffleExternalSorter, UnsafeExternalSorter, etc.) are all public classes under a private package. I just make them consistent here, not a big deal and I'm fine to revert as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the classdoc
sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala
Outdated
Show resolved
Hide resolved
…lAppendOnlyUnsafeRowArray.scala
cxzl25
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merged to master, thanks! |
…memory based spill threshold ### What changes were proposed in this pull request? This is a followup of apache#47856 . It makes the memory tracking more accurate in several places: 1. In `ShuffleExternalSorter`/`UnsafeExternalSorter`, the memory is used by both the sorter itself, and its underlying in-memort sorter (for sorting shuffle partition ids). We need to add them up to calcuate the current memory usage. 2. In `ExternalAppendOnlyUnsafeRowArray`, the records are inserted to an in-memory buffer first. If the buffer gets too large (currently based on num records), we switch to `UnsafeExternalSorter`. The in-memory buffer also needs a memory based threshold ### Why are the changes needed? More accurate memory tracking results to better spill decisions ### Does this PR introduce _any_ user-facing change? No, the feature is not released yet. ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#52190 from cloud-fan/spill. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Yi Wu <[email protected]>
What changes were proposed in this pull request?
This is a followup of #47856 . It makes the memory tracking more accurate in several places:
ShuffleExternalSorter/UnsafeExternalSorter, the memory is used by both the sorter itself, and its underlying in-memort sorter (for sorting shuffle partition ids). We need to add them up to calcuate the current memory usage.ExternalAppendOnlyUnsafeRowArray, the records are inserted to an in-memory buffer first. If the buffer gets too large (currently based on num records), we switch toUnsafeExternalSorter. The in-memory buffer also needs a memory based thresholdWhy are the changes needed?
More accurate memory tracking results to better spill decisions
Does this PR introduce any user-facing change?
No, the feature is not released yet.
How was this patch tested?
existing tests
Was this patch authored or co-authored using generative AI tooling?
no