@cloud-fan cloud-fan commented Jan 12, 2024

What changes were proposed in this pull request?

This PR fixes a long-standing bug in ShuffleExternalSorter that affects the "spilled disk bytes" metric. When the sorter is closed, it spills the data remaining in the buffer with the flag isLastFile = true, which means this spill does not increase the "spilled disk bytes" metric. That is correct if the sorter has never spilled before: the final spill file is then used as the final shuffle output file, and the "spilled disk bytes" metric should stay at 0. However, if spilling has already happened, the final spill file is currently left out of the "spilled disk bytes" metric, so the metric is undercounted.

This PR fixes the issue by setting that flag when closing the sorter only if this is the first spill.
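
The accounting rule described above can be sketched with a toy model. This is not Spark's ShuffleExternalSorter: the class `SpillAccountingSketch`, its fields, and the `fixed` switch are hypothetical names introduced only for illustration. The sketch assumes, as stated in this conversation, that a spill written with the flag set is counted toward the shuffle write metrics instead of the spill metrics.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the metric accounting; all names here are hypothetical. */
class SpillAccountingSketch {
    long diskBytesSpilled = 0;     // stands in for the "spilled disk bytes" metric
    long shuffleBytesWritten = 0;  // stands in for the shuffle write metric
    private final List<Long> spillFileSizes = new ArrayList<>();
    private long bufferedBytes = 0;
    private final boolean fixed;   // false = behavior before this PR, true = after

    SpillAccountingSketch(boolean fixed) {
        this.fixed = fixed;
    }

    void insert(long bytes) {
        bufferedBytes += bytes;
    }

    /** A spill caused by memory pressure is always counted as spilled bytes. */
    void spill() {
        writeBuffer(false);
    }

    /** Closing the sorter flushes whatever is still buffered. */
    void close() {
        // Before: the flag was always set on close.
        // After: the flag is set only if nothing has been spilled yet.
        boolean flag = fixed ? spillFileSizes.isEmpty() : true;
        writeBuffer(flag);
    }

    private void writeBuffer(boolean flag) {
        if (bufferedBytes == 0) {
            return;
        }
        if (flag) {
            shuffleBytesWritten += bufferedBytes; // file doubles as shuffle output
        } else {
            diskBytesSpilled += bufferedBytes;    // ordinary spill file
        }
        spillFileSizes.add(bufferedBytes);
        bufferedBytes = 0;
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            SpillAccountingSketch s = new SpillAccountingSketch(fixed);
            s.insert(100);
            s.spill();   // first spill: 100 bytes
            s.insert(40);
            s.close();   // final spill: 40 bytes
            System.out.println((fixed ? "after fix: " : "before fix: ") + s.diskBytesSpilled);
        }
    }
}
```

In this model, a sorter that spilled 100 bytes and then flushed 40 bytes on close reports 100 spilled bytes with the old rule and 140 with the new one, while a sorter that never spilled before closing reports 0 in both cases.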

Why are the changes needed?

To make the "spilled disk bytes" metric accurate.

Does this PR introduce any user-facing change?

no

How was this patch tested?

updated tests

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the CORE label Jan 12, 2024

Renamed the flag to be more accurate. It does not mean "the last spill file"; it means the only spill file, which will therefore be used as the final shuffle output file.

The comment was wrong before. If this flag is true, we are writing the final shuffle output file and increase the shuffle write metrics rather than the spill metrics.

Moved the logging here so that it applies to the last spill as well.

This is the actual change of this PR. We should only set this flag if we have not spilled before.

This hack is not needed anymore.

ShuffleWriteMetrics shuffleWriteMetrics = taskMetrics.shuffleWriteMetrics();
assertEquals(dataToWrite.size(), shuffleWriteMetrics.recordsWritten());
assertTrue(taskMetrics.diskBytesSpilled() > 0L);
assertTrue(taskMetrics.diskBytesSpilled() < mergedOutputFile.length());

The previous test was simply too relaxed.

@cloud-fan
cc @mridulm

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

dongjoon-hyun pushed a commit that referenced this pull request Jan 12, 2024
…ling bytes metric

Closes #44709 from cloud-fan/shuffle.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4ea3742)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Merged to master/3.5/3.4. Thank you, @cloud-fan and @gengliangwang.

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025