fix: Reduce number of shuffle spill files, fix spilled_bytes metric, add some unit tests #1440
Codecov Report: All modified and coverable lines are covered by tests ✅

```diff
@@ Coverage Diff @@
##               main    #1440      +/-   ##
============================================
+ Coverage     56.12%   58.59%   +2.46%
- Complexity      976     1017      +41
============================================
  Files           119      122       +3
  Lines         11743    12223     +480
  Branches       2251     2295      +44
============================================
+ Hits           6591     7162     +571
+ Misses         4012     3909     -103
- Partials       1140     1152      +12
```

☔ View full report in Codecov by Sentry.
```rust
spill_file.seek(SeekFrom::Start(spill.offsets[i]))?;
std::io::copy(&mut spill_file.take(length), &mut output_data)
    .map_err(Self::to_df_err)?;
```

```rust
if let Some(spill_data) = self.buffered_partitions[i].spill_file.as_ref() {
```
This is basically saying, if 1) We have a SpillFile, and 2) the length of that SpillFile is greater than 0 -> we need to copy that spilled data to the output buffer. My question is: because we're now reusing spill files instead of creating them for each spill event, when does the reused SpillFile get truncated back to 0 now that we've copied all of the data to output_data? If it's happening somewhere that I don't see, perhaps a comment here where that happens.
Yes, that is correct. We never truncate the spill file. We just append to it. At the end of the shuffle, we copy the contents of each spill file to the shuffle file. I will add some comments to make this clearer.
I added some comments and also removed the check for `length > 0`, since that was redundant (we only create the spill file if we have data to spill).
Makes sense. I misunderstood the granularity at which the spill file could be written to the output file.
Thanks for tackling this, @andygrove! Great to see shuffle write improving.
```rust
assert_eq!(0, buffer.num_active_rows);
assert_eq!(0, buffer.frozen.len());
assert_eq!(0, buffer.reservation.size());
assert!(buffer.spill_file.is_some());
```
`assert_eq!(9914, buffer.spill_file.as_ref().unwrap().file.len())`? That's the frozen buffer length above.
Thanks for suggesting that. I have added it.
@kazuyukitanimura @comphead could I get a committer approval?
```rust
#[test]
#[cfg_attr(miri, ignore)] // miri can't call foreign function `ZSTD_createCCtx`
#[cfg(not(target_os = "macos"))] // Github MacOS runner fails with "Too many open files".
```
This test was previously failing because it created too many spill files. It now passes.
comphead
left a comment
Thanks @andygrove. Does that mean we get a single shuffle file per partition, or a single file per executor?
Just to clarify: when you say shuffle file do you mean the final output or the spill file? My understanding is the final output is a single shuffle file and a single index file (the index provides partition offsets in the shuffle file) per executor.
That's correct. There is no change to the shuffle data and index files, just fewer spill files.
comphead
left a comment
thanks @andygrove lgtm
Thanks for the reviews @mbutrovich and @comphead
…add some unit tests (apache#1440)
Which issue does this PR close?
Part of #1436
Closes #1437
Rationale for this change
This is a small change that greatly reduces the number of shuffle spill files created during native shuffle.
There is now a maximum of one spill file per output partition. Previously, we had seen tens of thousands of spill files in some cases.
What changes are included in this PR?

- `spill_file` moved to `PartitionBuffer`

How are these changes tested?
I ran TPC-H locally with these settings and confirmed that I saw exchanges spilling.
The `spilled_bytes` metric now looks reasonable.