
Conversation

Member

@andygrove andygrove commented Feb 23, 2025

Which issue does this PR close?

Part of #1436

Closes #1437

Rationale for this change

This is a small fix that greatly reduces the number of spill files created during native shuffle.

There is now a maximum of one spill file per output partition. Previously, we had seen tens of thousands of spill files in some cases.

What changes are included in this PR?

  • Add spill_file to PartitionBuffer (see the sketch after this list)
  • Add unit tests to demonstrate the current behavior of memory pool interactions (this behavior has some issues, and I plan to create follow-up PRs to address the issues these tests expose)
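
A minimal sketch of the idea (not the actual Comet code; the struct fields and helper logic below are assumptions): each output partition lazily creates a single append-only spill file on its first spill and reuses it for every subsequent spill.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

struct PartitionBuffer {
    /// Encoded batches waiting to be spilled or written out (hypothetical field).
    frozen: Vec<u8>,
    /// Created lazily on the first spill, then reused (append-only).
    spill_file: Option<(PathBuf, File)>,
}

impl PartitionBuffer {
    fn spill(&mut self, spill_dir: &Path, partition: usize) -> io::Result<()> {
        if self.frozen.is_empty() {
            return Ok(()); // nothing to spill, so no file is created
        }
        if self.spill_file.is_none() {
            let path = spill_dir.join(format!("partition-{partition}.spill"));
            let file = OpenOptions::new().create(true).append(true).open(&path)?;
            self.spill_file = Some((path, file));
        }
        let (_, file) = self.spill_file.as_mut().unwrap();
        file.write_all(&self.frozen)?; // append; the file is never truncated
        self.frozen.clear();
        Ok(())
    }
}
```

Appending to one reused file keeps the spill-file count bounded by the number of output partitions rather than by the number of spill events.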

How are these changes tested?

I ran TPC-H locally with these settings and confirmed that I saw exchanges spilling.

    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=8g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=3g \

The spilled_bytes metric now looks reasonable:

[screenshot: spilled_bytes metric, 2025-02-23]

codecov-commenter commented Feb 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.59%. Comparing base (f09f8af) to head (d8613e7).
Report is 52 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1440      +/-   ##
============================================
+ Coverage     56.12%   58.59%   +2.46%     
- Complexity      976     1017      +41     
============================================
  Files           119      122       +3     
  Lines         11743    12223     +480     
  Branches       2251     2295      +44     
============================================
+ Hits           6591     7162     +571     
+ Misses         4012     3909     -103     
- Partials       1140     1152      +12     


@andygrove andygrove changed the title from "fix: Reduce number of shuffle spill files" to "[wip] fix: Reduce number of shuffle spill files and fix spilled_bytes metric" on Feb 24, 2025
@andygrove andygrove changed the title from "fix: Reduce number of shuffle spill files and fix spilled_bytes metric" to "fix: Reduce number of shuffle spill files, fix spilled_bytes metric, add some unit tests" on Feb 26, 2025
@andygrove andygrove marked this pull request as ready for review February 26, 2025 19:57
spill_file.seek(SeekFrom::Start(spill.offsets[i]))?;
std::io::copy(&mut spill_file.take(length), &mut output_data)
.map_err(Self::to_df_err)?;
if let Some(spill_data) = self.buffered_partitions[i].spill_file.as_ref() {
Contributor

@mbutrovich mbutrovich Feb 26, 2025

This is basically saying, if 1) We have a SpillFile, and 2) the length of that SpillFile is greater than 0 -> we need to copy that spilled data to the output buffer. My question is: because we're now reusing spill files instead of creating them for each spill event, when does the reused SpillFile get truncated back to 0 now that we've copied all of the data to output_data? If it's happening somewhere that I don't see, perhaps a comment here where that happens.

Member Author

Yes, that is correct. We never truncate the spill file. We just append to it. At the end of the shuffle, we copy the contents of each spill file to the shuffle file. I will add some comments to make this clearer.
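
A rough sketch of that end-of-shuffle step, with hypothetical names (the real code also records per-partition offsets for the index file): rewind each partition's append-only spill file, copy it into the shuffle data file, then write whatever is still buffered in memory.

```rust
use std::fs::File;
use std::io::{self, copy, BufWriter, Seek, SeekFrom, Write};

// Copy one partition's spilled and in-memory data into the shuffle data file.
fn write_partition(
    spill_file: Option<&mut File>, // reused, append-only spill file (if any data spilled)
    in_memory: &[u8],              // frozen batches still held in memory
    output_data: &mut BufWriter<File>,
) -> io::Result<u64> {
    let mut written = 0u64;
    if let Some(spill) = spill_file {
        // The file was only ever appended to, so rewind and copy all of it.
        spill.seek(SeekFrom::Start(0))?;
        written += copy(spill, output_data)?;
    }
    output_data.write_all(in_memory)?;
    written += in_memory.len() as u64;
    Ok(written)
}
```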

Member Author

I added some comments and also removed the check for length > 0 since that was redundant (we only create the spill file if we have data to spill)

Contributor

Makes sense. I misunderstood the granularity at which the spill file could be written to the output file.

@mbutrovich
Contributor

Thanks for tackling this, @andygrove! Great to see shuffle write improving.

assert_eq!(0, buffer.num_active_rows);
assert_eq!(0, buffer.frozen.len());
assert_eq!(0, buffer.reservation.size());
assert!(buffer.spill_file.is_some());
Contributor

assert_eq!(9914, buffer.spill_file.as_ref().unwrap().file.len())? That's the frozen buffer length above.

Member Author

Thanks for suggesting that. I have added it.

@andygrove
Member Author

@kazuyukitanimura @comphead could I get a committer approval?


#[test]
#[cfg_attr(miri, ignore)] // miri can't call foreign function `ZSTD_createCCtx`
#[cfg(not(target_os = "macos"))] // Github MacOS runner fails with "Too many open files".
Member Author

This test was previously failing because it created too many spill files. It now passes.

Contributor

@comphead comphead left a comment

Thanks @andygrove. Does that mean we get a single shuffle file per partition or a single file per executor?

@mbutrovich
Contributor

mbutrovich commented Feb 27, 2025

> Thanks @andygrove. Does that mean we get a single shuffle file per partition or a single file per executor?

Just to clarify: when you say shuffle file do you mean the final output or the spill file? My understanding is the final output is a single shuffle file and a single index file (the index provides partition offsets in the shuffle file) per executor.

@andygrove
Member Author

> Thanks @andygrove. Does that mean we get a single shuffle file per partition or a single file per executor?

For each ShuffleMapTask there will now be a maximum of one spill file per output partition. An executor could be running multiple tasks in parallel.
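
To put illustrative numbers on that (assumptions, not measurements): a task with 200 shuffle output partitions can now create at most 200 spill files, and 8 such tasks running concurrently on an executor would hold at most 8 × 200 = 1,600 spill files, whereas previously each spill event could add another file per partition, which is how counts reached the tens of thousands.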

@andygrove
Member Author

> Thanks @andygrove. Does that mean we get a single shuffle file per partition or a single file per executor?
>
> Just to clarify: when you say shuffle file do you mean the final output or the spill file? My understanding is the final output is a single shuffle file and a single index file (the index provides partition offsets in the shuffle file) per executor.

That's correct. There is no change to the shuffle data and index files, just fewer spill files.

Contributor

@comphead comphead left a comment

Thanks @andygrove, LGTM

@andygrove andygrove merged commit 928e1a2 into apache:main Feb 27, 2025
74 checks passed
@andygrove andygrove deleted the shuffle-quick-fix branch February 27, 2025 21:11
@andygrove
Member Author

Thanks for the reviews @mbutrovich and @comphead

coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025

Development

Successfully merging this pull request may close these issues.

Shuffle spilled_bytes metric is incorrect
