Conversation

@sopel39
Member

@sopel39 sopel39 commented Jun 26, 2025

HashBuilderOperator unspills sequentially, partition by partition. This commit improves join unspilling performance by making FileSingleStreamSpiller unspill a single partition in parallel. FileSingleStreamSpiller is enhanced so it can spill to and unspill from multiple files.

This reduces join unspilling time from minutes to seconds on larger machines.

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of spilling join queries. ({issue}`26076`)

Fixes #26007

@sopel39 sopel39 requested a review from raunaqmorarka June 26, 2025 13:40
@cla-bot cla-bot bot added the cla-signed label Jun 26, 2025
@sopel39
Member Author

sopel39 commented Jun 26, 2025

cc @osscm

@raunaqmorarka raunaqmorarka requested a review from pettyjamesm July 7, 2025 21:16
Member

@pettyjamesm pettyjamesm left a comment

@sopel39 - I'm not sure I understand the purpose of segmenting the individual files at 64MB boundaries. The code comments mention that we avoid materializing more than one segment at a time, but the code already avoided materializing more than one page from the input stream on read back at a time. Segmenting the files is adding a good bit of complexity to the implementation and it seems unnecessary on a first read through.

@sopel39
Member Author

sopel39 commented Jul 9, 2025

@sopel39 - I'm not sure I understand the purpose of segmenting the individual files at 64MB boundaries

Currently, when a join hits spilling, its parallelism is reduced to a single IO stream (build-side unspilling) and a single CPU thread (reconstruction of the lookup source). This PR addresses the first issue (the single IO stream). It was observed that this PR reduces the IO part of join unspilling from minutes to seconds.

Segmenting the files is adding a good bit of complexity to the implementation and it seems unnecessary on a first read through.

Segments are deterministic. When reading segments back, it is possible to restore the original page order without any extra index. This allows segments to be read back in parallel, which is the purpose of this PR.
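
For illustration, here is a toy sketch of why a deterministic segmentation rule removes the need for an index. This is not the PR's code; the 64MB cumulative-size cut rule and all names are assumptions for the sketch: each segment holds a contiguous run of the original page sequence, so segments can be read concurrently and the original order restored by concatenating them in segment order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy illustration of deterministic segmentation (hypothetical, not the PR's code).
class DeterministicSegmentsSketch
{
    private static final long SEGMENT_SIZE = 64L << 20; // cut a new segment every ~64MB

    // The segment boundaries are a pure function of the page sizes, so each segment
    // always holds a contiguous run of the original page sequence.
    static List<List<byte[]>> cutIntoSegments(List<byte[]> pagesInSpillOrder)
    {
        List<List<byte[]>> segments = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (byte[] page : pagesInSpillOrder) {
            if (!current.isEmpty() && currentBytes + page.length > SEGMENT_SIZE) {
                segments.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(page);
            currentBytes += page.length;
        }
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    // Segments can be fetched concurrently; the original page order is restored simply
    // by concatenating the segments in segment order, with no extra index.
    static List<byte[]> readBackInParallel(List<List<byte[]>> segments)
    {
        List<CompletableFuture<List<byte[]>>> futures = segments.stream()
                .map(segment -> CompletableFuture.supplyAsync(() -> segment)) // stands in for reading a segment from disk
                .toList();
        List<byte[]> pages = new ArrayList<>();
        for (CompletableFuture<List<byte[]>> future : futures) {
            pages.addAll(future.join());
        }
        return pages;
    }
}
```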

@sopel39 sopel39 requested a review from pettyjamesm July 9, 2025 09:58
Member

@pettyjamesm pettyjamesm left a comment

This still seems massively more complicated than necessary. Wouldn't it be simpler to cut files at the segment boundary and support "pipelined" parallelism instead (i.e.: allow concurrent read-back of up to n files)? It seems like that would give you more control over the read back concurrency and simplify the logic significantly.

@sopel39
Member Author

sopel39 commented Jul 10, 2025

Wouldn't it be simpler to cut files at the segment boundary and support "pipelined" parallelism instead (i.e.: allow concurrent read-back of up to n files)?

You mean create a new file for each segment? It would create a massive number of files that have to be tracked. Not ideal if you want to keep memory and resources constrained. Also, it wouldn't necessarily make it simpler as you still have to keep track of segment size, but now you also have to keep track of potentially hundreds of thousands of spill files.

@dain
Member

dain commented Jul 17, 2025

I think this can be simplified using the approach James suggested. For the spill, you can create a directory (in each target) for the spill and then number the files sequentially. Each file is 64MB, and together they form a total order. If the file count becomes a problem, you could scale up the file size as more files are created, but I don't expect this to be a problem on local disks.

The memory requirement is just the directory name and the max sequence number. This also makes cleanup on local disks trivial, as you can just delete the directories. You could also delete in parallel, but I don't expect that to be a problem on local disks.
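
A rough sketch of this directory-per-spill layout (hypothetical class and names, not code from this PR): files are numbered sequentially and cut at roughly 64MB, so the file index alone defines the total order, and cleanup is a recursive delete of the directory. Page framing, serialization, and compression are omitted.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: one directory per spill, sequentially numbered ~64MB files.
class DirectorySpillSketch
        implements AutoCloseable
{
    private static final long TARGET_FILE_SIZE = 64L << 20; // ~64MB per file

    private final Path spillDirectory;
    private int nextFileIndex;
    private long currentFileBytes;
    private OutputStream currentFile;

    DirectorySpillSketch(Path spillRoot)
            throws IOException
    {
        this.spillDirectory = Files.createTempDirectory(spillRoot, "spill");
    }

    void writePage(byte[] serializedPage)
            throws IOException
    {
        if (currentFile == null || currentFileBytes + serializedPage.length > TARGET_FILE_SIZE) {
            rollToNextFile();
        }
        currentFile.write(serializedPage);
        currentFileBytes += serializedPage.length;
    }

    private void rollToNextFile()
            throws IOException
    {
        if (currentFile != null) {
            currentFile.close();
        }
        Path file = spillDirectory.resolve(String.format("%08d.bin", nextFileIndex++));
        currentFile = Files.newOutputStream(file, StandardOpenOption.CREATE_NEW);
        currentFileBytes = 0;
    }

    // Files sorted by name give back the original order; each file can be read by its
    // own thread as long as pages are re-assembled in file-index order.
    List<Path> filesInOrder()
            throws IOException
    {
        try (Stream<Path> files = Files.list(spillDirectory)) {
            return files.sorted(Comparator.comparing((Path path) -> path.getFileName().toString())).toList();
        }
    }

    @Override
    public void close()
            throws IOException
    {
        if (currentFile != null) {
            currentFile.close();
        }
        // cleanup is just a recursive delete of the spill directory
        try (Stream<Path> paths = Files.walk(spillDirectory)) {
            paths.sorted(Comparator.reverseOrder()).forEach(path -> path.toFile().delete());
        }
    }
}
```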

I think this might also make unspill easier to auto-tune, since you can have parallel loaders and simply monitor the ordered output queue of pages. When the queue is not empty, you can reduce readers... you would also want to reduce readers if the time spent waiting for reads is high... anyway, a future optimization.

@sopel39
Member Author

sopel39 commented Jul 18, 2025

@dain @pettyjamesm there are multiple ways to tackle this issue as we can see. I don't see one or the other dramatically simpler or harder. I think we all understand the approach I've proposed, so clearly it's not that complex. Unless this approach is fundamentally broken, I would rather keep it. BTW: you are mostly arguing with AI generated code. None of the lines here were written by me, but the concept and review of the final code come from me.

@sopel39 sopel39 requested a review from pettyjamesm July 18, 2025 18:02
@pettyjamesm
Member

@dain @pettyjamesm there are multiple ways to tackle this issue as we can see. I don't see one or the other dramatically simpler or harder.

I don't agree that the approaches are fundamentally equivalent in terms of complexity. Tracking the current byte offset within a fixed number of open iterators and switching between them during read-back (while maintaining parallelism) is definitely harder to reason about compared to "reading fully" from N files, with each input file being the unit of parallelism. As noted in our offline conversation, there are other upsides to using the file level approach like the ability to delete spill data from disk incrementally and the ability to tune the level of parallelism higher than 1 thread per spill path.

@sopel39
Member Author

sopel39 commented Jul 21, 2025

Tracking the current byte offset

It's really tracking page size, something that has to be done in any approach during spill. For unspill, you suggest replacing page-size tracking with tracking of multiple spill files in spill directories (a new concept in spill!). I would argue it's even more complex, with possible side effects from the OS (what ifs?).

within a fixed number of open iterators and switching between

Iterators won't go away regardless of the approach. In the directory approach they would be nested too (per directory, per file). No complexity saved.

there are other upsides to using the file level approach like the ability to delete spill data from disk incrementally

It's a marginal difference and not really part of the SingleStreamSpiller contract. For example, an improved join implementation could unspill the same partition multiple times (e.g. when unspilling partitions before the probe finishes).

the ability to tune the level of parallelism higher than 1 thread per spill path.

We would sometimes rather reduce parallelism, which is possible with this solution.

Overall, I still don't see any fundamental flaw in this approach and I refuse to refactor PRs (or redo them completely) based on taste :)

@dain
Member

dain commented Jul 22, 2025

There are multiple ways to tackle this issue as we can see. I don't see one or the other dramatically simpler or harder. I think we all understand the approach I've proposed, so clearly it's not that complex. Unless this approach is fundamentally broken, I would rather keep it.

I find the approach quite complex and I think this can and should be done in a much simpler way. For the spilling code, there are very few maintainers, and we by necessity need to keep this as simple as possible. Therefore, I do not think we should merge this code, and we should focus on the simpler design.

@sopel39
Member Author

sopel39 commented Jul 22, 2025

I find the approach quite complex and I think this can and should be done in a much simpler way.

I spent the last few posts explaining why it's not more complex than the other approaches. I think the only alternative here is not to land this or any subsequent PRs that improve spill and join performance. That is also an option. I could also add a feature toggle.

For the spilling code, there are very few maintainers

In general, there are very few active OSS maintainers left. Does that mean the project should stall? That is an option too.

@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Aug 13, 2025
@github-actions

github-actions bot commented Sep 3, 2025

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Sep 3, 2025
@sopel39 sopel39 reopened this Oct 14, 2025
@sopel39 sopel39 force-pushed the oss/multi_file_spill branch from 1ec5446 to 04aceb0 Compare October 14, 2025 16:33
@sopel39
Member Author

sopel39 commented Oct 14, 2025

@pettyjamesm PTAL. I've simplified the approach and used round-robin (RR) to distribute pages across files.
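
For readers following along, a minimal sketch of the round-robin idea (hypothetical code, not the merged implementation): page i goes to file i mod N, so files can be read concurrently while the original page order is restored by walking the same rotation, without any extra index.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of round-robin page distribution across spill files (hypothetical).
class RoundRobinSpillSketch
{
    static List<List<byte[]>> spillRoundRobin(List<byte[]> pagesInSpillOrder, int fileCount)
    {
        List<List<byte[]>> files = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            files.add(new ArrayList<>());
        }
        for (int i = 0; i < pagesInSpillOrder.size(); i++) {
            files.get(i % fileCount).add(pagesInSpillOrder.get(i));
        }
        return files;
    }

    static List<byte[]> unspillInOrder(List<List<byte[]>> files)
    {
        // The rotation is deterministic, so no extra index is needed:
        // page 0 comes from file 0, page 1 from file 1, and so on.
        List<byte[]> pages = new ArrayList<>();
        int fileCount = files.size();
        for (int position = 0; ; position++) {
            List<byte[]> file = files.get(position % fileCount);
            int indexInFile = position / fileCount;
            if (indexInFile >= file.size()) {
                break;
            }
            pages.add(file.get(indexInFile));
        }
        return pages;
    }
}
```

In practice each per-file list would be a separate file read by its own thread; the toy lists here only stand in for that to show why the ordering needs no index.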

@sopel39 sopel39 force-pushed the oss/multi_file_spill branch 2 times, most recently from 81540e0 to df32212 Compare October 14, 2025 16:39
@github-actions github-actions bot removed the stale label Oct 14, 2025
@sopel39 sopel39 force-pushed the oss/multi_file_spill branch from df32212 to 044494a Compare October 15, 2025 10:55
@sopel39 sopel39 requested a review from pettyjamesm October 15, 2025 10:56
HashBuilderOperator unspills sequentially, partition by partition. This commit improves join unspilling performance by making FileSingleStreamSpiller unspill a single partition in parallel. FileSingleStreamSpiller is enhanced so it can spill to and unspill from multiple files.
@sopel39 sopel39 force-pushed the oss/multi_file_spill branch from 044494a to 806728b Compare October 17, 2025 11:28
@sopel39 sopel39 requested a review from pettyjamesm October 17, 2025 11:29
Member

@pettyjamesm pettyjamesm left a comment

LGTM, thanks @sopel39

@sopel39 sopel39 merged commit ffb3722 into trinodb:master Oct 20, 2025
98 checks passed
@sopel39 sopel39 deleted the oss/multi_file_spill branch October 20, 2025 08:49
@sopel39
Member Author

sopel39 commented Oct 20, 2025

@pettyjamesm thanks for review

@github-actions github-actions bot added this to the 478 milestone Oct 20, 2025
return create(types, spillContext, memoryContext, false);
}

SingleStreamSpiller create(List<Type> types, SpillContext spillContext, LocalMemoryContext memoryContext, boolean parallelSpill);
Member

SingleStreamSpiller maintains the order of data being spilled and read back (FIFO)

Is that still the case when it's created with parallelSpill?

If yes, we should deprecate the old method and use the new one.
If no, the SingleStreamSpiller interface needs an update.
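
For reference, the first option could look roughly like the sketch below (hypothetical names, not the actual Trino interface): the old factory method becomes a deprecated default that delegates to the new overload with parallel spill disabled, so existing callers keep the current FIFO behavior.

```java
// Hypothetical sketch of deprecating the old factory method in favor of the new overload.
interface SpillerFactorySketch<S>
{
    @Deprecated
    default S create(String spillPath)
    {
        return create(spillPath, false); // old single-stream behavior
    }

    S create(String spillPath, boolean parallelSpill);
}
```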

Development

Successfully merging this pull request may close these issues.

Improve unspilling concurrency

4 participants