Conversation

@sopel39
Member

@sopel39 sopel39 commented Jun 26, 2025

HashBuilderOperator unspills sequentially, partition by partition. This commit improves join unspilling performance by making FileSingleStreamSpiller unspill a single partition in parallel. FileSingleStreamSpiller is enhanced so it can spill to and unspill from multiple files.

This reduces join unspilling time from minutes to seconds on larger machines.

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of spilling join queries. ({issue}`26076`)

Fixes #26007

@sopel39 sopel39 requested a review from raunaqmorarka June 26, 2025 13:40
@cla-bot cla-bot bot added the cla-signed label Jun 26, 2025
@sopel39
Member Author

sopel39 commented Jun 26, 2025

cc @osscm

@raunaqmorarka raunaqmorarka requested a review from pettyjamesm July 7, 2025 21:16
Member

@pettyjamesm pettyjamesm left a comment

@sopel39 - I'm not sure I understand the purpose of segmenting the individual files at 64MB boundaries. The code comments mention that we avoid materializing more than one segment at a time, but the code already avoided materializing more than one page from the input stream on read back at a time. Segmenting the files is adding a good bit of complexity to the implementation and it seems unnecessary on a first read through.

@sopel39
Member Author

sopel39 commented Jul 9, 2025

@sopel39 - I'm not sure I understand the purpose of segmenting the individual files at 64MB boundaries

Currently, when a join hits spilling, its parallelism is reduced to a single IO stream (build-side unspilling) and a single CPU thread (reconstruction of the lookup source). This PR addresses the first issue (the single IO stream). It was observed that this PR reduces the IO part of join unspilling from minutes to seconds.

Segmenting the files is adding a good bit of complexity to the implementation and it seems unnecessary on a first read through.

Segments are deterministic. When reading segments back, it is possible to restore the original page order without any extra index. This allows segments to be read back in parallel, which is the purpose of this PR.
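
For illustration, here is a toy sketch of why a deterministic segmentation rule removes the need for an index. This is not the PR's code; the 64MB cumulative-size cut rule and all names are assumptions for the sketch: each segment holds a contiguous run of the original page sequence, so segments can be read concurrently and the original order restored by concatenating them in segment order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy illustration of deterministic segmentation (hypothetical, not the PR's code).
class DeterministicSegmentsSketch
{
    private static final long SEGMENT_SIZE = 64L << 20; // cut a new segment every ~64MB

    // The segment boundaries are a pure function of the page sizes, so each segment
    // always holds a contiguous run of the original page sequence.
    static List<List<byte[]>> cutIntoSegments(List<byte[]> pagesInSpillOrder)
    {
        List<List<byte[]>> segments = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (byte[] page : pagesInSpillOrder) {
            if (!current.isEmpty() && currentBytes + page.length > SEGMENT_SIZE) {
                segments.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(page);
            currentBytes += page.length;
        }
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    // Segments can be fetched concurrently; the original page order is restored simply
    // by concatenating the segments in segment order, with no extra index.
    static List<byte[]> readBackInParallel(List<List<byte[]>> segments)
    {
        List<CompletableFuture<List<byte[]>>> futures = segments.stream()
                .map(segment -> CompletableFuture.supplyAsync(() -> segment)) // stands in for reading a segment from disk
                .toList();
        List<byte[]> pages = new ArrayList<>();
        for (CompletableFuture<List<byte[]>> future : futures) {
            pages.addAll(future.join());
        }
        return pages;
    }
}
```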

@sopel39 sopel39 requested a review from pettyjamesm July 9, 2025 09:58
Member

@pettyjamesm pettyjamesm left a comment

This still seems massively more complicated than necessary. Wouldn't it be simpler to cut files at the segment boundary and support "pipelined" parallelism instead (i.e.: allow concurrent read-back of up to n files)? It seems like that would give you more control over the read back concurrency and simplify the logic significantly.

@sopel39
Member Author

sopel39 commented Jul 10, 2025

Wouldn't it be simpler to cut files at the segment boundary and support "pipelined" parallelism instead (i.e.: allow concurrent read-back of up to n files)?

You mean create a new file for each segment? It would create a massive number of files that have to be tracked. Not ideal if you want to keep memory and resources constrained. Also, it wouldn't necessarily make it simpler as you still have to keep track of segment size, but now you also have to keep track of potentially hundreds of thousands of spill files.

@dain
Member

dain commented Jul 17, 2025

I think this can be simplified using the approach James suggested. For the spill, you can create a directory (in each target) for the spill and then number the files sequentially. Each file is 64MB, and together they form a total order. If the file count becomes a problem, you could scale up the file size as more files are created, but I don't expect this to be a problem on local disks.

The memory requirement is just the directory name and the max sequence number. This also makes cleanup on local disks trivial, as you can just delete the directories. You could also delete in parallel, but I don't expect that to be a problem on local disks.
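
A rough sketch of this directory-per-spill layout (hypothetical class and names, not code from this PR): files are numbered sequentially and cut at roughly 64MB, so the file index alone defines the total order, and cleanup is a recursive delete of the directory. Page framing, serialization, and compression are omitted.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: one directory per spill, sequentially numbered ~64MB files.
class DirectorySpillSketch
        implements AutoCloseable
{
    private static final long TARGET_FILE_SIZE = 64L << 20; // ~64MB per file

    private final Path spillDirectory;
    private int nextFileIndex;
    private long currentFileBytes;
    private OutputStream currentFile;

    DirectorySpillSketch(Path spillRoot)
            throws IOException
    {
        this.spillDirectory = Files.createTempDirectory(spillRoot, "spill");
    }

    void writePage(byte[] serializedPage)
            throws IOException
    {
        if (currentFile == null || currentFileBytes + serializedPage.length > TARGET_FILE_SIZE) {
            rollToNextFile();
        }
        currentFile.write(serializedPage);
        currentFileBytes += serializedPage.length;
    }

    private void rollToNextFile()
            throws IOException
    {
        if (currentFile != null) {
            currentFile.close();
        }
        Path file = spillDirectory.resolve(String.format("%08d.bin", nextFileIndex++));
        currentFile = Files.newOutputStream(file, StandardOpenOption.CREATE_NEW);
        currentFileBytes = 0;
    }

    // Files sorted by name give back the original order; each file can be read by its
    // own thread as long as pages are re-assembled in file-index order.
    List<Path> filesInOrder()
            throws IOException
    {
        try (Stream<Path> files = Files.list(spillDirectory)) {
            return files.sorted(Comparator.comparing((Path path) -> path.getFileName().toString())).toList();
        }
    }

    @Override
    public void close()
            throws IOException
    {
        if (currentFile != null) {
            currentFile.close();
        }
        // cleanup is just a recursive delete of the spill directory
        try (Stream<Path> paths = Files.walk(spillDirectory)) {
            paths.sorted(Comparator.reverseOrder()).forEach(path -> path.toFile().delete());
        }
    }
}
```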

I think this might also make unspill easier to auto-tune, since you can have parallel loaders and simply monitor the ordered output queue of pages. When the queue is not empty, you can reduce readers... you would also want to reduce readers if the time spent waiting for reads is high... anyway, a future optimization.

@sopel39
Member Author

sopel39 commented Jul 18, 2025

@dain @pettyjamesm there are multiple ways to tackle this issue as we can see. I don't see one or the other dramatically simpler or harder. I think we all understand the approach I've proposed, so clearly it's not that complex. Unless this approach is fundamentally broken, I would rather keep it. BTW: you are mostly arguing with AI generated code. None of the lines here were written by me, but the concept and review of the final code come from me.

@sopel39 sopel39 requested a review from pettyjamesm July 18, 2025 18:02
@pettyjamesm
Member

@dain @pettyjamesm there are multiple ways to tackle this issue as we can see. I don't see one or the other dramatically simpler or harder.

I don't agree that the approaches are fundamentally equivalent in terms of complexity. Tracking the current byte offset within a fixed number of open iterators and switching between them during read-back (while maintaining parallelism) is definitely harder to reason about compared to "reading fully" from N files, with each input file being the unit of parallelism. As noted in our offline conversation, there are other upsides to using the file level approach like the ability to delete spill data from disk incrementally and the ability to tune the level of parallelism higher than 1 thread per spill path.

@sopel39
Member Author

sopel39 commented Jul 21, 2025

Tracking the current byte offset

It's really tracking page size, something that has to be done in any approach during spill. For unspill, you suggest replacing page-size tracking with tracking of multiple spill files in spill directories (a new concept in spill!). I would argue it's even more complex, with possible side effects from the OS (what ifs?).

within a fixed number of open iterators and switching between

Iterators won't go away regardless of the approach. In the directory approach they would be nested too (per directory, per file). No complexity saved.

there are other upsides to using the file level approach like the ability to delete spill data from disk incrementally

It's a marginal difference and not really part of the SingleStreamSpiller contract. For example, an improved join implementation could unspill the same partition multiple times (e.g. when unspilling partitions before the probe finishes).

the ability to tune the level of parallelism higher than 1 thread per spill path.

We would sometimes rather reduce parallelism, which is possible with this solution.

Overall, I still don't see any fundamental flaw in this approach and I refuse to refactor PRs (or redo them completely) based on taste :)

@dain
Member

dain commented Jul 22, 2025

There are multiple ways to tackle this issue as we can see. I don't see one or the other dramatically simpler or harder. I think we all understand the approach I've proposed, so clearly it's not that complex. Unless this approach is fundamentally broken, I would rather keep it.

I find the approach quite complex and I think this can and should be done in a much simpler way. For the spilling code, there are very few maintainers, and we by necessity need to keep this as simple as possible. Therefore, I do not think we should merge this code, and we should focus on the simpler design.

@sopel39
Member Author

sopel39 commented Jul 22, 2025

I find the approach quite complex and I think this can and should be done in a much simpler way.

I spent the last few posts explaining why it's not more complex than the other approaches. I think the only alternative here is not to land this or any subsequent PRs that improve spill and join performance. That is also an option. I could also add a feature toggle.

For the spilling code, there are very few maintainers

In general, there are very few active OSS maintainers left. Does that mean the project should stall? That is an option too.

@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Aug 13, 2025
@github-actions

github-actions bot commented Sep 3, 2025

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Sep 3, 2025
@sopel39 sopel39 reopened this Oct 14, 2025
@sopel39 sopel39 force-pushed the oss/multi_file_spill branch from 1ec5446 to 04aceb0 Compare October 14, 2025 16:33
@sopel39
Member Author

sopel39 commented Oct 14, 2025

@pettyjamesm PTAL. I've simplified the approach and used round-robin (RR) to distribute pages across files.
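
For readers following along, a minimal sketch of the round-robin idea (hypothetical code, not the merged implementation): page i goes to file i mod N, so files can be read concurrently while the original page order is restored by walking the same rotation, without any extra index.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of round-robin page distribution across spill files (hypothetical).
class RoundRobinSpillSketch
{
    static List<List<byte[]>> spillRoundRobin(List<byte[]> pagesInSpillOrder, int fileCount)
    {
        List<List<byte[]>> files = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            files.add(new ArrayList<>());
        }
        for (int i = 0; i < pagesInSpillOrder.size(); i++) {
            files.get(i % fileCount).add(pagesInSpillOrder.get(i));
        }
        return files;
    }

    static List<byte[]> unspillInOrder(List<List<byte[]>> files)
    {
        // The rotation is deterministic, so no extra index is needed:
        // page 0 comes from file 0, page 1 from file 1, and so on.
        List<byte[]> pages = new ArrayList<>();
        int fileCount = files.size();
        for (int position = 0; ; position++) {
            List<byte[]> file = files.get(position % fileCount);
            int indexInFile = position / fileCount;
            if (indexInFile >= file.size()) {
                break;
            }
            pages.add(file.get(indexInFile));
        }
        return pages;
    }
}
```

In practice each per-file list would be a separate file read by its own thread; the toy lists here only stand in for that to show why the ordering needs no index.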

@sopel39 sopel39 force-pushed the oss/multi_file_spill branch 2 times, most recently from 81540e0 to df32212 Compare October 14, 2025 16:39
@github-actions github-actions bot removed the stale label Oct 14, 2025
@sopel39 sopel39 force-pushed the oss/multi_file_spill branch from df32212 to 044494a Compare October 15, 2025 10:55
@sopel39 sopel39 requested a review from pettyjamesm October 15, 2025 10:56
HashBuilderOperator unspills sequentially, partition by partition. This commit improves join unspilling performance by making FileSingleStreamSpiller unspill a single partition in parallel. FileSingleStreamSpiller is enhanced so it can spill to and unspill from multiple files.
@sopel39 sopel39 force-pushed the oss/multi_file_spill branch from 044494a to 806728b Compare October 17, 2025 11:28
@sopel39 sopel39 requested a review from pettyjamesm October 17, 2025 11:29
Member

@pettyjamesm pettyjamesm left a comment

LGTM, thanks @sopel39

@sopel39 sopel39 merged commit ffb3722 into trinodb:master Oct 20, 2025
98 checks passed
@sopel39 sopel39 deleted the oss/multi_file_spill branch October 20, 2025 08:49
@sopel39
Member Author

sopel39 commented Oct 20, 2025

@pettyjamesm thanks for review

@github-actions github-actions bot added this to the 478 milestone Oct 20, 2025
return create(types, spillContext, memoryContext, false);
}

SingleStreamSpiller create(List<Type> types, SpillContext spillContext, LocalMemoryContext memoryContext, boolean parallelSpill);
Member

SingleStreamSpiller maintains the order of data being spilled and read back (FIFO)

Is that still the case when it's created with parallelSpill?

If yes, we should deprecate the old method and use the new one.
If no, the SingleStreamSpiller interface needs an update.
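
For reference, the first option could look roughly like the sketch below (hypothetical names, not the actual Trino interface): the old factory method becomes a deprecated default that delegates to the new overload with parallel spill disabled, so existing callers keep the current FIFO behavior.

```java
// Hypothetical sketch of deprecating the old factory method in favor of the new overload.
interface SpillerFactorySketch<S>
{
    @Deprecated
    default S create(String spillPath)
    {
        return create(spillPath, false); // old single-stream behavior
    }

    S create(String spillPath, boolean parallelSpill);
}
```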

Development

Successfully merging this pull request may close these issues.

Improve unspilling concurrency

4 participants