Is your feature request related to a problem or challenge?
While working on #7179 I noticed a potential improvement
The key observation is that merging K sorted streams of total rows N:
- takes time proportional to O(N*K)
- is a single threaded operation
K is often called the "Fan In" of the merge.
The implementation of ExternalSorter::in_mem_sort_stream will effectively merge all the buffered batches at once. This can be a very large fan in: 100s or 1000s of RecordBatches.
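To make the "merge everything at once" shape concrete, here is a minimal sketch of a single K-way merge, assuming plain `i64` values in place of RecordBatches and a binary heap in place of whatever structure the real merge uses. `k_way_merge` is a hypothetical name for illustration, not a DataFusion API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge `runs` (each already sorted ascending) into one sorted Vec.
/// The fan-in K is `runs.len()`; the heap holds at most one entry per run.
fn k_way_merge(runs: &[Vec<i64>]) -> Vec<i64> {
    let total: usize = runs.iter().map(|r| r.len()).sum();
    let mut out = Vec::with_capacity(total);

    // Heap entries are (value, run index, offset within run).
    // `Reverse` turns the standard max-heap into a min-heap.
    let mut heap = BinaryHeap::with_capacity(runs.len());
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    // Single-threaded loop: every output row goes through the one heap.
    while let Some(Reverse((value, run_idx, offset))) = heap.pop() {
        out.push(value);
        // Advance the cursor of the run that produced the smallest value.
        if let Some(&next) = runs[run_idx].get(offset + 1) {
            heap.push(Reverse((next, run_idx, offset + 1)));
        }
    }
    out
}
```

With hundreds or thousands of buffered batches, `runs.len()` is the fan-in, and the whole merge runs in this one loop on one core.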
I think it is actually O(n log(k)) which is the same as the tournament merge.
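A rough comparison count may help here. It assumes a heap or tournament tree that spends about log2(F) comparisons per output row for a fan-in of F, and it ignores rounding in the number of levels. The symbol F (the per-merge fan-in of a cascade) is introduced only for this estimate.

```latex
C_{\text{single}} \approx N \log_2 K
\qquad
C_{\text{cascade}} \approx
  \underbrace{\log_F K}_{\text{levels}} \cdot
  \underbrace{N \log_2 F}_{\text{per level}}
  = N \log_2 K
```

For example, with K = 1000 and F = 10 there are 3 levels at about 3.32 comparisons per row each, roughly 9.97 N in total, the same as log2(1000) N for a single merge. Under these assumptions a cascade does not change the comparison count; what it adds is one extra pass over every row per level.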
> the intermediate merges can be run in parallel on multiple cores
This is the big one imo, the part that will be tricky is making sure that any cursor conversion is only done once per row on input, and any materialization of the final sorted order done once on output. The sorting of these cursors can then be done in parallel. FWIW I would really like to also be able to use the sort cursors for the initial sort, but I have not got around to this yet
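As a rough illustration of running intermediate merges on multiple cores, the sketch below merges each pair of runs on its own scoped thread and repeats level by level. It assumes plain `i64` values and a fan-in of 2 for brevity, and it does not model the row-cursor conversion or final materialization discussed above. `merge_two`, `parallel_merge_level` and `parallel_cascaded_merge` are hypothetical names, not DataFusion functions.

```rust
use std::thread;

/// Merge two sorted slices into one sorted Vec (fan-in of 2, for brevity).
fn merge_two(a: &[i64], b: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// Run one level of the cascade, merging each pair of runs on its own thread.
fn parallel_merge_level(runs: &[Vec<i64>]) -> Vec<Vec<i64>> {
    thread::scope(|scope| {
        let handles: Vec<_> = runs
            .chunks(2)
            .map(|pair| {
                scope.spawn(move || match pair {
                    [a, b] => merge_two(a, b),
                    [a] => a.clone(), // odd run out: carried to the next level
                    _ => Vec::new(),  // `chunks` never yields an empty slice
                })
            })
            .collect();
        // Joining in spawn order keeps the output runs in input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("merge thread panicked"))
            .collect::<Vec<Vec<i64>>>()
    })
}

/// Repeat parallel levels until a single sorted run remains.
fn parallel_cascaded_merge(mut runs: Vec<Vec<i64>>) -> Vec<i64> {
    while runs.len() > 1 {
        runs = parallel_merge_level(&runs);
    }
    runs.pop().unwrap_or_default()
}
```

One thread per pair keeps the sketch short; a real implementation would more likely hand the merges to an existing task pool than spawn a thread for each one.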
alamb changed the title from "Improve performance of large sorts with Cascaded merge" to "Improve performance of large sorts with Cascaded merge / tree" on Jan 16, 2024
Describe the solution you'd like
A classical approach to such sorts is to use a "cascaded merge", which uses a series of merge operations, each with a limited fan-in (e.g. 10).
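The cascade can be sketched as follows, assuming plain `i64` values instead of RecordBatches and a heap-based merge for each group. `merge_group` and `cascaded_merge` are hypothetical names for illustration and do not reflect how DataFusion structures its sort.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge a group of sorted runs with a binary heap (one cursor per run).
fn merge_group(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut out = Vec::with_capacity(runs.iter().map(|r| r.len()).sum());
    // Heap entries are (value, run index, offset); `Reverse` gives a min-heap.
    let mut heap = BinaryHeap::with_capacity(runs.len());
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    while let Some(Reverse((value, run_idx, offset))) = heap.pop() {
        out.push(value);
        if let Some(&next) = runs[run_idx].get(offset + 1) {
            heap.push(Reverse((next, run_idx, offset + 1)));
        }
    }
    out
}

/// Cascaded merge: repeatedly merge groups of at most `fan_in` runs
/// until a single sorted run remains.
fn cascaded_merge(mut runs: Vec<Vec<i64>>, fan_in: usize) -> Vec<i64> {
    assert!(fan_in >= 2, "fan_in must be at least 2 to make progress");
    while runs.len() > 1 {
        // One level of the cascade: K runs become ceil(K / fan_in) runs.
        let next: Vec<Vec<i64>> = runs
            .chunks(fan_in)
            .map(|group| merge_group(group))
            .collect();
        runs = next;
    }
    runs.pop().unwrap_or_default()
}
```

With 25 input runs and `fan_in = 10`, the first level produces 3 runs and the second level produces the final one. No single merge ever sees more than 10 inputs, and the groups within a level are independent of each other.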
This is often better because:
- the merge is O(N*ln(N)*ln(K)): there is some additional overhead of ln(N), as the same row must now be compared several times
It would be awesome if someone wanted to:
The sort benchmark (TODO) (thanks @jaylmiller!) may be interesting:
Describe alternatives you've considered
Another potential variation might be to get more cores involved in the merging by parallelizing the merge, as described in the Morsel-Driven Parallelism paper
Additional context
No response