Add PagedMergeSort, a merge sort using O(√n) space #71
Adds the PagedMergeSort algorithm, a merge sort with O(sqrt n) auxiliary space usage.
Codecov Report
@@ Coverage Diff @@
## master #71 +/- ##
==========================================
+ Coverage 89.16% 90.80% +1.63%
==========================================
Files 1 1
Lines 360 511 +151
==========================================
+ Hits 321 464 +143
- Misses 39 47 +8
d2dd7f7
to
4d6e518
Compare
- unify function signatures - remove trailing whitespace - rename variables - always use space after comma in function definitions/calls - improve comments
Thank you! This is a valuable contribution! I hope you don't mind some critique :) It's a very large sorting algorithm so I have yet to read it cover-to-cover. If possible, it would be nice to get similar or better performance with a simpler implementation, but that can be tricky. That said, I have performed some black-box analysis with a couple results:
First, a correctness issue, I ran this script to check stability and found a counterexample:
@testset begin
    for T in (Float64, Int, UInt8)
        for alg in [ThreadedPagedMergeSort, PagedMergeSort]
            for order in [Forward, Reverse, By(identity), By(abs), By(Returns(0)), By(Base.Fix2(÷, 100))]
                fails = Int[]
                for n in vcat(0:30, 40:10:100, 110:50:1000)
                    v = rand(T, n)
                    # @test sort(v; order) == sort(v; alg, order)
                    sort(v; order) == sort(v; alg, order) || push!(fails, n)
                end
                if !isempty(fails)
                    println("$(T) $(alg) $(typeof(order))\n\t$(fails)")
                end
            end
        end
    end
end
julia> issorted(sort(1:100, by=Returns(0), alg=PagedMergeSort))
false
Second, some notes on benchmarking, use
using BenchmarkTools, SortingAlgorithms, Random, Test
versioninfo()
for i in 0:5
    n = 17^i
    println("sort!(rand(Int, $n))")
    v = rand(Int, n)
    print("Default: "); @btime sort!($v) setup=(rand!($v)) evals=1
    print("Paged MS: "); @btime sort!($v; alg=PagedMergeSort) setup=(rand!($v)) evals=1
    print("Threaded: "); @btime sort!($v; alg=ThreadedPagedMergeSort) setup=(rand!($v)) evals=1
end
On my computer, it produces the following results:
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin21.4.0)
CPU: 4 × Intel(R) Core(TM) i5-8210Y CPU @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 2 on 2 virtual cores
sort!(rand(Int, 1))
Default: 63.000 ns (0 allocations: 0 bytes)
Paged MS: 64.000 ns (0 allocations: 0 bytes)
Threaded: 63.000 ns (0 allocations: 0 bytes)
sort!(rand(Int, 17))
Default: 248.000 ns (0 allocations: 0 bytes)
Paged MS: 390.000 ns (2 allocations: 256 bytes)
Threaded: 386.000 ns (2 allocations: 256 bytes)
sort!(rand(Int, 289))
Default: 5.214 μs (1 allocation: 2.44 KiB)
Paged MS: 8.231 μs (2 allocations: 704 bytes)
Threaded: 8.453 μs (2 allocations: 704 bytes)
sort!(rand(Int, 4913))
Default: 91.831 μs (3 allocations: 46.67 KiB)
Paged MS: 206.361 μs (2 allocations: 2.38 KiB)
Threaded: 203.543 μs (2 allocations: 2.38 KiB)
sort!(rand(Int, 83521))
Default: 1.757 ms (3 allocations: 660.80 KiB)
Paged MS: 4.627 ms (2 allocations: 9.38 KiB)
Threaded: 3.074 ms (54 allocations: 23.03 KiB)
sort!(rand(Int, 1419857))
Default: 39.456 ms (3 allocations: 10.84 MiB)
Paged MS: 111.730 ms (3 allocations: 37.48 KiB)
Threaded: 69.457 ms (57 allocations: 79.28 KiB)
ce9ad62
to
88e5b22
Compare
Here are a few inline comments from looking over the parts of the code I understand. I'll need to get a better understanding of block sorts before I can give a full review.
Thank you for taking the time to review this PR. I am looking forward to the critique :).
That was caused by the initial optimizations. Base sort returned a range, while all algorithms from SortingAlgorithms returned a vector. Fixed now after implementing initial optimizations.
This algorithm is most useful for sorting large amounts of data, where
I found Pluto notebooks that load the correct package from GitHub very convenient when benchmarking on different PCs that are not set up for package development. Run the notebook -> done. But you are right, if you have already checked out the branch for review, a simple script is even more convenient. I have added a benchmark script to PR #72.
There are some ways to simplify the code, but all are less performant:
Feel free to suggest improvements in this regard. I will continue trying to simplify, too. But this is the main reason I did not go for an algorithm with O(1) space. While not especially elegant, this PR is much shorter and conceptually simpler than HolyGrailSort, for example.
I'm still seeing a problem:
julia> PagedMergeSort
Base.Sort.MissingOptimization(
Base.Sort.BoolOptimization(
Base.Sort.Small{10}(
Base.Sort.InsertionSortAlg(),
Base.Sort.IEEEFloatOptimization(
SortingAlgorithms.PagedMergeSortAlg()))))
julia> sort(1:100, by=Returns(0), alg=PagedMergeSort)
100-element Vector{Int64}:
1
2
3
4
5
6
7
14
15
16
⋮
98
99
100
83
84
85
86
87
88
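The failure above can be reduced to a small check. The following is a minimal sketch, not part of the PR: `keeps_equal_keys_in_order` is a hypothetical helper name, and it is exercised here only with Base's `MergeSort` and `InsertionSort`, which are documented as stable. With a constant key every element compares equal, so a stable sort has to return the input order unchanged.

```julia
# Hypothetical helper (not from the PR): with a constant key, all elements
# compare equal, so a stable algorithm must leave the input order untouched.
function keeps_equal_keys_in_order(alg; n = 100)
    v = collect(1:n)
    return sort(v; alg = alg, by = _ -> 0) == v
end

keeps_equal_keys_in_order(MergeSort)      # Base's merge sort is stable
keeps_equal_keys_in_order(InsertionSort)  # so is insertion sort
```

Passing `PagedMergeSort` from the PR branch as `alg` would be the analogous check for the algorithm under review.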
This is not a block sort as defined here. That's why I called it PagedMergeSort and not BlockMergeSort, although block merge sort would be a fitting name. Btw, is there a good way to make the gif in the OP available in the documentation somehow? I think it is very helpful to understand the merge procedure.
Co-authored-by: Lilith Orion Hafner <[email protected]>
If I understand it correctly, what you describe is in essence the current solution. But I will try to improve it to make this obvious when looking at the code. And maybe I'll manage to remove the dependency on StaticArrays, by making the control flow explicit, following your suggestion. Looping through the three free pages at the end of the subarrays is accomplished using The linear scan is tracked with
Update: Success! The dependency is eliminated and I think the code is clearer now.
and always use "page" instead of "block"
and eliminate dependency on StaticArrays
a880a05
to
cb89f11
Compare
Great work! It is much simpler now. I think it can be even simpler, though. I've put some suggestions in a PR to the head branch of this PR because I wanted to make sure that my suggestions worked before recommending them.
refactor PagedMergeSort
Rename buffer/buf to scratch or to t?
Base.Sort's "scratch" is an implementation detail and will likely change semantics to use buffers once they are a thing, so it's for the best not to integrate too closely with that system.
However, I still think sharing a common name might be a good idea. IIRC, Base uses `t` and `scratch`, and the (obsolete) radix sort in SortingAlgorithms.jl uses `ts`.
Right now, PagedMergeSort uses both `t` and `buf`, which brings the total number of names for this concept up to 3 in this repo and 4 in this repo+Base. I recommend using `scratch` as an abbreviation for "scratch space".
CI only tests the fallback to PagedMergeSort for ThreadedPagedMergeSort.
That's not good. Maybe we should split off PagedMergeSort and try to merge it first and separately, and do ThreadedPagedMergeSort in a second PR.
src/SortingAlgorithms.jl
while_condition1(offset) = (_,_,k) -> k <= offset + pagesize
while a < m-pagesize && b < hi-pagesize
    pages = next_page!(pageLocations, pages, pagesize, lo, a)
    offset = page_offset(pages.current)
    a,b,_ = merge!(while_condition1(offset),v,v,v,o,a,b,offset+1)
end
# merge until either A or B is empty or the last page is reached
k, offset = nothing, nothing
while_condition2(offset) = (a,b,k) -> k <= offset + pagesize && a <= m && b <= hi
while a <= m && b <= hi && pages.currentNumber + 3 < nPages
    pages = next_page!(pageLocations, pages, pagesize, lo, a)
    offset = page_offset(pages.current)
    a,b,k = merge!(while_condition2(offset),v,v,v,o,a,b,offset+1)
end
These while loops and while_conditions feel strange to me; I wonder if there is a better approach.
It was the best way to implement this I could think of. But of course I am open to suggestions for improvement.
use copyto! remove premature variable definitions
These are good suggestions imho. I have implemented them in 3f337f0 and d3ea719. Except for rebasing, my checklist of open questions is now completed.
Occurred for small inputs when not using initial optimizations.
LGTM!
Thank you! This is a nice algorithm to have.
Thank you for your efforts reviewing and massively improving the code.
As far as I am aware, there is no stable sorting algorithm in Julia that uses less than O(n) auxiliary space. This PR fixes that, by implementing PagedMergeSort, a merge sort using O(√n) space while achieving about the same speed as the regular MergeSort from Base.Sort.
This is done by using a merge routine that splits the array into blocks/pages of size √n, merges into these blocks, and then rearranges them using a page table. This is illustrated below. The auxiliary space is on the right.
The basic idea is laid out here, but I use a different page table and reordering scheme than the author of that post. By using the inverse permutation in the page table (i.e. storing the location of the data belonging in block i in the page table at index i), we can follow the permutation cycles during reordering without swapping, copying the correct data into the previously emptied location.
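The cycle-following reorder described above can be sketched in isolation. This is an illustrative sketch, not the PR's implementation: `reorder_pages!`, `source`, and the single spare page `scratch` are names and choices assumed here. `source[i]` holds the index of the page whose data belongs at page position `i`, so each page is copied once into the location that was emptied by the previous copy.

```julia
# Hypothetical helper, not the PR's code: reorder the pages of `v` in place.
# `source[i]` is the index of the page that currently holds the data
# belonging at page position `i` (an inverse-permutation page table).
function reorder_pages!(v::AbstractVector, source::Vector{Int}, pagesize::Int)
    npages = length(source)
    @assert length(v) == npages * pagesize
    scratch = similar(v, pagesize)          # one spare page (assumption)
    page(i) = (i - 1) * pagesize + 1        # first index of page i
    for start in 1:npages
        source[start] == start && continue  # page is already in place
        # Save the page at `start`; its location is now free.
        copyto!(scratch, 1, v, page(start), pagesize)
        hole = start
        while source[hole] != start
            from = source[hole]
            # Fill the free location with the page that belongs there.
            copyto!(v, page(hole), v, page(from), pagesize)
            source[hole] = hole             # mark this position as done
            hole = from                     # the copied page's old slot is free
        end
        # Close the cycle with the page saved at the beginning.
        copyto!(v, page(hole), scratch, 1, pagesize)
        source[hole] = hole
    end
    return v
end

v = [30, 31, 10, 11, 20, 21]        # three pages of two elements each
reorder_pages!(v, [2, 3, 1], 2)     # v becomes [10, 11, 20, 21, 30, 31]
```

Each page is moved exactly once per cycle plus one save and one restore, which is why no pairwise swapping is needed.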
The additional data movement from reordering is almost negligible compared to the merging, and the main merging loop has only one condition (merge until block is full) as opposed to two (merge until either source array runs out), so the performance is about the same as the regular merge sort.
At deeper recursion levels, where the scratch space is big enough, normal merging is used, where one input is copied into the scratch space. When the scratch space is large enough to hold the complete subarray, the input is merged interleaved from both sides, which increases performance for random data, making PagedMergeSort actually faster than MergeSort. (But it decreases performance for presorted data.) Benchmarks are below. They can be replicated by running this Pluto notebook.
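For reference, the "normal merging" case, where one input is copied into the scratch space, looks roughly as follows. This is a generic textbook sketch under assumed names (`merge_with_scratch!`, `scratch`), using `isless` as the comparison; it does not show the interleaved two-sided variant and is not taken from the PR.

```julia
# Generic sketch (not the PR's code): merge the sorted runs v[lo:m] and
# v[m+1:hi]. `scratch` must hold at least m - lo + 1 elements.
function merge_with_scratch!(v::AbstractVector, lo::Int, m::Int, hi::Int,
                             scratch::AbstractVector)
    n_left = m - lo + 1
    copyto!(scratch, 1, v, lo, n_left)   # move the left run out of the way
    a, b, k = 1, m + 1, lo               # scratch index, right-run index, output index
    while a <= n_left && b <= hi
        # Take from the right run only when it is strictly smaller,
        # so equal elements keep their original order (stability).
        if isless(v[b], scratch[a])
            v[k] = v[b]
            b += 1
        else
            v[k] = scratch[a]
            a += 1
        end
        k += 1
    end
    # Copy what is left of the left run; leftovers of the right run
    # are already in their final position.
    while a <= n_left
        v[k] = scratch[a]
        a += 1
        k += 1
    end
    return v
end

v = [1, 4, 7, 2, 3, 9]
merge_with_scratch!(v, 1, 3, 6, similar(v, 3))   # v becomes [1, 2, 3, 4, 7, 9]
```

Note the two loop conditions (`a <= n_left && b <= hi`); the paged merge described earlier avoids one of them by merging until the current page is full.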
If someone needs a sorting algorithm with less than O(n) memory usage, sorting is probably a significant aspect of their whole program, so I also included a multithreaded version of PagedMergeSort.
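As a rough illustration of how a merge sort can be parallelized with Julia tasks, here is a minimal sketch. It is not ThreadedPagedMergeSort: `threaded_mergesort!` and `cutoff` are hypothetical names, and the merge step copies the whole left run, so it uses O(n) scratch rather than O(√n). It only shows the task structure: one half is sorted in a spawned task while the current task sorts the other half.

```julia
# Illustrative sketch only (not the PR's ThreadedPagedMergeSort).
function threaded_mergesort!(v::Vector, lo::Int = 1, hi::Int = length(v);
                             cutoff::Int = 4096)
    hi <= lo && return v
    if hi - lo < cutoff
        sort!(view(v, lo:hi); alg = MergeSort)   # small ranges: sort serially
        return v
    end
    m = (lo + hi) >>> 1
    # Sort the left half in a separate task and the right half in this one.
    left = Threads.@spawn threaded_mergesort!(v, lo, m; cutoff = cutoff)
    threaded_mergesort!(v, m + 1, hi; cutoff = cutoff)
    wait(left)
    # Stable merge of v[lo:m] and v[m+1:hi] via a copy of the left run.
    left_run = v[lo:m]
    a, b, k = 1, m + 1, lo
    while a <= length(left_run) && b <= hi
        if isless(v[b], left_run[a])
            v[k] = v[b]
            b += 1
        else
            v[k] = left_run[a]
            a += 1
        end
        k += 1
    end
    while a <= length(left_run)
        v[k] = left_run[a]
        a += 1
        k += 1
    end
    return v
end

data = [mod(i * 7919, 10007) for i in 1:20000]
threaded_mergesort!(copy(data)) == sort(data)
```

The two halves touch disjoint index ranges, so the spawned task and the current task never write to the same elements; the merge only starts after `wait(left)` returns.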
Checklist before merging
Some notes
Benchmarks
TLDR: PagedMergeSort is slightly faster than MergeSort, but slower than the default radix sort on random data. It allocates less than 400 KiB to sort a 1 GiB vector. Both MergeSort and PagedMergeSort are faster for sortperm! than the default ScratchQuicksort when sorting large vectors. The speedup depends on memory latency. ThreadedPagedMergeSort is the fastest algorithm for all tested inputs and machines (but it is the only multithreaded one).
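The memory figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that the auxiliary space amounts to about four buffers of √n machine words (for example a few pages plus a page table); `estimated_scratch_bytes` and its defaults are hypothetical and the real layout in the PR may differ.

```julia
# Rough estimate under a stated assumption: about `pages` buffers of
# isqrt(n) elements, each `elsize` bytes. Not derived from the PR's code.
estimated_scratch_bytes(n; elsize = 8, pages = 4) = pages * isqrt(n) * elsize

# A 1 GiB vector of 8-byte integers has 2^27 elements.
estimated_scratch_bytes(2^27) / 1024      # about 362 KiB

# The largest input in the review benchmark above (1419857 elements).
estimated_scratch_bytes(1419857) / 1024   # about 37 KiB
```

Under that assumption the estimate stays below the 400 KiB quoted above and is close to the 37.48 KiB that the review benchmark reports for PagedMergeSort on 1419857 elements.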