-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add two-array radix and sample sorts and a distributed sort #13347
Conversation
Each call to _Merge created two temporary arrays that together matched the portion of the Data array being processed. So each level of recursive call used its own set of arrays that combine to the size of Data. Instead, create one Scratch array at the outset (as in chapel-lang#13347).
9939144
to
a2c8e32
Compare
ec8a087
to
60d1637
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high-level, changes look good. Had a few trivial comments.
Implements a distributed sort
🎉
} | ||
*/ | ||
// TODO: These shallowCopy functions should handle Block,Cyclic arrays | ||
inline proc shallowCopy(ref A, dst, src, nElts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that should be provided as a standalone module, or do we expect this to be handled at the language level relatively soon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually know the future of it, presently...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue that captures the desire for a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not even sure what to put it an issue right now. For the moment, there is a TODO comment about moving it elsewhere.
|
||
proc getNumBuckets() { | ||
return numBuckets * (1 + equalBuckets:int); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had a Chapel linter, it would probably complain to you about inconsistent spacing (newlines) between function definitions.
|
||
if A._instance.isDefaultRectangular() { | ||
var size = (nElts:size_t)*c_sizeof(A.eltType); | ||
c_memcpy(c_ptrTo(A[dst]), c_ptrTo(A[src]), size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there were a possibility of overlap, this should use memmove
instead of memcpy
. But AFAICT this shallowCopy
is presently used only with size=1
.
Similarly below, if DstA
and SrcA
might be the same array. (Or slices or views of the same array.)
var start = tid * blockSize; | ||
var n = blockSize; | ||
if start + n > nElts then | ||
n = nElts - start; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nTasks > nElts + 1
, you'll end up with a negative n
here -- is that what's wrong?
Two array zero-copy mergesort [PR by @cassella - thanks! Reviewed/tested/merged by @mppf] Each call to `_Merge` created two temporary arrays that together matched the portion of the Data array being processed. So each level of recursive call used its own set of arrays that combine to the size of Data. Instead, create one Scratch array at the outset (as in #13347). Rather than copy our portion of Data into Scratch to start off each `_Merge()`, the recursive levels will alternate merging from Data into Scratch, and merging from Scratch into Data. Performance is a little hard to get a feel for on my laptop, but I think I'm seeing 15-20% improvement in mergesort. The variability decreased a lot too. I tested with array sizes 1000-1040 and 2020-2060 to hit different points near minlen, starting near an even and an odd power of 2. Those runs had a halt() at the "we'll never reach this point" comment, which was not hit.
Distributed sort improvements `modules/packages/Sort.chpl` includes two implementations of distributed sorting: * `Sort.TwoArrayRadixSort.twoArrayRadixSort` * `Sort.TwoArraySampleSort.twoArraySampleSort` Neither of these `Sort.TwoArray*` submodules are included by default (and so they have limited impact on compilation time). Additionally, they are not documented. However, they can provide more efficient sorting for Block-distributed arrays and they include code specifically for the distributed case. These sorts were added in PR #13347. See also https://chapel-lang.org/CHIUW/2019/Ferguson.pdf for more information about these TwoArray sorts. Note that twoArraySampleSort does not currently compile for distributed arrays. This PR takes several steps to improve the performance of distributed two-array sorting: * use two-array sorting for large subproblems within a locale (instead of calling msbRadixSort which is in-place and less parallel) * perform recursive sub-problems in the distributed sort in a distributed manner. Instead of having all locales participate in each distributed subproblem - assign each subproblem an "owner" and from there create tasks for all the locales involved in that subproblem. Since a given locale can be involved in two subproblems (when the bucket boundary for the subproblem is within the region of the array owned by that locale) we added state1 and state2 to keep separate counts in these cases. This change also involved storing the tasks (subproblems) in a nested structure - each subproblem contains a representation of which locales will be involved in it. * optimized the all-to-all communication of the counts This PR improves scalability of `twoArrayRadixSort` quite a bit. Thanks to @ronawho for reviewing and gathering performance information! - [x] full local testing - [x] test/library/packages/Sort with `CHPL_COMM=gasnet`
Sort.sort
Sort
now usesRandom
and that changes its behavior. See Calling functions in a module without any use #13523.Reviewed by @ben-albrecht - thanks!