Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add two-array radix and sample sorts and a distributed sort #13347

Merged
merged 7 commits into from
Jul 24, 2019

Conversation

mppf
Copy link
Member

@mppf mppf commented Jul 1, 2019

  • Factors bucketize process out into separate parts in the Sort module
  • Implements two-array radix and sample sorts
  • Implements a distributed sort
  • Adds a directory of sort tests in different languages for cross-language performance comparisons. (These are not run automatically).
  • Includes work towards a prototype implementation of an in-place parallel bucketizer from ips4o in Chapel (added as .notest just so we don't lose track of it)
  • Does not currently change the sort code that is run for a normal sort called in Sort.sort
  • Changes test/library/standard/Random/stonea/fillRandom.chpl into a future because Sort now uses Random and that changes its behavior. See Calling functions in a module without any use #13523.

Reviewed by @ben-albrecht - thanks!

  • full local testing

cassella added a commit to cassella/chapel that referenced this pull request Jul 11, 2019
Each call to _Merge created two temporary arrays that together matched
the portion of the Data array being processed.  So each level of
recursive call used its own set of arrays that combine to the size of
Data.

Instead, create one Scratch array at the outset (as in chapel-lang#13347).
@mppf mppf force-pushed the distributed-radix-sort branch from 9939144 to a2c8e32 Compare July 17, 2019 13:01
@mppf mppf force-pushed the distributed-radix-sort branch from ec8a087 to 60d1637 Compare July 23, 2019 14:15
Copy link
Member

@ben-albrecht ben-albrecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high-level, changes look good. Had a few trivial comments.

Implements a distributed sort

🎉

}
*/
// TODO: These shallowCopy functions should handle Block,Cyclic arrays
inline proc shallowCopy(ref A, dst, src, nElts) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that should be provided as a standalone module, or do we expect this to be handled at the language level relatively soon?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually know the future of it, presently...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue that captures the desire for a better way to do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even sure what to put it an issue right now. For the moment, there is a TODO comment about moving it elsewhere.


proc getNumBuckets() {
return numBuckets * (1 + equalBuckets:int);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had a Chapel linter, it would probably complain to you about inconsistent spacing (newlines) between function definitions.

@mppf mppf merged commit cacd1f4 into chapel-lang:master Jul 24, 2019
@mppf mppf deleted the distributed-radix-sort branch July 24, 2019 14:16

if A._instance.isDefaultRectangular() {
var size = (nElts:size_t)*c_sizeof(A.eltType);
c_memcpy(c_ptrTo(A[dst]), c_ptrTo(A[src]), size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were a possibility of overlap, this should use memmove instead of memcpy. But AFAICT this shallowCopy is presently used only with size=1.

Similarly below, if DstA and SrcA might be the same array. (Or slices or views of the same array.)

var start = tid * blockSize;
var n = blockSize;
if start + n > nElts then
n = nElts - start;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nTasks > nElts + 1, you'll end up with a negative n here -- is that what's wrong?

mppf added a commit that referenced this pull request Dec 12, 2019
Two array zero-copy mergesort

[PR by @cassella - thanks! Reviewed/tested/merged by @mppf]

Each call to `_Merge` created two temporary arrays that together matched
the portion of the Data array being processed.  So each level of
recursive call used its own set of arrays that combine to the size of
Data.
    
Instead, create one Scratch array at the outset (as in #13347).

Rather than copy our portion of Data into Scratch to start off each
`_Merge()`, the recursive levels will alternate merging from Data into
Scratch, and merging from Scratch into Data.

Performance is a little hard to get a feel for on my laptop, but I think
I'm seeing 15-20% improvement in mergesort.  The variability decreased a
lot too.


I tested with array sizes 1000-1040 and 2020-2060 to hit different points
near minlen, starting near an even and an odd power of 2.  Those runs had
a halt() at the "we'll never reach this point" comment, which was not
hit.
mppf added a commit that referenced this pull request Dec 15, 2020
Distributed sort improvements

`modules/packages/Sort.chpl` includes two implementations of distributed
sorting:
 * `Sort.TwoArrayRadixSort.twoArrayRadixSort`
 * `Sort.TwoArraySampleSort.twoArraySampleSort`

Neither of these `Sort.TwoArray*` submodules are included by default (and
so they have limited impact on compilation time). Additionally, they are
not documented. However, they can provide more efficient sorting for
Block-distributed arrays and they include code specifically for the
distributed case. These sorts were added in PR #13347. See also
https://chapel-lang.org/CHIUW/2019/Ferguson.pdf for more information
about these TwoArray sorts. Note that twoArraySampleSort does not
currently compile for distributed arrays.

This PR takes several steps to improve the performance of distributed
two-array sorting:
 * use two-array sorting for large subproblems within a locale (instead
   of calling msbRadixSort which is in-place and less parallel)
 * perform recursive sub-problems in the distributed sort in a
   distributed manner. Instead of having all locales participate in each
   distributed subproblem - assign each subproblem an "owner" and from
   there create tasks for all the locales involved in that subproblem.
   Since a given locale can be involved in two subproblems (when the
   bucket boundary for the subproblem is within the region of the array
   owned by that locale) we added state1 and state2 to keep separate
   counts in these cases. This change also involved storing the tasks
   (subproblems) in a nested structure - each subproblem contains a
   representation of which locales will be involved in it.
 * optimized the all-to-all communication of the counts

This PR improves scalability of `twoArrayRadixSort` quite a bit.

Thanks to @ronawho for reviewing and gathering performance information!

- [x] full local testing
- [x] test/library/packages/Sort with `CHPL_COMM=gasnet`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants