IndexSplit for hierarchical parallelism #981

Conversation
Very neat! This is intended to increase work per thread on GPUs and decrease the number of blocks -- how has this affected performance? Are we planning on using this in production code?
I could see applying the same k/j fusion to the inner thread loop to increase the amount of work at the finest level on GPUs in order to maximize occupancy. This might help performance for blocks smaller than 64~128.
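For reference, here is a minimal sketch of that inner-loop fusion written against raw Kokkos rather than the Parthenon wrappers; the names (nk, nj, ni, view) are placeholders and not from this PR:

```cpp
#include <Kokkos_Core.hpp>

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;

// Fuse j and i into a single inner thread loop so each team has nj*ni work
// items instead of ni, which can help occupancy when blocks are small.
void fused_inner(int nk, int nj, int ni, Kokkos::View<double ***> view) {
  Kokkos::parallel_for(
      "fused_ji_inner", team_policy(nk, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        const int k = member.league_rank();
        Kokkos::parallel_for(Kokkos::TeamThreadRange(member, nj * ni),
                             [&](const int ji) {
                               const int j = ji / ni;  // recover j from fused index
                               const int i = ji % ni;  // recover i from fused index
                               view(k, j, i) = static_cast<double>(k + j + i);
                             });
      });
}
```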
I've added some text to the documentation that might help make better decisions when choosing the number of teams and the work per team. I'm drawing from my previous experience with CUDA, so I don't know whether it lines up with your testing.
I also couldn't figure out how the code handles the case where nkp doesn't divide evenly into the k range, so that one team necessarily does less work than the others. It seems the code would just run over the loop bounds as is.
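To make the concern concrete, here is a small worked example with hypothetical numbers (a k range of 10 split across 4 teams; nothing here is taken from the PR): rounding the per-team count up covers the whole range, but the last team must be clamped or it runs past the intended bound.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  const int nk = 10, n_teams = 4;
  const int nk_per_team = (nk + n_teams - 1) / n_teams;  // = 3 (round up)
  for (int t = 0; t < n_teams; ++t) {
    const int ks = t * nk_per_team;                 // 0, 3, 6, 9
    const int ke = std::min(ks + nk_per_team, nk);  // 3, 6, 9, 10 (clamped)
    // Without the clamp, the last team would cover [9, 12) and overrun.
    std::printf("team %d: k in [%d, %d)\n", t, ks, ke);
  }
  return 0;
}
```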
Co-authored-by: forrestglines <[email protected]>
So if yes, then I'm satisfied.
@forrestglines I thought more about your big-picture question about excess work and how that functionally works out. I also discussed it briefly with @jdolence. The big picture is that mismatched sizes do result in one thread block (for example) doing extra work; there's no special fix for that. The right amount of work gets done, but it might not be distributed perfectly evenly. Obviously that's not ideal.
I agree! But I think I might have miscommunicated my concern. I don't see where in your code the range of k changes for this last team (or for some set of teams). I think this extra/less work is unaccounted for. I'm suggesting that IndexSplit has a bug that will cause segfaults by running over the intended bounds of k, not a performance concern.

I don't see this pattern tested in your unit test so I don't know if it works. I think this should be a test case in your PR before it is merged. It might already be working in your downstream code but I think there should be a test.

I've been busy at the KUG and fixing cylindrical coordinates in AthenaPK (for which I think I've finally found the last bug). If you want, I can write the test that I want to see when I'm back at home later today. Other than that, I'm very happy with the updated documentation and the updated choice of parallelization.
I don't see where in your code the range of k changes for this last team (or for some set of teams). I think this extra/less work is unaccounted for. I'm suggesting that IndexSplit has a bug that will cause segfaults by running over the intended bounds of k, not a performance concern.
Ah, I see... Actually, we accidentally eliminated this mechanism in review. The static cast (I highlight it below) is incorrect: that value should be a double. The static casting later, when the ranges are computed, is what distributes the extra work. (To be honest, I didn't fully understand this; I copied this code from @jdolence's implementation and missed some details.)
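For illustration, here is a minimal sketch of the mechanism described above, with made-up numbers and names (this is not the actual IndexSplit code): keeping the per-team count as a double and casting only when each team's bounds are formed spreads the remainder across teams instead of piling it onto one team or overrunning the range.

```cpp
#include <cstdio>

int main() {
  const int nk = 10, n_teams = 4;
  // Keep the per-team work count as a double (2.5 here) rather than an int.
  const double nk_per_team = static_cast<double>(nk) / n_teams;
  for (int t = 0; t < n_teams; ++t) {
    const int ks = static_cast<int>(t * nk_per_team);        // 0, 2, 5, 7
    const int ke = static_cast<int>((t + 1) * nk_per_team);  // 2, 5, 7, 10
    std::printf("team %d: k in [%d, %d)\n", t, ks, ke);      // sizes 2, 3, 2, 3
  }
  return 0;
}
```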
I don't see this pattern tested in your unit test so I don't know if it works. I think this should be a test case in your PR before it is merged. It might already be working in your downstream code but I think there should be a test.
That's fair. I added tests---let me know what you think.
This LGTM. I didn't carefully check the indexing logic, but I trust that the tests and Riot have caught any issues.
Co-authored-by: Luke Roberts <[email protected]>
I reviewed the new changes; they look good to me now! I very much appreciate the new tests.
…r_for_outer sums up the work for every member of the inner loop
Thanks @forrestglines @lroberts36 for your detailed reviews. Going to set this to auto-merge now.
PR Summary
Here I port in a tool written by @jdolence for balancing the competing GPU and CPU hardware constraints when using hierarchical parallelism. The IndexSplit class lets one automatically fuse different indices across the two levels of hierarchical parallelism: for example, put k and j in an outer loop and i in an inner loop, or k in the outer loop and j and i in the inner loop. I also add some documentation describing best practices for hierarchical parallelism that we gathered from our experience performance-tuning downstream codes.
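As an illustration of the first fusion pattern, here is a sketch written against raw Kokkos rather than the actual IndexSplit interface (which may differ); all names are placeholders. k and j are fused into the outer team index while i stays in the inner thread loop:

```cpp
#include <Kokkos_Core.hpp>

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;

// One team per (k, j) pair; each team's threads sweep the i range.
void fused_outer_kj(int nk, int nj, int ni, Kokkos::View<double ***> v) {
  Kokkos::parallel_for(
      "fused_kj_outer", team_policy(nk * nj, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        const int kj = member.league_rank();
        const int k = kj / nj;  // recover k from the fused outer index
        const int j = kj % nj;  // recover j from the fused outer index
        Kokkos::parallel_for(Kokkos::TeamThreadRange(member, ni),
                             [&](const int i) { v(k, j, i) = 1.0; });
      });
}
```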
PR Checklist