
IndexSplit for hierarchical parallelism #981

Merged
merged 37 commits into develop on Dec 19, 2023

Conversation

@Yurlungur (Collaborator) commented Dec 11, 2023

PR Summary

Here I port in a tool written by @jdolence for balancing GPU vs. CPU hardware constraints when using hierarchical parallelism. The IndexSplit class lets one automatically fuse different indices into the two levels of a hierarchical parallel loop: for example, put k and j in the outer loop and i in the inner loop, or k in the outer loop and j and i in the inner loop.

I also add some documentation describing best practices for hierarchical parallelism, drawn from our experience performance-tuning downstream codes.
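
For orientation, a minimal sketch of the intended usage pattern. The GetBoundsK/GetBoundsJ/GetInnerBounds calls match the usage discussed later in this thread; the constructor arguments (md, kb, jb, ib, nkp, njp), outer_size(), the scratch parameters, and the exact par_for_outer/par_for_inner signatures here are placeholders rather than the final API:

    // Sketch only (not the final API): fuse k and j into the outer team
    // loop, keep the fused (j, i) range in the inner loop.
    IndexSplit idx_sp(md, kb, jb, ib, nkp, njp);
    par_for_outer(
        DEFAULT_OUTER_LOOP_PATTERN, "fused kj loop", DevExecSpace(),
        scratch_size, scratch_level, 0, idx_sp.outer_size() - 1,
        KOKKOS_LAMBDA(team_mbr_t member, const int outer_idx) {
          const auto krange = idx_sp.GetBoundsK(outer_idx);
          const auto jrange = idx_sp.GetBoundsJ(outer_idx);
          const auto irange = idx_sp.GetInnerBounds(jrange); // fused (j, i)
          for (int k = krange.s; k <= krange.e; ++k) {
            par_for_inner(member, irange.s, irange.e, [&](const int idx) {
              // recover j and i from the fused index idx; do pointwise work
            });
          }
        });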

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (plus, optionally, an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur added the "documentation" and "enhancement" labels on Dec 11, 2023
@Yurlungur self-assigned this on Dec 11, 2023
@forrestglines (Collaborator) left a comment

Very neat! This is intended to increase the work per thread on GPUs and decrease the number of blocks -- how has this affected performance? Are we planning on using this in production code?

I could see applying the k/j loop to the inner thread loop to increase the amount of work on the finest level for GPUs, in order to maximize occupancy. This might help performance for blocks smaller than 64~128.

I've added some text to the documentation that might help make better decisions when choosing the number of teams and the work per team. I'm drawing from my previous experience with CUDA; I don't know whether it's in line with your testing.

I also couldn't figure out how the code handles the case where nkp doesn't divide the k range evenly, so that one team necessarily does less work than the others. It seems the code would just run over the loop bounds as is.

@forrestglines (Collaborator) commented

> Yeah, for k at least, the model is to manually loop over the leftover k indices, like so:
>
>     const auto krange = idx_sp.GetBoundsK(outer_idx);
>     const auto jrange = idx_sp.GetBoundsJ(outer_idx);
>     const auto irange = idx_sp.GetInnerBounds(jrange);
>
>     // Whatever part of k is not in the outer loop can be looped over
>     // with a normal for loop here
>     for (int k = krange.s; k <= krange.e; ++k) {
>       // ... inner parallel loop over the fused (j, i) range ...
>     }
>
> so the "spillover" gets distributed among all the teams. Does this answer your concern satisfactorily? (I assume that's also the remaining unresolved issue in the discussion?)

So krange.e - krange.s will not be equal across different outer_idx in this hypothetical case? For some outer_idx the difference will be smaller or greater?

If yes, then I'm satisfied.

@Yurlungur (Collaborator, Author) commented

@forrestglines I thought more about your big-picture question about excess work and how that plays out in practice. I also discussed it briefly with @jdolence.

The big picture is that mismatched sizes do result in one thread block (for example) doing extra work. There's no special fix for that. The right amount of work gets done, but it might not be distributed perfectly evenly. Obviously that's not ideal, but the IndexSplit machinery does let the user choose a sizing that is optimal for a given loop (and the optimum depends on the loop). My suggestion is that we document this shortcoming and keep in mind improved automation for choosing the right nkp and njp down the line. Would this work for you?
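
To make the potential imbalance concrete, some illustrative arithmetic (made-up numbers, and a naive fixed-size chunking rather than IndexSplit's actual scheme):

    // Illustrative only: naive fixed-size chunking of nk = 10 k-slices over
    // nkp = 4 teams uses chunk = ceil(10 / 4) = 3, so the teams cover
    // 3, 3, 3, and 1 slices; the last team does a third of the others' work.
    const int nk = 10, nkp = 4;
    const int chunk = (nk + nkp - 1) / nkp; // ceiling division = 3
    for (int team = 0; team < nkp; ++team) {
      const int ks = team * chunk;                        // 0, 3, 6, 9
      const int ke = (ks + chunk < nk) ? ks + chunk : nk; // 3, 6, 9, 10 (exclusive)
      // per-team work: ke - ks = 3, 3, 3, 1
    }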

@forrestglines (Collaborator) commented

> The big picture is that mismatched sizes do result in one thread block (for example) doing extra work.

I agree! But I think I might have miscommunicated my concern. I don't see where in your code the range of k changes for the last (or for some set) of teams. I think this extra/missing work is unaccounted for. I'm suggesting that IndexSplit has a bug that will cause segfaults by running over the intended bounds of k; this is not a performance concern.

I don't see this pattern tested in your unit tests, so I don't know whether it works. I think this should be a test case in your PR before it is merged. It might already be working in your downstream code, but I think there should be a test here.

I've been busy at the KUG and fixing cylindrical coordinates in AthenaPK (for which I think I've finally found the last bug). If you want, I can write the test that I want to see when I'm back at home later today.

Other than that, I'm very happy with the updated documentation and the updated choice of parallelization.

@Yurlungur (Collaborator, Author) left a comment

> I don't see where in your code the range of k changes for the last (or for some set) of teams. I think this extra/missing work is unaccounted for. I'm suggesting that IndexSplit has a bug that will cause segfaults by running over the intended bounds of k; this is not a performance concern.

Ah, I see... Actually, we accidentally eliminated this mechanism during review. The static cast (which I highlighted in an inline comment on src/utils/index_split.cpp) is incorrect: that value should be a double. The static casting later, when the ranges are computed, is what distributes the extra work. (To be honest, I didn't fully understand this part. I copied this code from @jdolence's implementation and missed some details.)
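
For intuition, here is a sketch of the mechanism (illustrative values and variable names, not the actual src/utils/index_split.cpp code): keeping the per-team width as a double and truncating only when each boundary is computed spreads the leftover iterations across teams.

    // Sketch: the width must stay a double; casting it to int up front is the
    // bug, since truncating per boundary is what distributes the remainder.
    const int nk = 10, nkp = 4;
    const double width = static_cast<double>(nk) / nkp; // 2.5
    for (int team = 0; team < nkp; ++team) {
      const int ks = static_cast<int>(team * width);       // 0, 2, 5, 7
      const int ke = static_cast<int>((team + 1) * width); // 2, 5, 7, 10 (exclusive)
      // per-team work: 2, 3, 2, 3; no single team absorbs all the leftover
    }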

> I don't see this pattern tested in your unit tests, so I don't know whether it works. I think this should be a test case in your PR before it is merged. It might already be working in your downstream code, but I think there should be a test here.

That's fair. I added tests; let me know what you think.

@lroberts36 (Collaborator) left a comment

This LGTM. I didn't carefully check the indexing logic, but I trust that the tests and Riot have caught any issues.

@forrestglines (Collaborator) left a comment

I reviewed the new changes; it looks good to me now! I very much appreciate the new tests.

@Yurlungur (Collaborator, Author) commented

Thanks @forrestglines and @lroberts36 for your detailed reviews. Going to set this to auto-merge now.

@Yurlungur merged commit 8631b8d into develop on Dec 19, 2023
49 checks passed