IndexSplit for hierarchical parallelism #981

Conversation
Very neat! This is intended to increase work per thread on GPUs and decrease the number of blocks -- how has this affected performance? Are we planning on using this in production code?
I could see applying the same k/j fusion to the inner thread loop to increase the amount of work at the finest level on GPUs in order to maximize occupancy. This might help performance for blocks smaller than 64~128.
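For reference, here is a minimal sketch of that inner-loop fusion written against raw Kokkos rather than the Parthenon wrappers; the names (nk, nj, ni, view) are placeholders and not from this PR:

```cpp
#include <Kokkos_Core.hpp>

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;

// Fuse j and i into a single inner thread loop so each team has nj*ni work
// items instead of ni, which can help occupancy when blocks are small.
void fused_inner(int nk, int nj, int ni, Kokkos::View<double ***> view) {
  Kokkos::parallel_for(
      "fused_ji_inner", team_policy(nk, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        const int k = member.league_rank();
        Kokkos::parallel_for(Kokkos::TeamThreadRange(member, nj * ni),
                             [&](const int ji) {
                               const int j = ji / ni;  // recover j from fused index
                               const int i = ji % ni;  // recover i from fused index
                               view(k, j, i) = static_cast<double>(k + j + i);
                             });
      });
}
```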
I've added some text to the documentation that might help make better decisions when choosing the number of teams and the work per team. I'm drawing from my previous experience with CUDA, so I don't know whether it lines up with your testing.
I also couldn't figure out how the code handles the case where nkp doesn't divide evenly into the k range, so that one team necessarily does less work than the others. It seems the code would just run over the loop bounds as is.
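To make the concern concrete, here is a small worked example with hypothetical numbers (a k range of 10 split across 4 teams; nothing here is taken from the PR): rounding the per-team count up covers the whole range, but the last team must be clamped or it runs past the intended bound.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  const int nk = 10, n_teams = 4;
  const int nk_per_team = (nk + n_teams - 1) / n_teams;  // = 3 (round up)
  for (int t = 0; t < n_teams; ++t) {
    const int ks = t * nk_per_team;                 // 0, 3, 6, 9
    const int ke = std::min(ks + nk_per_team, nk);  // 3, 6, 9, 10 (clamped)
    // Without the clamp, the last team would cover [9, 12) and overrun.
    std::printf("team %d: k in [%d, %d)\n", t, ks, ke);
  }
  return 0;
}
```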
Co-authored-by: forrestglines <[email protected]>
So if yes, then I'm satisfied.
@forrestglines I thought more about your big-picture question about excess work and how that functionally works out. I also discussed it briefly with @jdolence. The big picture is that mismatched sizes do result in one thread block (for example) doing extra work; there's no special fix for that. The right amount of work gets done, but it might not be distributed perfectly evenly. Obviously that's not ideal.
I agree! But I think I might have miscommunicated my concern. I don't see where in your code the range of k changes for this last team (or for some set of teams). I think this extra/less work is unaccounted for. I'm suggesting that IndexSplit has a bug that will cause segfaults by running over the intended bounds of k, not a performance concern.

I don't see this pattern tested in your unit test so I don't know if it works. I think this should be a test case in your PR before it is merged. It might already be working in your downstream code but I think there should be a test.

I've been busy at the KUG and fixing cylindrical coordinates in AthenaPK (for which I think I've finally found the last bug). If you want, I can write the test that I want to see when I'm back at home later today. Other than that, I'm very happy with the updated documentation and the updated choice of parallelization.
I don't see where in your code the range of k changes for this last team (or for some set of teams). I think this extra/less work is unaccounted for. I'm suggesting that IndexSplit has a bug that will cause segfaults by running over the intended bounds of k, not a performance concern.
Ah, I see... Actually, we accidentally eliminated this mechanism in review. The static cast (I highlight it below) is incorrect: that value should be a double. The static casting later, when the ranges are computed, is what distributes the extra work. (To be honest, I didn't fully understand this; I copied this code from @jdolence's implementation and missed some details.)
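For illustration, here is a minimal sketch of the mechanism described above, with made-up numbers and names (this is not the actual IndexSplit code): keeping the per-team count as a double and casting only when each team's bounds are formed spreads the remainder across teams instead of piling it onto one team or overrunning the range.

```cpp
#include <cstdio>

int main() {
  const int nk = 10, n_teams = 4;
  // Keep the per-team work count as a double (2.5 here) rather than an int.
  const double nk_per_team = static_cast<double>(nk) / n_teams;
  for (int t = 0; t < n_teams; ++t) {
    const int ks = static_cast<int>(t * nk_per_team);        // 0, 2, 5, 7
    const int ke = static_cast<int>((t + 1) * nk_per_team);  // 2, 5, 7, 10
    std::printf("team %d: k in [%d, %d)\n", t, ks, ke);      // sizes 2, 3, 2, 3
  }
  return 0;
}
```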
I don't see this pattern tested in your unit test so I don't know if it works. I think this should be a test case in your PR before it is merged. It might already be working in your downstream code but I think there should be a test.
That's fair. I added tests---let me know what you think.
This LGTM. I didn't carefully check the indexing logic, but I trust that the tests and Riot have caught any issues.
Co-authored-by: Luke Roberts <[email protected]>
I reviewed the new changes; they look good to me now! I very much appreciate the new tests.
…r_for_outer sums up the work for every member of the inner loop
Thanks @forrestglines @lroberts36 for your detailed reviews. Going to set this to auto-merge now.
PR Summary
Here I port in a tool written by @jdolence for balancing the competing GPU and CPU hardware constraints when using hierarchical parallelism. The IndexSplit class lets one automatically fuse different indices across the two levels of hierarchical parallelism: for example, put k and j in an outer loop and i in an inner loop, or k in the outer loop and j and i in the inner loop. I also add some documentation describing best practices for hierarchical parallelism that we gathered from our experience performance-tuning downstream codes.
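As an illustration of the first fusion pattern, here is a sketch written against raw Kokkos rather than the actual IndexSplit interface (which may differ); all names are placeholders. k and j are fused into the outer team index while i stays in the inner thread loop:

```cpp
#include <Kokkos_Core.hpp>

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;

// One team per (k, j) pair; each team's threads sweep the i range.
void fused_outer_kj(int nk, int nj, int ni, Kokkos::View<double ***> v) {
  Kokkos::parallel_for(
      "fused_kj_outer", team_policy(nk * nj, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &member) {
        const int kj = member.league_rank();
        const int k = kj / nj;  // recover k from the fused outer index
        const int j = kj % nj;  // recover j from the fused outer index
        Kokkos::parallel_for(Kokkos::TeamThreadRange(member, ni),
                             [&](const int i) { v(k, j, i) = 1.0; });
      });
}
```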
PR Checklist