Optimise the gridDim.n * blockDim.m idiom #1468
+@asroy +@deven-amd for awareness.
Although this PR doesn't solve the fundamental issue with hipGridDim_x/y/z, it does address a common pattern seen across all Eigen kernels on ROCm.
Not quite sure what the fundamental problem is, if I'm honest. If it's the integer division required to get the number of groups, we can probably address that one fairly easily too, although I suspect that this idiom is far more commonly encountered.
Yes, the fundamental issue is the integer division used in the implementation of https://llvm.org/docs/AMDGPUUsage.html#initial-kernel-execution-state. Indeed, with this PR merged, it should already fix a lot of inefficient code in many HIP kernels, as the idiom is used literally everywhere.
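For anyone following along, here is a minimal host-side C++ sketch of why the `gridDim.n * blockDim.m` idiom can be folded. It is not the code in this PR. The `DispatchDim` struct and the function names are hypothetical stand-ins, and the sketch assumes the grid size is a whole multiple of the workgroup size, so that the division is exact.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one dimension of a kernel dispatch:
// the total number of work-items and the number per workgroup.
struct DispatchDim {
    std::uint32_t grid_size;       // total work-items in this dimension
    std::uint32_t workgroup_size;  // work-items per workgroup (the blockDim analogue)
};

// gridDim analogue: the number of workgroups, obtained by integer division.
std::uint32_t grid_dim(const DispatchDim& d) {
    return d.grid_size / d.workgroup_size;
}

// The common idiom: gridDim * blockDim. Written naively, this pays for
// the division above and then multiplies the divisor straight back in.
std::uint32_t total_threads_naive(const DispatchDim& d) {
    return grid_dim(d) * d.workgroup_size;
}

// Folded form: read the grid size directly. Equivalent to the naive form
// only under the assumption that grid_size is a multiple of workgroup_size.
std::uint32_t total_threads_folded(const DispatchDim& d) {
    return d.grid_size;
}
```

With a grid of 1024 work-items and workgroups of 256, both forms yield 1024, but the folded one needs neither the division nor the multiplication.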
Is it really the fundamental problem? In a separate context, I thought you said that the performance impact is minimal.
Yeah, I don't know how fundamental that one is, but "fixing" it should be straightforward. Let us discuss separately.
@b-sumner the performance impact was small for the kernels I was studying at the time, so I wasn't pushing for a change. This PR should nicely optimize away the issues found in those kernels. But there are other kernels from other workloads which directly access
@whchung I agree it is a problem if the division is not lifted out of a performance-critical loop. But the fundamental issue then is not the specifics of the computation, but rather that an invariant computation is not being identified and lifted out of that loop.
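To make the point about invariant computations concrete, here is a small C++ sketch of the transformation being described. It runs on the host and is only an illustration: the function names and the accumulation are hypothetical, and nothing here is taken from the kernels under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The division does not depend on the loop variable, yet it is written
// inside the loop. If the compiler fails to recognise it as invariant,
// it is recomputed on every iteration.
std::uint64_t scale_sum_in_loop(const std::vector<std::uint32_t>& data,
                                std::uint32_t grid_size,
                                std::uint32_t workgroup_size) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::uint32_t num_groups = grid_size / workgroup_size;  // loop-invariant
        acc += static_cast<std::uint64_t>(data[i]) * num_groups;
    }
    return acc;
}

// The same computation with the invariant division lifted out by hand:
// one division in total, regardless of the trip count.
std::uint64_t scale_sum_hoisted(const std::vector<std::uint32_t>& data,
                                std::uint32_t grid_size,
                                std::uint32_t workgroup_size) {
    const std::uint32_t num_groups = grid_size / workgroup_size;  // computed once
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        acc += static_cast<std::uint64_t>(data[i]) * num_groups;
    }
    return acc;
}
```

Both functions return the same result; the only difference is how many times the division executes, which is the cost at issue when the hoisting does not happen automatically.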
@b-sumner perhaps we are interpreting the term "fundamental issue" differently. To me the "fundamental issue" is that reading Should As a side note, we've started to upstream our work on MLIR, and the very first step is to enable integrating ROCm-Device-Libs into MLIR, so that functions written in the GPU dialect can be properly indexed on ROCm with ROCm-Device-Libs.
@whchung I think you're observing side effects of the base or "fundamental" problem, which has not been identified. Using readfirstlane in the device libs would simply be a workaround that could affect other optimizations, and I would oppose it unless it is found necessary after the fundamental problem has been identified.
@b-sumner I agree that we should postpone this discussion until we better understand the nature of these new kernels, which don't follow the idiom fixed in this PR.
LGTM