Conversation

@AlexVlx
Contributor

@AlexVlx AlexVlx commented Sep 26, 2019

No description provided.

@AlexVlx AlexVlx changed the title Divide et impera is not always ideal. Optimise the gridDim.n * blockDim.m idiom Sep 26, 2019
@AlexVlx AlexVlx requested a review from whchung September 26, 2019 15:10
Contributor

@whchung whchung left a comment


+@asroy +@deven-amd for awareness.

Although this PR doesn't solve the fundamental issue with hipGridDim_x/y/z, it does address a common pattern seen across all Eigen kernels on ROCm.

@AlexVlx
Contributor Author

AlexVlx commented Sep 26, 2019

Not quite sure what the fundamental problem is, if I'm honest. If it's the integer division required to get the number of groups, we can probably address that fairly easily too, although I suspect this idiom is far more commonly encountered.

@whchung
Contributor

whchung commented Sep 26, 2019

Yes, the fundamental issue is the integer division used in the implementation of __ockl_get_num_groups, and also the apparent omission of enable_sgpr_grid_workgroup_count_X/Y/Z.

https://llvm.org/docs/AMDGPUUsage.html#initial-kernel-execution-state

Indeed, with this PR merged it should already address a lot of inefficient code in many HIP kernels, as the idiom is used virtually everywhere.

@b-sumner
Contributor

Is it really the fundamental problem? In a separate context, I thought you said that the performance impact is minimal.

@AlexVlx
Contributor Author

AlexVlx commented Sep 26, 2019

Yeah, I don't know how fundamental that one is, but "fixing" it should be straightforward. Let's discuss separately.

@whchung
Contributor

whchung commented Sep 26, 2019

@b-sumner the performance impact was small for the kernels I was studying at the time, so I wasn't pushing for a change, and this PR should nicely optimize away the issues found in those kernels.

But there are other kernels from other workloads which directly access hipGridDim_x/y/z in performance-critical loops without following this idiom. It's hard to quantify the potential impact without truly diving down into the ISA for those kernels, so I'm not requesting a change either, yet. We might revisit this issue in a couple of weeks when we have a better understanding of these kernels.

@b-sumner
Contributor

@whchung I agree it is a problem if the division is not lifted out of a performance critical loop. But the fundamental issue then is not the specifics of the computation, but rather that an invariant computation is not being identified and lifted out of that loop.

@whchung
Contributor

whchung commented Sep 26, 2019

@b-sumner perhaps we are interpreting the term "fundamental issue" differently.

To me the "fundamental issue" is that reading hipGridDim_x/y/z involves an integer division and leaves derived values in VGPRs, not SGPRs. All other indexing values in HIP/ROCm-Device-Libs are nicely stored in SGPRs.

If v_readlane were used in the ROCm-Device-Libs implementation to put the hipGridDim_x/y/z values back into SGPRs, we might be able to keep most derived values in SGPRs, which could reduce instruction count and VGPR usage.


As a side note, we've started upstreaming our work on MLIR, and the very first step is to enable integrating ROCm-Device-Libs into MLIR, so that functions written in the GPU dialect can be properly indexed on ROCm with ROCm-Device-Libs.
tensorflow/mlir#146

@b-sumner
Contributor

@whchung I think you're observing side effects of the base or "fundamental" problem, which has not been identified. Using readfirstlane in the device libs would simply be a workaround that could affect other optimizations, and I would oppose it unless it is found necessary after the fundamental problem has been identified.

@whchung
Contributor

whchung commented Sep 26, 2019

@b-sumner I agree we should postpone this discussion until we further understand the nature of the new kernels that don't follow the idiom fixed in this PR.

Contributor

@bensander bensander left a comment


LGTM

@mangupta mangupta added the pr:needs_updates PR initially approved. But needs rework label Sep 30, 2019
@mangupta mangupta merged commit ab8fe8a into master Sep 30, 2019
@mangupta mangupta deleted the feature_better_grid_dim branch September 30, 2019 05:11
