changed 128 items per CTA to 256 items #13
Conversation
@amathews-amd @dllehr-amd Thanks for submitting the PR. Can you explain why this would improve the performance of TBE? In which cases does this change help? Does it improve TBE performance on both AMD GPUs and NVIDIA GPUs? Thanks
Hi @sryap, this change is specific to AMD GPUs because of their thread organization: an execution unit consists of 64 threads on AMD GPUs, versus 32 threads (called a warp) on NVIDIA GPUs. Therefore, 256 elements are processed by each execution unit on AMD GPUs. This reduces the size of the accumulator shown here: https://github.com/ROCmSoftwarePlatform/FBGEMM/blob/main/fbgemm_gpu/codegen/embedding_forward_split_template.cu#L238. Most importantly, setting items_per_warp to 256 avoids unnecessary branches and computation; see https://github.com/ROCmSoftwarePlatform/FBGEMM/blob/main/fbgemm_gpu/codegen/embedding_forward_split_template.cu#L283-L284. Please let me know if you have any questions. Thank you.
@sryap BTW, "CTA" in the PR title should be replaced with "warp" (or "basic unit of execution") to be more precise.
Thanks for your explanation @weihanmines. This will reduce the number of registers allocated in the thread block, but I don't think it will reduce the number of branches. Do you agree that the performance improvement comes from the reduced register count, which allows more thread blocks to be pipelined on a CU? If so, can you please help me verify this (e.g., by collecting some performance counters)? If not, can you please show some evidence to support your theory (other than just the improved TBE time)? Since this change is specific to AMD GPUs, can you please wrap it in a ROCm-specific guard? Another ask: could you please revise the description of the PR to be more informative? Thank you very much!
Yes, it will reduce the number of registers used. I am not sure how many registers are allocated on AMD GPUs; I could not find a compilation option that reports it. We have planned to profile the new code with MIPerf to look at the new metrics, but the server was unavailable for some reason over the past week. It does reduce branches and computation; please take a closer look at the link in my previous reply.
No worries. Please share your findings when you have them. I think it will change the number of branches, because there are two conditions in that loop.
Hi @sryap, we have confirmed that occupancy stays the same with and without the change, so I don't think the performance gain comes from reduced register usage. The reason should be fewer branches and less computation, I believe.
… and optimize for ROCm (#1240) Summary: Make weihanmines's PR ROCm#13 upstreamable. sryap, would you please review the PR and consider converting it to a draft? Thank you. Pull Request resolved: #1240 Reviewed By: sryap Differential Revision: D38507621 Pulled By: shintaro-iwasaki fbshipit-source-id: 5b4532c0e79ce49a2f93c2a455a6392a1c7c2f16
merged in pytorch#1240