
changed 128 items per CTA to 256 items #13

Closed · weihanmines wants to merge 1 commit
Conversation

weihanmines

No description provided.

sryap commented Aug 4, 2022

@amathews-amd @dllehr-amd Thanks for submitting the PR. Can you explain why this change improves the performance of TBE, and in which cases? Does it improve TBE performance on both AMD GPUs and NVIDIA GPUs? Thanks

weihanmines (Author)

Hi @sryap, this change is specific to AMD GPUs because of their thread organization: an execution unit (wavefront) consists of 64 threads on AMD GPUs, whereas a warp on NVIDIA GPUs consists of 32 threads. With this change, each execution unit on AMD GPUs processes 256 elements. This reduces the size of the accumulator array shown here: https://github.com/ROCmSoftwarePlatform/FBGEMM/blob/main/fbgemm_gpu/codegen/embedding_forward_split_template.cu#L238. Most importantly, setting `items_per_warp` to 256 avoids unnecessary branches and computation; see https://github.com/ROCmSoftwarePlatform/FBGEMM/blob/main/fbgemm_gpu/codegen/embedding_forward_split_template.cu#L283-L284. Please let me know if you have any questions. Thank you.
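For concreteness, here is a simplified, self-contained stand-in for the loop in question (`accumulate_row`, `weights`, and `idx_weight` are illustrative names; the real generated code uses FBGEMM's `Vec4T` accumulators and weight rows, so this is a paraphrase rather than verbatim source):

```cuda
#include <hip/hip_runtime.h>

// Simplified stand-in for the generated TBE forward loop (not verbatim
// FBGEMM source). Each thread accumulates 4 consecutive floats per
// iteration; kMaxVecsPerThread = ceil(max_D / (4 * kWarpSize)), so doubling
// the items handled per execution unit (kWarpSize 32 -> 64 on AMD) halves
// the accumulator array and the registers backing it.
template <int kMaxVecsPerThread, int kWarpSize>
__device__ void accumulate_row(const float* weights, float idx_weight, int D,
                               float4* accumulators) {
  #pragma unroll
  for (int32_t i = 0;
       i < kMaxVecsPerThread && 4 * kWarpSize * i + threadIdx.x * 4 < D;
       ++i) {
    int32_t d = 4 * kWarpSize * i + threadIdx.x * 4;
    // Vectorized 4-element load (assumes 16-byte aligned rows).
    const float4 w = *reinterpret_cast<const float4*>(weights + d);
    accumulators[i].x += idx_weight * w.x;
    accumulators[i].y += idx_weight * w.y;
    accumulators[i].z += idx_weight * w.z;
    accumulators[i].w += idx_weight * w.w;
  }
}
```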

weihanmines (Author)

@sryap BTW, "CTA" in the PR title should be replaced with "warp" (or "basic unit of execution") to be more precise.

sryap commented Aug 4, 2022

Thanks for your explanation @weihanmines. This will reduce the number of registers allocated in the thread block, but I don't think it will reduce the number of branches. Do you agree that the performance improvement comes from the reduced register count, which allows more thread blocks to be pipelined on a CU? If so, can you please help me with the analysis to verify that this is true (e.g., by collecting some performance counters)? If not, can you please show some evidence to support your theory (other than just the improved TBE time)?

Since this change is specific to AMD GPUs, can you please wrap it in `#ifdef __HIP_PLATFORM_HCC__`?
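Something along these lines would work (an illustrative sketch only; in the actual codebase the value is emitted by the codegen templates, and the constant name here is hypothetical):

```cuda
#include <cstdint>

// Hypothetical guard; the real value comes from FBGEMM's codegen templates.
#ifdef __HIP_PLATFORM_HCC__
constexpr int32_t kItemsPerWarp = 256;  // 64-thread wavefront * 4 elements/thread
#else
constexpr int32_t kItemsPerWarp = 128;  // 32-thread warp * 4 elements/thread
#endif
```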

Another ask: could you please revise the description of the PR to be more informative?

Thank you very much!

weihanmines (Author) commented Aug 4, 2022

Yes, it will reduce the number of registers being used. I am not sure how many registers are allocated on AMD GPUs; I could not find a compilation option to report that. We have planned to use MIPerf to profile the new code and look at the new metrics, but the server has been unavailable for the past week. The change will also reduce branches and computation; please take a closer look at the link I sent in my previous reply.
We already have commits that address your suggestion; they are waiting to be merged. I will write up a more detailed description once the profiling data is ready.
Thanks.
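(As an aside on the register question: register usage can also be queried at runtime through the HIP attribute API, independent of compiler flags. A minimal sketch with a placeholder kernel, not the actual TBE kernel:)

```cuda
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}

int main() {
  // Query per-thread register usage of the compiled kernel.
  hipFuncAttributes attr{};
  if (hipFuncGetAttributes(&attr, reinterpret_cast<const void*>(dummy_kernel)) ==
      hipSuccess) {
    printf("registers per thread: %d\n", attr.numRegs);
  }
  return 0;
}
```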

sryap commented Aug 4, 2022

No worries. Please share your findings when you have them.

I don't think it will change the number of branches, because there are two conditions: `i < kMaxVecsPerThread` and `4 * kWarpSize * i + threadIdx.x * 4 < D`. The number of loop iterations is controlled by `4 * kWarpSize * i + threadIdx.x * 4 < D`; I think `i < kMaxVecsPerThread` is there as a correctness check. For example, for D=256 the number of branches is constant irrespective of the `kMaxVecsPerThread` value. Please correct me if I'm wrong.
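A quick host-side sanity check of that claim (the loop mirrors the kernel's two conditions; this is a hypothetical snippet, not FBGEMM code):

```cuda
#include <cstdio>

// Count how many times the loop body runs for one thread, mirroring the
// two conditions in the kernel loop.
int iterations(int kMaxVecsPerThread, int kWarpSize, int D, int tid) {
  int count = 0;
  for (int i = 0; i < kMaxVecsPerThread && 4 * kWarpSize * i + tid * 4 < D; ++i)
    ++count;
  return count;
}

int main() {
  // For D = 256 on a 64-wide execution unit, the iteration count is the
  // same whether kMaxVecsPerThread is 8 or 4: the D condition exits first.
  printf("%d %d\n", iterations(8, 64, 256, 0),
                    iterations(4, 64, 256, 0));  // prints: 1 1
  return 0;
}
```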

weihanmines (Author)

Hi @sryap, we have confirmed that occupancy stays the same with and without the change, so I don't think the performance gain comes from fewer registers. I believe it comes instead from fewer branches and less computation.
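(For reference, a minimal sketch of how resident blocks per CU can be checked at runtime with the HIP occupancy API; the kernel and launch parameters are placeholders, not the actual TBE configuration:)

```cuda
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}

int main() {
  int max_blocks = 0;
  // Max resident thread blocks per CU for a 256-thread block and
  // 0 bytes of dynamic LDS.
  hipOccupancyMaxActiveBlocksPerMultiprocessor(
      &max_blocks, reinterpret_cast<const void*>(dummy_kernel), 256, 0);
  printf("max active blocks per CU: %d\n", max_blocks);
  return 0;
}
```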

facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request on Aug 29, 2022
… and optimize for ROCm (#1240)

Summary:
Make weihanmines's PR ROCm#13 upstreamable.
sryap, would you please review the PR and consider converting it to a draft? Thank you.

Pull Request resolved: #1240

Reviewed By: sryap

Differential Revision: D38507621

Pulled By: shintaro-iwasaki

fbshipit-source-id: 5b4532c0e79ce49a2f93c2a455a6392a1c7c2f16
amathews-amd

merged in pytorch#1240
