changed 128 items per CTA to 256 items #13
Conversation
@amathews-amd @dllehr-amd Thanks for submitting the PR. Can you explain why this would improve the performance of TBE? In which cases does this change help? Does it improve TBE performance on both AMD GPUs and NVIDIA GPUs? Thanks
Hi @sryap, this change is specific to AMD GPUs because of their thread organization: an execution unit consists of 64 threads on AMD GPUs, versus 32 threads (called a warp) on NVIDIA GPUs. Therefore, 256 elements are processed by each execution unit on AMD GPUs. This reduces the size of the accumulator shown here: https://github.com/ROCmSoftwarePlatform/FBGEMM/blob/main/fbgemm_gpu/codegen/embedding_forward_split_template.cu#L238. Most importantly, setting items_per_warp to 256 avoids unnecessary branches and computation; see https://github.com/ROCmSoftwarePlatform/FBGEMM/blob/main/fbgemm_gpu/codegen/embedding_forward_split_template.cu#L283-L284. Please let me know if you have any questions. Thank you.
@sryap BTW, "CTA" in the PR title should be replaced with "warp" (or "basic unit of execution") to be more precise.
Thanks for your explanation @weihanmines. This will reduce the number of registers allocated in the thread block, but I don't think it will reduce the number of branches. Do you agree that the performance improvement comes from the reduced register count, which allows more thread blocks to be pipelined on a CU? If so, can you please help me verify this (e.g., by collecting some performance counters)? If not, can you please show some evidence to support your theory (other than just the improved TBE time)? Since this change is specific to AMD GPUs, can you please wrap it in a ROCm-specific guard? Another ask: could you please revise the description of the PR to be more informative? Thank you very much!
Yes, it will reduce the number of registers used. I am not sure how many registers are allocated on AMD GPUs; I could not find a compilation option that reports it. We have planned to profile the new code with MIPerf to look at the new metrics, but the server was unavailable for some reason over the past week. It does reduce branches and computation; please take a closer look at the link in my previous reply.
No worries. Please share your findings when you have them. I think it will change the number of branches, because there are two conditions in that loop.
Hi @sryap, we have confirmed that occupancy stays the same with and without the change, so I don't think the performance gain comes from reduced register usage. The reason should be fewer branches and less computation, I believe.
… and optimize for ROCm (#1240) Summary: Make weihanmines's PR ROCm#13 upstreamable. sryap, would you please review the PR and consider converting it to a draft? Thank you. Pull Request resolved: #1240 Reviewed By: sryap Differential Revision: D38507621 Pulled By: shintaro-iwasaki fbshipit-source-id: 5b4532c0e79ce49a2f93c2a455a6392a1c7c2f16
merged in pytorch#1240