Conversation

@MasterJH5574
Contributor

PR #15327 introduces warp-level primitive support in multi-warp allreduce. However, because of the two-stage shuffle-down reduction used to implement allreduce in the multi-warp scenario, PR #15327 did not broadcast the allreduce result to every reduction thread. This behavior does not align with the semantics of allreduce and is not ideal for many use cases. Therefore, this PR completes the implementation by inserting a stage that writes the reduction result to shared memory, so that every reduction thread across all reduction warps can access it.

This shared memory write-back stage is only inserted in the multi-warp allreduce case. In single-warp allreduce, a `shfl_sync` is used to broadcast the reduction result across the reduction threads. Since warp-level primitives cannot broadcast a value across warps in the multi-warp setting, we have to use shared memory instead.
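
For illustration, here is a minimal hand-written CUDA sketch of the multi-warp pattern described above. This is not the code TVM generates; the kernel and variable names (`block_allreduce_sum`, `staging`) are made up, and it assumes the block size is a multiple of the warp size 32 and at most 1024 threads.

```cuda
#include <cuda_runtime.h>

// Block-wide sum allreduce: every thread of the block ends up holding the sum
// of the values contributed by all threads of the block.
__global__ void block_allreduce_sum(const float* in, float* out) {
  __shared__ float staging[32];        // one slot per warp (<= 32 warps/block)
  const unsigned FULL_MASK = 0xffffffffu;
  int tid = threadIdx.x;
  int lane = tid & 31;                 // lane index within the warp
  int warp = tid >> 5;                 // warp index within the block
  int num_warps = blockDim.x >> 5;

  float val = in[blockIdx.x * blockDim.x + tid];

  // Stage 1: shuffle-down reduction inside each warp.
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(FULL_MASK, val, offset);

  // Each warp's lane 0 now holds the warp's partial sum; stage it in shared memory.
  if (lane == 0) staging[warp] = val;
  __syncthreads();

  // Stage 2: the first warp reduces the per-warp partial sums.
  if (warp == 0) {
    float v = (lane < num_warps) ? staging[lane] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)
      v += __shfl_down_sync(FULL_MASK, v, offset);
    // The write-back stage this PR adds: store the final result to shared
    // memory so that every thread of every warp can read it back.
    if (lane == 0) staging[0] = v;
  }
  __syncthreads();

  // Broadcast: all reduction threads observe the same result.
  out[blockIdx.x * blockDim.x + tid] = staging[0];
}
```

A `shfl_sync`-style broadcast only works within a single warp, which is why the final result has to round-trip through shared memory once more than one warp participates in the reduction.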

The numerical correctness is verified locally.

@tvm-bot
Collaborator

tvm-bot commented Jul 21, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@MasterJH5574
Contributor Author

cc @yzh119 @tqchen

Member

@junrushao left a comment

LGTM! I was curious if there's any performance implication of this change?

@MasterJH5574
Contributor Author

MasterJH5574 commented Jul 21, 2023

> LGTM! I was curious if there's any performance implication of this change?

@junrushao I didn't measure. For platforms like CUDA with a warp size of 32, the additional shared memory holds at most 16 elements, and I assume this overhead is negligible. Nevertheless, in multi-warp reduction settings, the current implementation that leverages warp-level primitives should be at least no slower than the state before #15327, which allocated a large shared memory buffer and ran a naive allreduce over shared memory.

On the other hand, to fulfill the semantics of allreduce, we have to accept this extra shared memory usage here.

MasterJH5574 force-pushed the tvm-dev/2023-07-20-allreduce-broadcast branch from cd1bcc4 to cd0363b on July 21, 2023 07:51
MasterJH5574 force-pushed the tvm-dev/2023-07-20-allreduce-broadcast branch from cd0363b to 38062eb on July 21, 2023 16:25
tqchen merged commit 5029477 into apache:main on Jul 22, 2023
MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
tqchen pushed a commit that referenced this pull request Jul 25, 2023
PR #15327 and #15373 introduced the multi-warp allreduce implementation. At the time of the introduction, I tested the numerical correctness with the workload "take a matrix of ones as input and compute the summation over each row". Both PRs passed this numerical test, but I did not realize that the test is incomplete and cannot guarantee correctness.

The previous implementation has a bug that can be exposed by turning the input matrix from ones into random floating-point numbers.

Therefore, this PR fixes the issue and adds numerical tests for multi-warp allreduce to `test_allreduce_cuda.py`. By removing some redundant tests in that file, we hope to reduce the testing time a bit while still guaranteeing correctness.

Sorry for not testing the implementation completely before.
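
As a side note on why the all-ones workload is too weak: on an all-ones row, any reduction that merely produces "number of elements" looks correct, even if it combines the wrong lanes. The toy host-side example below (purely hypothetical; it has nothing to do with the actual TVM kernels or `test_allreduce_cuda.py`) shows a "buggy" sum that adds the first half of the row twice; it passes the all-ones check but fails immediately on random data.

```cuda
// Toy illustration only (host-side code): a hypothetical buggy reduction that
// sums the first half of a row twice. On an all-ones row of length n it still
// returns n, so an all-ones test cannot catch it; on random data the mismatch
// with a reference sum is obvious.
#include <cstdio>
#include <cstdlib>
#include <vector>

float buggy_sum(const std::vector<float>& row) {
  float s = 0.0f;
  for (size_t i = 0; i < row.size() / 2; ++i) s += 2.0f * row[i];  // wrong elements
  return s;
}

float reference_sum(const std::vector<float>& row) {
  float s = 0.0f;
  for (float x : row) s += x;
  return s;
}

int main() {
  const int n = 128;
  std::vector<float> ones(n, 1.0f), rnd(n);
  for (float& x : rnd) x = static_cast<float>(std::rand()) / RAND_MAX;

  // The all-ones input cannot distinguish the two; random input can.
  std::printf("all-ones: buggy = %g, reference = %g\n",
              buggy_sum(ones), reference_sum(ones));
  std::printf("random:   buggy = %g, reference = %g\n",
              buggy_sum(rnd), reference_sum(rnd));
  return 0;
}
```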