Conversation

@MasterJH5574
Contributor

PR #15327 introduces warp-level primitive support in multi-warp allreduce. However, because of the two-stage shuffle-down reduction used to implement allreduce in the multi-warp scenario, PR #15327 did not broadcast the allreduce result to every reduction thread. This behavior does not align with the semantics of allreduce and is not ideal for many use cases. Therefore, this PR completes the implementation by inserting a stage that writes the reduction result to shared memory, so that every reduction thread across all reduction warps can access it.

This shared memory write-back stage is only inserted in the multi-warp allreduce case. In single-warp allreduce, a `shfl_sync` is used to broadcast the reduction result across the reduction threads. Since warp-level primitives cannot broadcast a value across warps in the multi-warp setting, we have to use shared memory instead.
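
For illustration, here is a minimal hand-written CUDA sketch of the multi-warp pattern described above. This is not the code TVM generates; the kernel and variable names (`block_allreduce_sum`, `staging`) are made up, and it assumes the block size is a multiple of the warp size 32 and at most 1024 threads.

```cuda
#include <cuda_runtime.h>

// Block-wide sum allreduce: every thread of the block ends up holding the sum
// of the values contributed by all threads of the block.
__global__ void block_allreduce_sum(const float* in, float* out) {
  __shared__ float staging[32];        // one slot per warp (<= 32 warps/block)
  const unsigned FULL_MASK = 0xffffffffu;
  int tid = threadIdx.x;
  int lane = tid & 31;                 // lane index within the warp
  int warp = tid >> 5;                 // warp index within the block
  int num_warps = blockDim.x >> 5;

  float val = in[blockIdx.x * blockDim.x + tid];

  // Stage 1: shuffle-down reduction inside each warp.
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(FULL_MASK, val, offset);

  // Each warp's lane 0 now holds the warp's partial sum; stage it in shared memory.
  if (lane == 0) staging[warp] = val;
  __syncthreads();

  // Stage 2: the first warp reduces the per-warp partial sums.
  if (warp == 0) {
    float v = (lane < num_warps) ? staging[lane] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)
      v += __shfl_down_sync(FULL_MASK, v, offset);
    // The write-back stage this PR adds: store the final result to shared
    // memory so that every thread of every warp can read it back.
    if (lane == 0) staging[0] = v;
  }
  __syncthreads();

  // Broadcast: all reduction threads observe the same result.
  out[blockIdx.x * blockDim.x + tid] = staging[0];
}
```

A `shfl_sync`-style broadcast only works within a single warp, which is why the final result has to round-trip through shared memory once more than one warp participates in the reduction.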

The numerical correctness is verified locally.

@tvm-bot
Collaborator

tvm-bot commented Jul 21, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@MasterJH5574
Contributor Author

cc @yzh119 @tqchen

Member

@junrushao left a comment

LGTM! I was curious if there's any performance implication of this change?

@MasterJH5574
Contributor Author

MasterJH5574 commented Jul 21, 2023

> LGTM! I was curious if there's any performance implication of this change?

@junrushao I didn't measure. For platforms like CUDA with a warp size of 32, the additional shared memory holds at most 16 elements, and I assume this overhead is negligible. Nevertheless, in multi-warp reduction settings, the current implementation that leverages warp-level primitives should be at least no slower than the state before #15327, which allocated a large shared memory buffer and ran a naive allreduce over shared memory.

On the other hand, to fulfill the semantics of allreduce, we have to accept this extra shared memory usage here.

MasterJH5574 force-pushed the tvm-dev/2023-07-20-allreduce-broadcast branch from cd1bcc4 to cd0363b on July 21, 2023 07:51
MasterJH5574 force-pushed the tvm-dev/2023-07-20-allreduce-broadcast branch from cd0363b to 38062eb on July 21, 2023 16:25
tqchen merged commit 5029477 into apache:main on Jul 22, 2023
MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
tqchen pushed a commit that referenced this pull request Jul 25, 2023
PR #15327 and #15373 introduced the multi-warp allreduce implementation. At the time of the introduction, I tested the numerical correctness with the workload "take a matrix of ones as input and compute the summation over each row". Both PRs passed this numerical test, but I did not realize that the test is incomplete and cannot guarantee correctness.

The previous implementation has a bug that can be exposed by turning the input matrix from ones into random floating-point numbers.

Therefore, this PR fixes the issue and adds numerical tests for multi-warp allreduce to `test_allreduce_cuda.py`. By removing some redundant tests in that file, we hope to reduce the testing time a bit while still guaranteeing correctness.

Sorry for not testing the implementation completely before.
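
As a side note on why the all-ones workload is too weak: on an all-ones row, any reduction that merely produces "number of elements" looks correct, even if it combines the wrong lanes. The toy host-side example below (purely hypothetical; it has nothing to do with the actual TVM kernels or `test_allreduce_cuda.py`) shows a "buggy" sum that adds the first half of the row twice; it passes the all-ones check but fails immediately on random data.

```cuda
// Toy illustration only (host-side code): a hypothetical buggy reduction that
// sums the first half of a row twice. On an all-ones row of length n it still
// returns n, so an all-ones test cannot catch it; on random data the mismatch
// with a reference sum is obvious.
#include <cstdio>
#include <cstdlib>
#include <vector>

float buggy_sum(const std::vector<float>& row) {
  float s = 0.0f;
  for (size_t i = 0; i < row.size() / 2; ++i) s += 2.0f * row[i];  // wrong elements
  return s;
}

float reference_sum(const std::vector<float>& row) {
  float s = 0.0f;
  for (float x : row) s += x;
  return s;
}

int main() {
  const int n = 128;
  std::vector<float> ones(n, 1.0f), rnd(n);
  for (float& x : rnd) x = static_cast<float>(std::rand()) / RAND_MAX;

  // The all-ones input cannot distinguish the two; random input can.
  std::printf("all-ones: buggy = %g, reference = %g\n",
              buggy_sum(ones), reference_sum(ones));
  std::printf("random:   buggy = %g, reference = %g\n",
              buggy_sum(rnd), reference_sum(rnd));
  return 0;
}
```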