Support mnnvl all2allv from Flashinfer #21003
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for Flashinfer's mnnvl all2allv for Mixture-of-Experts (MoE) layers, which is a significant performance enhancement for distributed inference. The changes are comprehensive, touching custom ops, distributed communicators, the MoE layer implementation, and quantization methods.
The core of the change is the new FlashInferAllToAllManager and its integration into the MoE forward pass. The review focuses on potential issues like hardcoded values, code duplication, and correctness of the communication logic to ensure the new feature is robust and maintainable.
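To make the dispatch step concrete, here is a hedged, single-process sketch (not the PR's implementation) of the bookkeeping an all-to-all-v dispatch needs: deriving per-destination-rank send counts from top-k expert assignments, under the common assumption that experts are sharded contiguously across EP ranks (expert `e` lives on rank `e // experts_per_rank`). Only `topk_ids`, `num_experts`, and the EP sizing mirror names from the diff; the helper itself is illustrative.

```python
# Hypothetical helper, for illustration only: count how many (token, expert)
# pairs this rank must send to each EP rank, given top-k expert IDs per token.
def alltoallv_send_counts(topk_ids: list[list[int]],
                          num_experts: int,
                          ep_size: int) -> list[int]:
    assert num_experts % ep_size == 0
    experts_per_rank = num_experts // ep_size
    counts = [0] * ep_size
    for token_experts in topk_ids:
        for e in token_experts:
            # Contiguous expert sharding: expert e is owned by this rank.
            counts[e // experts_per_rank] += 1
    return counts

# Example: 2 tokens, top_k=2, 8 experts over 4 EP ranks (2 experts/rank).
counts = alltoallv_send_counts([[0, 5], [2, 3]], num_experts=8, ep_size=4)
print(counts)  # [1, 2, 1, 0]
```

These counts (plus the matching receive counts gathered from peers) are the kind of metadata an `alltoall_info`-style structure carries into the variable-sized exchange.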
```python
gpus_per_node: int = 4,  # TODO(shuw): remove hardcode
):
```

```python
print("xxxx" * 100)
print(all2all_manager)
print(f"ep_size:{self.ep_size}, {self.ep_rank}")
```
```python
assert all2all_manager is not None
# TODO(shuw): need to consider chunking for global_num_tokens_cpu
x1, topk_ids1, topk_weights1, alltoall_info = all2all_manager.dispatch(
    get_dp_group().device_communicator,
    global_num_tokens_cpu,
    a1,
    topk_ids,
    topk_weights,
    top_k,
    num_experts,
    self.ep_rank,
    self.ep_size,
)
self.alltoall_info = alltoall_info
```
```python
if enable_flashinfer_fp4_allgather:
    topk_weights, topk_ids, a1q, a1q_scale = \
        get_dp_group().all_gatherv(
            [topk_weights, topk_ids, a1q, a1q_scale],
            dim=0,
            sizes=get_local_sizes(local_tokens))
```
The block guarded by `if enable_flashinfer_fp4_allgather:` appears to perform a redundant communication. The `all2all_manager.dispatch` call on line 122 already performs a gather, and it is then followed by another `get_dp_group().all_gatherv` here. This looks redundant and could hurt performance. Please verify whether both collectives are necessary; if not, the redundant call should be removed.
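One way to address the reviewer's concern is to make the two communication paths explicitly mutually exclusive, so exactly one collective runs per dispatch. A hedged sketch (the flag names mirror the diff; the function and return strings are illustrative, not vLLM API):

```python
# Hypothetical path selector: ensure the alltoallv and fp4-allgather paths
# can never both run for the same dispatch.
def choose_comm_path(enable_alltoall: bool,
                     enable_fp4_allgather: bool) -> str:
    if enable_alltoall and enable_fp4_allgather:
        raise ValueError(
            "alltoallv and fp4 allgather are mutually exclusive")
    if enable_alltoall:
        return "alltoallv"    # tokens routed via all-to-all-v dispatch
    if enable_fp4_allgather:
        return "allgatherv"   # full replication via all_gatherv
    return "naive"            # fallback backend

print(choose_comm_path(True, False))  # alltoallv
```

Guarding the configuration up front turns a silent double-communication into a loud configuration error.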
```python
# if enable_flashinfer_alltoall:
#     print("all2allcalling" * 100)
#     a1q = MnnvlMoe.mnnvl_moe_alltoallv(a1q, self.alltoall_info,
#                                        self.alltoall_workspace,
#                                        self.ep_rank, self.ep_size)
#     a1q_scale = MnnvlMoe.mnnvl_moe_alltoallv(
#         a1q_scale, alltoall_info, self.alltoall_workspace,
#         self.ep_rank, self.ep_size)
```
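For readers unfamiliar with the collective the commented-out `mnnvl_moe_alltoallv` calls perform, here is a hedged single-process simulation of all-to-all-v semantics: each rank sends a variable-sized chunk to every other rank, and rank `r`'s output is the concatenation of the chunks addressed to `r`. This is not FlashInfer's API, just the semantics of the exchange.

```python
# Pure-Python simulation of an all-to-all-v exchange (illustrative only).
def simulate_alltoallv(send: list[list[list[int]]]) -> list[list[int]]:
    """send[src][dst] is the chunk src sends to dst; returns recv[dst]."""
    world = len(send)
    recv = []
    for dst in range(world):
        out: list[int] = []
        for src in range(world):
            out.extend(send[src][dst])  # chunks concatenated in rank order
        recv.append(out)
    return recv

# 2 ranks exchanging uneven chunk sizes:
send = [
    [[1, 2], [3]],      # rank 0 sends [1, 2] to rank 0 and [3] to rank 1
    [[4], [5, 6, 7]],   # rank 1 sends [4] to rank 0 and [5, 6, 7] to rank 1
]
print(simulate_alltoallv(send))  # [[1, 2, 4], [3, 5, 6, 7]]
```

The variable chunk sizes are exactly why dispatch metadata (the `alltoall_info` in the diff) must be computed before the exchange can run.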
Force-pushed from 888fe50 to 1ead2de
Resolved review comments (outdated): vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@wenscarl can you fix the latest merge conflict?
Signed-off-by: Shu Wang <shuw@nvidia.com>
Head branch was pushed to by a user without write access
Signed-off-by: Shu Wang <shuw@nvidia.com> Signed-off-by: Shu Wang. <shuw@nvidia.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.

Needs flashinfer-ai/flashinfer#1245
Purpose
Test Plan
vs.

```shell
VLLM_ALL2ALL_BACKEND="naive" \
...
```
Test Result
accuracy:
perf:
Alltoallv (this PR):
allgather-reducescatter
(Optional) Documentation Update