Optimize MoE Token Dispatch for Tensor Parallel Configurations #22993
skyloevil wants to merge 293 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run … Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request optimizes MoE token dispatching in tensor parallel configurations by restricting token dispatch to the leader rank. The implementation introduces a new method _get_effective_num_dispatchers to control the number of dispatchers based on the tensor parallel rank, which correctly reduces workspace allocation for non-leader ranks. The change is well-implemented and should deliver the described performance benefits. I have one suggestion to move a local import to the top level for better performance and code style.
For improved performance and code clarity, it's recommended to move this import to the top of the file. Local imports can introduce overhead, especially if this method is called in a performance-sensitive path. Please remove the local import from this method and add from vllm.distributed import get_tensor_model_parallel_world_size, get_tensor_model_parallel_rank to the file-level imports.
Force-pushed from 6cad37f to 76782d4 (Compare)
Hi @skyloevil. Thank you for the fix. AFAICT, the TP ranks still participate in the all2alls, no? If that is the case, then we might end up in a spot where the workspaces aren't big enough to accommodate all the incoming tokens. Can you confirm that this doesn't happen? Ways to test / debug:
besides testing for accuracy, it is quite adept at catching corner cases.
If multiple TP ranks are involved in the all2alls, the solution could be as simple as making only TP rank 0 participate in the all2all. A slightly more complicated but optimal solution would be to dispatch only a part of the tokens from each TP rank. Note that the second approach is required only for DeepEP all2all kernels; PPLX kernels do this automatically when TP > 1. Also, can you share any perf numbers? Thanks 🙌
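The two options the reviewer outlines — leader-only participation versus sharding the replicated tokens across TP ranks — can be sketched as follows. The helper name is hypothetical, and plain Python lists stand in for token tensors:

```python
def tokens_to_dispatch(tokens, tp_rank: int, tp_size: int,
                       leader_only: bool = True):
    """Return the slice of `tokens` this TP rank should send into the all2all.

    `tokens` is replicated across TP ranks, so sending from every rank is
    redundant. Hypothetical sketch, not vLLM's actual dispatch code.
    """
    if tp_size <= 1:
        return tokens
    if leader_only:
        # Option 1: only TP rank 0 sends; other ranks send nothing.
        return tokens if tp_rank == 0 else tokens[:0]
    # Option 2 (needed for DeepEP; PPLX does this internally when TP > 1):
    # each rank sends a disjoint slice of the replicated tokens.
    chunk = (len(tokens) + tp_size - 1) // tp_size
    return tokens[tp_rank * chunk:(tp_rank + 1) * chunk]
```

Either way the total payload entering the all2all is sent once instead of `tp_size` times; option 2 additionally balances the send bandwidth across ranks.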
This comment was marked as outdated.
Force-pushed from 20bd6ed to 533759c (Compare)
…h optimization

- Add debug logs to track FP8 quantization method configuration and Deep GEMM support detection
- Implement detailed logging in BatchedTritonOrDeepGemmExperts for initialization and runtime selection
- Add verification logs for the _get_effective_num_dispatchers method to validate the tensor parallel dispatch optimization
- Include environment-controlled logging (VLLM_LOG_MOE_DISPATCH) for PR vllm-project#22993 verification
- Enable tracing of the complete MoE expert selection pipeline from quantization to execution
- All debug logs use appropriate log levels (DEBUG for detailed tracing, INFO for key verification points)

These logs enable developers to:
1. Verify the MoE dispatch optimization works correctly in TP > 1 scenarios
2. Trace why specific expert implementations are selected
3. Debug expert_num_tokens allocation and workspace sizing issues
4. Validate that leader/non-leader rank dispatch logic functions as expected

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Optimize MoE Token Dispatch for Tensor Parallel Configurations
Summary
This PR optimizes MoE (Mixture of Experts) token dispatching in tensor parallel (TP) configurations to reduce cross-rank communication overhead. By restricting token dispatch to the leader rank when TP > 1, dispatch traffic shrinks by a factor equal to the TP size: 2x at TP=2 up to 8x at TP=8.
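The claimed 2x–8x range follows directly from the TP degree: if all `tp_size` ranks previously sent identical payloads into the all2all and only rank 0 sends afterwards, traffic drops by exactly `tp_size`. A back-of-the-envelope check with assumed sizes (token count, hidden size, and dtype width are illustrative, not measurements from the PR):

```python
def dispatch_bytes(num_tokens: int, hidden: int, dtype_bytes: int,
                   tp_size: int, leader_only: bool) -> int:
    # Total bytes sent into the all2all across all TP ranks for one batch.
    # Without the optimization, every TP rank sends the same replicated
    # payload; with it, only the leader (rank 0) sends.
    senders = 1 if leader_only else tp_size
    return senders * num_tokens * hidden * dtype_bytes


# Assumed example: 1024 tokens, hidden=4096, fp16 (2 bytes), TP=8.
before = dispatch_bytes(1024, 4096, 2, tp_size=8, leader_only=False)
after = dispatch_bytes(1024, 4096, 2, tp_size=8, leader_only=True)
reduction = before // after  # equals tp_size
```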
Problem
In the current implementation, when using tensor parallelism with MoE models, every TP rank dispatches the same replicated tokens independently, leading to redundant communication across ranks. This creates unnecessary overhead in distributed training and inference scenarios.
Solution
Core Changes
File: vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py
- Added _get_effective_num_dispatchers() method
- Updated workspace_shapes() method

Algorithm Details
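The description — a new `_get_effective_num_dispatchers()` feeding `workspace_shapes()` — suggests a shape computation along these lines. This is a hypothetical sketch: the function names mirror the PR but the signatures are simplified stand-ins, and it assumes non-leader ranks receive no dispatched tokens, which is exactly the point the reviewer asks to verify.

```python
def effective_num_dispatchers(num_dispatchers: int,
                              tp_rank: int, tp_size: int) -> int:
    # Leader rank (or no TP) sizes workspaces for every dispatcher;
    # non-leader ranks keep only a minimal buffer since they dispatch
    # nothing under leader-only dispatch.
    if tp_size <= 1 or tp_rank == 0:
        return num_dispatchers
    return 1


def workspace_shape(max_tokens_per_dispatcher: int, hidden: int,
                    num_dispatchers: int, tp_rank: int, tp_size: int):
    # Workspace must hold the tokens arriving from every active dispatcher.
    eff = effective_num_dispatchers(num_dispatchers, tp_rank, tp_size)
    return (eff * max_tokens_per_dispatcher, hidden)
```

Under these assumptions, at TP=4 with 8 dispatchers a non-leader rank allocates 1/8 of the leader's workspace; if non-leader ranks can still receive all2all traffic, this shrunken buffer is where an overflow would surface.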
Performance Impact
Benefits
Implementation Features
Testing Considerations
The optimization maintains functional correctness while improving performance.