[MoE] Optimize fused MoE dispatch and workspace allocation with TP leader restriction #24831
skyloevil wants to merge 9 commits into vllm-project:main
Conversation
…g log for total tokens and update workspace comment Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…o avoid _resize_cache under-allocation when TP>1; log once when expanding max_num_tokens Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…e(1) (backend T) and max with config/observed; fixes under-allocation during LL/PPLX and profile runs Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces several key optimizations for Fused MoE layers, primarily by restricting all-to-all dispatch to Tensor Parallel leaders and improving workspace memory allocation. The logic for these optimizations appears correct and well-implemented. The added debug logging is also a welcome addition for better observability.
However, I've identified a significant maintainability concern with code duplication for the TP leader dispatch logic. This logic is copied in both the high-throughput and low-latency DeepEP finalization files. I've left a comment with a suggestion to refactor this into a shared utility to prevent potential future bugs. Addressing this will make the codebase more robust.
```python
# Only DP leader ranks (tp_rank == 0) should dispatch when TP > 1.
tp_world_size = get_tensor_model_parallel_world_size()
tp_rank_in_group = get_tp_group(
).rank_in_group if tp_world_size > 1 else 0
if tp_world_size > 1 and tp_rank_in_group != 0:
    # Non-leader TP ranks send zero tokens to avoid duplicate dispatch.
    a1 = a1[:0]
    topk_ids = topk_ids[:0]
    topk_weights = topk_weights[:0]
```
This block of code for restricting dispatch to the Tensor Parallel leader is duplicated in vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py (lines 152-160).
Duplicating this logic creates a maintainability issue. If this logic needs to be updated in the future, it's easy to forget to update it in both places, which could lead to subtle and hard-to-debug inconsistencies between the high-throughput and low-latency MoE paths.
To improve maintainability and reduce the risk of future bugs, I recommend refactoring this logic into a shared utility function. For example:
```python
# In a shared utility module
def restrict_dispatch_to_tp_leader(*tensors: torch.Tensor) -> tuple[torch.Tensor, ...]:
    """If the current rank is not a TP leader, returns empty tensors."""
    tp_world_size = get_tensor_model_parallel_world_size()
    if tp_world_size <= 1:
        return tensors
    tp_rank_in_group = get_tp_group().rank_in_group
    if tp_rank_in_group != 0:
        return tuple(t[:0] for t in tensors)
    return tensors
```

```python
# In prepare_async methods
a1, topk_ids, topk_weights = restrict_dispatch_to_tp_leader(
    a1, topk_ids, topk_weights
)
```

This would centralize the logic, making the code cleaner and safer to maintain.
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
cc @tlrmchlsmth @bnellnm - I believe we are already using SP in this case, but correct me if I'm wrong.
We use SP for DeepSeekV3 (introduced in #24134), but we need to do it for Qwen and other MoE models. I'll put up a PR today.
I think @varun-sundar-rabindranath should check out this PR as well.
vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py (outdated review thread, resolved)
```python
# Tokens-per-expert capacity actually used by the backend for this
# call. For batched formats (DeepEP-LL / PPLX), aq has shape
# (E, T_backend, K).
# Prefer using aq.size(1) to avoid under-allocation during dummy/profile
# runs or when multiple dispatchers/ranks contribute tokens.
T_backend = aq.size(1) if aq.dim() == 3 else 0

# Fallback capacity from configuration/observation.
num_dispatchers = self.num_dispatchers
observed_M = a.size(0)
if self.max_num_tokens is None:
    T_cfg = observed_M * num_dispatchers
else:
    # Guard with observed_M to avoid under-estimation when TP>1 or
    # during profiling runs.
    max_num_tokens = max(self.max_num_tokens, observed_M)
    if observed_M > self.max_num_tokens:
        with contextlib.suppress(Exception):
            logger.debug_once(
                "[MoE Debug] Increasing workspace max_num_tokens "
                "from configured=%d to observed=%d to avoid OOM. "
                "(num_dispatchers=%d, E=%d, N=%d, K=%d)",
                self.max_num_tokens, observed_M, num_dispatchers,
                num_experts, N, K)
    T_cfg = max_num_tokens * num_dispatchers

# Final capacity: honor the backend's requested T if larger.
T_eff = max(T_backend, T_cfg)

workspace13 = (num_experts, T_eff, max(K, N))
workspace2 = (num_experts, T_eff, (N // 2))
output = (num_experts, T_eff, K)
```
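To make the sizing rule above concrete, here is a standalone sketch that reproduces the T_eff computation; the numbers and the `effective_capacity` helper are hypothetical, not vLLM code:

```python
# Standalone illustration of the capacity rule above:
#   T_eff = max(T_backend, T_cfg)
def effective_capacity(t_backend: int, max_num_tokens: int | None,
                       observed_m: int, num_dispatchers: int) -> int:
    if max_num_tokens is None:
        t_cfg = observed_m * num_dispatchers
    else:
        # Guard against under-estimation when the observed token count
        # exceeds the configured limit (e.g. TP > 1 or profiling runs).
        t_cfg = max(max_num_tokens, observed_m) * num_dispatchers
    return max(t_backend, t_cfg)

# Configured limit 32 is exceeded by 64 observed tokens, so the guard
# expands capacity to 64 * 2 = 128 instead of 32 * 2 = 64.
assert effective_capacity(0, 32, 64, 2) == 128
# A larger backend T dimension (aq.size(1)) takes precedence.
assert effective_capacity(256, 32, 64, 2) == 256
```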
I think these changes might be redundant with another PR that makes the workspaces effectively global. See #23693
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
@varun-sundar-rabindranath PTAL. Thanks.
This pull request has merge conflicts that must be resolved before it can be merged.
@skyloevil this should be covered by #24982, thank you!
Purpose
Optimize MoE (Mixture of Experts) performance by improving dispatch efficiency and workspace memory allocation. The changes address under-allocation issues during tensor parallel operations.
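As a toy illustration of the leader-only dispatch, the sketch below mirrors the diff above in a standalone form; `trim_for_dispatch` and the tensor shapes are hypothetical, not the actual vLLM code path:

```python
import torch

# Hypothetical standalone version of the TP-leader restriction: ranks
# other than the leader (tp_rank == 0) contribute empty tensors, so
# each token enters the all-to-all dispatch exactly once.
def trim_for_dispatch(tp_rank: int, tp_world_size: int,
                      *tensors: torch.Tensor) -> tuple[torch.Tensor, ...]:
    if tp_world_size > 1 and tp_rank != 0:
        return tuple(t[:0] for t in tensors)  # zero-row views, no copy
    return tensors

a1 = torch.randn(8, 16)
topk_ids = torch.zeros(8, 2, dtype=torch.long)

# The leader keeps all 8 tokens; a non-leader rank dispatches none.
(leader_a1,) = trim_for_dispatch(0, 2, a1)
non_leader_a1, _ = trim_for_dispatch(1, 2, a1, topk_ids)
assert leader_a1.shape[0] == 8
assert non_leader_a1.shape[0] == 0
```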
Changes
- Restrict all-to-all dispatch to TP leader ranks (tp_rank == 0) when TP > 1; non-leader ranks send zero tokens to avoid duplicate dispatch
- Base workspace capacity on aq.size(1) (the backend T dimension)
- Guard against under-allocation when TP > 1 by taking the max of the configured max_num_tokens and the observed token count M
- Prevent _resize_cache under-allocation during distributed operations
Test Plan
Command:
```
VLLM_MOE_DP_CHUNK_SIZE=32 VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel --port 9010 --no-enable-prefix-caching
```
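As a quick sanity check that the server responds, a minimal completion request can be sent to the OpenAI-compatible endpoint; this is a sketch assuming the default /v1/completions route on the port used above, with an arbitrary prompt and token budget:

```python
import requests

# Smoke test against the vllm serve instance started above.
resp = requests.post(
    "http://localhost:9010/v1/completions",
    json={
        "model": "Qwen/Qwen3-30B-A3B-FP8",
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```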




lm_eval result
4 GPU H100 80GB HBM3
moe log
server log

Files Modified
Related Work
This PR builds upon and is related to: