[Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py #30775
teddygood wants to merge 14 commits into vllm-project:main from
Conversation
Code Review
This pull request refactors the MoE implementation to better separate concerns, moving the 'naive' EP+DP dispatch/combine logic into its own FusedMoENaivePrepareAndFinalize class. This is a good architectural improvement that increases modularity. However, I've identified two critical issues in vllm/model_executor/layers/fused_moe/layer.py that need to be addressed: a redundant computation of shared expert outputs and a double reduction of the final hidden states. Both issues will lead to incorrect model outputs.
```python
# If there are shared experts but we are not using a modular kernel, the
# shared experts must be called here
if has_separate_shared_experts:
    assert self.shared_experts is not None

    if self.shared_experts_stream is not None:
        # Clone BEFORE switching streams to avoid race condition
        # where routed_expert kernel may mutate hidden_states.
        hidden_states_clone = hidden_states.clone()
        self.shared_experts_stream.wait_stream(current_stream())

        # Run shared experts in parallel on a separate stream
        with torch.cuda.stream(self.shared_experts_stream):
            shared_output = self.shared_experts(hidden_states_clone)

        # Record that the clone will be used by shared_experts_stream
        # to avoid gc issue from deallocation of hidden_states_clone
        # For more details: https://docs.pytorch.org/docs/stable/generated/torch.Tensor.record_stream.html  # noqa: E501
        # NOTE: we dont need shared_output.record_stream(current_stream())
        # because we synch the streams before using shared_output.
        hidden_states_clone.record_stream(self.shared_experts_stream)
    else:
        shared_output = self.shared_experts(hidden_states)
else:
    shared_output = None
```
This block for handling shared experts appears to be redundant and introduces a bug. The logic for computing shared_output is already handled later in this function (lines 1983-1985 and 1997-2007). This new block computes shared_output, which is then overwritten. Additionally, the logic for using a separate stream here is incorrect as it doesn't use the use_shared_experts_stream flag, which depends on the number of tokens. This entire block should be removed to fix the bug.
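A minimal, hypothetical sketch of the redundancy described above (function names are illustrative, not vLLM's actual code): the early `shared_output` computation is simply discarded when the pre-existing logic later in the function recomputes it.

```python
# Hypothetical illustration of the redundancy flagged in the review:
# the early shared_output is overwritten by the later, pre-existing call,
# so the first computation is pure wasted work.

calls = []

def shared_experts(x):
    calls.append(x)          # count invocations
    return x * 2

def forward_with_redundant_block(hidden):
    shared_output = shared_experts(hidden)  # new (redundant) block
    shared_output = shared_experts(hidden)  # pre-existing logic overwrites it
    return shared_output

def forward_fixed(hidden):
    return shared_experts(hidden)           # compute exactly once

forward_with_redundant_block(3)
assert len(calls) == 2       # shared experts ran twice for one forward pass

calls.clear()
forward_fixed(3)
assert len(calls) == 1
```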
```python
if (
    not self.is_sequence_parallel
    and self.reduce_results
    and (self.tp_size > 1 or self.ep_size > 1)
):
    states = self.maybe_all_reduce_tensor_model_parallel(states)
```
This all_reduce operation is redundant. The forward_native method, which calls forward_impl, already performs a reduction on the output. Adding another reduction here will result in a double all_reduce, which is incorrect and will lead to wrong results. This block should be removed to avoid the double reduction.
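The effect of the double reduction can be seen with a tiny pure-Python simulation (hypothetical, not vLLM code): applying an all-reduce twice scales every rank's result by the world size.

```python
# Hypothetical simulation of the double all-reduce described above: two
# "ranks" each hold a partial sum of the MoE output; reducing twice
# yields values WORLD_SIZE times too large.

def all_reduce(per_rank):
    # after an all-reduce, every rank holds the sum over all ranks
    total = sum(per_rank)
    return [total] * len(per_rank)

partials = [1.0, 3.0]        # partial outputs on rank 0 and rank 1
once = all_reduce(partials)  # correct: every rank holds 4.0
twice = all_reduce(once)     # the redundant reduction in forward_native

assert once == [4.0, 4.0]
assert twice == [8.0, 8.0]   # 2x (world size) too large: wrong results
```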
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 77e026b to ebc1ecb
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ebc1ecb to 8f637d7
Signed-off-by: teddygood <ibear6954@gmail.com>
Force-pushed from 8f637d7 to 8fa8b01
hey @teddygood - thanks for your hard work here. I am also motivated to make this change. I think we can have a simpler approach by instead just doing the dispatch/combine on the

In addition, this is also a performance optimization because we send much less data (only the topk experts). Would you like to collaborate with me on it?
hey @robertgshaw2-redhat thanks for the comment. I agree with dispatching/combining only on
sounds good, lets discuss via slack. Ill close this PR for now if thats okay?
sure, thanks.
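The data-volume argument in the thread above can be sized with a rough, hypothetical model (the shapes and EP world size below are made up for illustration): naive dispatch replicates every token's hidden state to all expert-parallel ranks, while a topk dispatch sends each token only to the ranks owning its topk experts.

```python
# Back-of-envelope with hypothetical shapes: compare elements moved by
# naive full replication vs. dispatching only to the topk experts' ranks.

num_tokens, hidden_size, topk, ep_size = 1024, 4096, 2, 8

naive_elems = num_tokens * hidden_size * ep_size  # full replication to all ranks
topk_elems = num_tokens * hidden_size * topk      # only topk destinations per token

assert topk_elems < naive_elems
assert naive_elems // topk_elems == ep_size // topk  # 4x less traffic here
```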
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
- Add a FusedMoENaivePrepareAndFinalize subclass and wire it through the modular kernel hooks.
- Extend FusedMoEPrepareAndFinalize with lightweight pre/post hooks so all prepare/finalize implementations can control routing/combine, eliminating the special-case logic from fused_moe/layer.py.
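A rough sketch of the shape such a class could take (the class name, method names, and signatures here are assumptions for illustration, not vLLM's actual modular-kernel API): prepare handles the naive dispatch by replicating the input to every expert-parallel rank, and finalize combines the per-rank partial outputs by summing them.

```python
# Hypothetical sketch, not vLLM code: a naive prepare/finalize pair where
# prepare replicates hidden states across EP ranks and finalize reduces
# the per-rank partial outputs element-wise.

class FusedMoENaivePrepareAndFinalizeSketch:
    def __init__(self, ep_size: int):
        self.ep_size = ep_size

    def prepare(self, hidden_states: list[float]) -> list[list[float]]:
        # naive dispatch: every EP rank sees the full input
        return [list(hidden_states) for _ in range(self.ep_size)]

    def finalize(self, per_rank_outputs: list[list[float]]) -> list[float]:
        # naive combine: element-wise sum of each rank's partial output
        return [sum(vals) for vals in zip(*per_rank_outputs)]

pf = FusedMoENaivePrepareAndFinalizeSketch(ep_size=2)
shards = pf.prepare([1.0, 2.0])
# pretend each rank's experts scale its shard by (rank + 1)
outs = [[x * (r + 1) for x in shard] for r, shard in enumerate(shards)]
assert pf.finalize(outs) == [3.0, 6.0]
```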
I ran the suggested lm_eval sanity tests using 4×A100 GPUs, serving Qwen/Qwen1.5-MoE-A2.7B with vLLM configured as TP=2, DP=2, and expert parallelism enabled, and evaluated the model via the local completions endpoint.
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.