[Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni #560
Isotr0py merged 8 commits into vllm-project:main from
Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@Isotr0py Could you please help review? Actually, I'm not very familiar with
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@amy-why-3459 PTAL
Replace Qwen3MoeSparseMoeBlock layers with Qwen3OmniMoeTalkerSparseMoeBlock
that includes shared expert support via SharedFusedMoE.
"""
# Get compilation config to clean up registered layer names
compilation_config = self.talker_vllm_config.compilation_config

for layer_idx, layer in enumerate(self.model.layers):
    # Check if this layer has a MoE block (has experts attribute)
    if isinstance(layer.mlp, Qwen3MoeSparseMoeBlock):
        # Remove old layer registration from static_forward_context
        old_experts_prefix = f"{prefix}.model.layers.{layer_idx}.mlp.experts"
        if old_experts_prefix in compilation_config.static_forward_context:
            del compilation_config.static_forward_context[old_experts_prefix]

        # Create new MoE block with shared expert support
        layer.mlp = Qwen3OmniMoeTalkerSparseMoeBlock(
            config=self.config,
            quant_config=self.talker_vllm_config.quant_config,
            prefix=f"{prefix}.model.layers.{layer_idx}.mlp",
        )
This looks a bit hacky as a compilation workaround. Perhaps we can upstream the SharedFusedMoE support to vLLM's qwen3_moe.py in a follow-up PR to avoid this?
Yeah. We should upstream it later.
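For reference, here is a minimal sketch of how such a talker MoE block might wire a replicated shared expert into vLLM's SharedFusedMoE. This is not the exact code in this PR: the import paths, the use of Qwen3MoeMLP for the shared expert, and the config attribute names are assumptions borrowed from vLLM's existing Qwen MoE models.

```python
# Sketch only -- import paths, Qwen3MoeMLP usage, and config attribute names
# are assumptions, not necessarily what this PR ships.
import torch.nn as nn

from vllm.model_executor.layers.shared_fused_moe import SharedFusedMoE
from vllm.model_executor.models.qwen3_moe import Qwen3MoeMLP


class Qwen3OmniMoeTalkerSparseMoeBlock(nn.Module):
    def __init__(self, config, quant_config=None, prefix=""):
        super().__init__()
        # Shared expert: an ordinary MLP, replicated on every TP rank.
        shared_expert = Qwen3MoeMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.shared_expert_intermediate_size,
            hidden_act=config.hidden_act,
            quant_config=quant_config,
            prefix=f"{prefix}.shared_expert",
        )
        # Routed experts: vLLM's fused, hardware-dispatched MoE layer, which
        # also runs the shared expert and returns both outputs as a tuple.
        self.experts = SharedFusedMoE(
            shared_experts=shared_expert,
            num_experts=config.num_experts,
            top_k=config.num_experts_per_tok,
            hidden_size=config.hidden_size,
            intermediate_size=config.moe_intermediate_size,
            reduce_results=False,
            renormalize=config.norm_topk_prob,
            quant_config=quant_config,
            prefix=f"{prefix}.experts",
        )
        # Router gate and forward pass omitted for brevity.
```

The point of the swap is that the routed-expert computation goes through vLLM's fused MoE custom op, which is dispatched per hardware platform, instead of a hand-written shared-expert path.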
Any progress on this?
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Yes. I'm updating it today.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d0a9346260
# Combine shared and routed expert outputs
if self._shared_expert_wrapper is not None:
    # SharedFusedMoE returns tuple: (shared_out, fused_out)
    final_hidden_states = final_hidden_states[0] + final_hidden_states[1]

# Apply tensor parallel reduction if needed
if self.tp_size > 1:
    final_hidden_states = self.experts.maybe_all_reduce_tensor_model_parallel(
        final_hidden_states
    )
Reduce routed output before adding shared expert
If the shared expert is instantiated on every tensor-parallel rank (as it is here) and only the routed experts are sharded, the current order will all-reduce the shared expert output along with the routed output. That sums the shared contribution across TP ranks when tp_size > 1, so logits scale with TP size and diverge from HF for multi-GPU runs. A safer order is to all-reduce only the routed output, then add the shared output after reduction (or otherwise prevent the shared output from being summed). This only affects TP>1 with a replicated shared expert.
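A minimal sketch of the ordering the review suggests, reusing the names from the snippet above (_shared_expert_wrapper, tp_size, maybe_all_reduce_tensor_model_parallel); the actual fix that landed may differ.

```python
# Sketch of the suggested ordering, not necessarily the exact fix applied.
if self._shared_expert_wrapper is not None:
    # SharedFusedMoE returns a tuple: (shared_out, routed_out).
    shared_out, routed_out = final_hidden_states

    # All-reduce only the routed output, which is sharded across TP ranks.
    if self.tp_size > 1:
        routed_out = self.experts.maybe_all_reduce_tensor_model_parallel(routed_out)

    # Add the replicated shared-expert output once, after the reduction,
    # so it is not summed tp_size times.
    final_hidden_states = shared_out + routed_out
elif self.tp_size > 1:
    final_hidden_states = self.experts.maybe_all_reduce_tensor_model_parallel(
        final_hidden_states
    )
```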
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Many thanks!
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Purpose
Add Qwen3OmniMoeTalkerSparseMoeBlock to use SharedFusedMoE instead of FusedMoE, so that we don't need to write the shared fused MoE manually and can leverage the high-performance operators already implemented in vLLM. Since this is a custom op, it can be dispatched based on the hardware platform, enabling better performance on each target hardware.

Test Plan
Test Result
execute_model: 177 ms -> 123 ms
Qwen3MoeDecoderLayer_0: 34 ms -> 21 ms
Qwen3MoeSparseMoeBlock_0: 24 ms -> 13 ms