[Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni #560
Isotr0py merged 8 commits into vllm-project:main from
Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@Isotr0py Could you please help review? Actually, I'm not very familiar with
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@amy-why-3459 PTAL
Replace Qwen3MoeSparseMoeBlock layers with Qwen3OmniMoeTalkerSparseMoeBlock
that includes shared expert support via SharedFusedMoE.
"""
# Get compilation config to clean up registered layer names
compilation_config = self.talker_vllm_config.compilation_config

for layer_idx, layer in enumerate(self.model.layers):
    # Check if this layer has a MoE block (has experts attribute)
    if isinstance(layer.mlp, Qwen3MoeSparseMoeBlock):
        # Remove old layer registration from static_forward_context
        old_experts_prefix = f"{prefix}.model.layers.{layer_idx}.mlp.experts"
        if old_experts_prefix in compilation_config.static_forward_context:
            del compilation_config.static_forward_context[old_experts_prefix]

        # Create new MoE block with shared expert support
        layer.mlp = Qwen3OmniMoeTalkerSparseMoeBlock(
            config=self.config,
            quant_config=self.talker_vllm_config.quant_config,
            prefix=f"{prefix}.model.layers.{layer_idx}.mlp",
        )
This looks a bit hacky as a compilation workaround. Perhaps we can upstream the SharedFusedMoE support to vLLM's qwen3_moe.py in a follow-up PR to avoid this?
Yeah. We should upstream it later.
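For reference, here is a minimal sketch of how such a talker MoE block might wire a replicated shared expert into vLLM's SharedFusedMoE. This is not the exact code in this PR: the import paths, the use of Qwen3MoeMLP for the shared expert, and the config attribute names are assumptions borrowed from vLLM's existing Qwen MoE models.

```python
# Sketch only -- import paths, Qwen3MoeMLP usage, and config attribute names
# are assumptions, not necessarily what this PR ships.
import torch.nn as nn

from vllm.model_executor.layers.shared_fused_moe import SharedFusedMoE
from vllm.model_executor.models.qwen3_moe import Qwen3MoeMLP


class Qwen3OmniMoeTalkerSparseMoeBlock(nn.Module):
    def __init__(self, config, quant_config=None, prefix=""):
        super().__init__()
        # Shared expert: an ordinary MLP, replicated on every TP rank.
        shared_expert = Qwen3MoeMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.shared_expert_intermediate_size,
            hidden_act=config.hidden_act,
            quant_config=quant_config,
            prefix=f"{prefix}.shared_expert",
        )
        # Routed experts: vLLM's fused, hardware-dispatched MoE layer, which
        # also runs the shared expert and returns both outputs as a tuple.
        self.experts = SharedFusedMoE(
            shared_experts=shared_expert,
            num_experts=config.num_experts,
            top_k=config.num_experts_per_tok,
            hidden_size=config.hidden_size,
            intermediate_size=config.moe_intermediate_size,
            reduce_results=False,
            renormalize=config.norm_topk_prob,
            quant_config=quant_config,
            prefix=f"{prefix}.experts",
        )
        # Router gate and forward pass omitted for brevity.
```

The point of the swap is that the routed-expert computation goes through vLLM's fused MoE custom op, which is dispatched per hardware platform, instead of a hand-written shared-expert path.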
Any progress on this?
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Yes. I'm updating it today.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d0a9346260
# Combine shared and routed expert outputs
if self._shared_expert_wrapper is not None:
    # SharedFusedMoE returns tuple: (shared_out, fused_out)
    final_hidden_states = final_hidden_states[0] + final_hidden_states[1]

# Apply tensor parallel reduction if needed
if self.tp_size > 1:
    final_hidden_states = self.experts.maybe_all_reduce_tensor_model_parallel(
        final_hidden_states
    )
Reduce routed output before adding shared expert
If the shared expert is instantiated on every tensor-parallel rank (as it is here) and only the routed experts are sharded, the current order will all-reduce the shared expert output along with the routed output. That sums the shared contribution across TP ranks when tp_size > 1, so logits scale with TP size and diverge from HF for multi-GPU runs. A safer order is to all-reduce only the routed output, then add the shared output after reduction (or otherwise prevent the shared output from being summed). This only affects TP>1 with a replicated shared expert.
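A minimal sketch of the ordering the review suggests, reusing the names from the snippet above (_shared_expert_wrapper, tp_size, maybe_all_reduce_tensor_model_parallel); the actual fix that landed may differ.

```python
# Sketch of the suggested ordering, not necessarily the exact fix applied.
if self._shared_expert_wrapper is not None:
    # SharedFusedMoE returns a tuple: (shared_out, routed_out).
    shared_out, routed_out = final_hidden_states

    # All-reduce only the routed output, which is sharded across TP ranks.
    if self.tp_size > 1:
        routed_out = self.experts.maybe_all_reduce_tensor_model_parallel(routed_out)

    # Add the replicated shared-expert output once, after the reduction,
    # so it is not summed tp_size times.
    final_hidden_states = shared_out + routed_out
elif self.tp_size > 1:
    final_hidden_states = self.experts.maybe_all_reduce_tensor_model_parallel(
        final_hidden_states
    )
```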
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Many thanks!
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Purpose
Add Qwen3OmniMoeTalkerSparseMoeBlock to use SharedFusedMoE instead of FusedMoE, so that we don't need to write the shared fused MoE manually and can leverage the high-performance operators already implemented in vLLM. Since this is a custom op, it can be dispatched based on the hardware platform, enabling better performance on each target hardware.

Test Plan
Test Result
execute_model: 177 ms -> 123 ms
Qwen3MoeDecoderLayer_0: 34 ms -> 21 ms
Qwen3MoeSparseMoeBlock_0: 24 ms -> 13 ms