[ROCm][Perf] Add Fused Shared Expert (FSE) support for Qwen3-Next by nholmber · Pull Request #39280 · vllm-project/vllm

nholmber · 2026-04-08T08:28:25Z

Purpose

Fuse the shared expert into the AITER MoE kernel as an extra expert slot when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1. This eliminates the separate shared expert MLP forward pass and improves output throughput by 16–24% in the benchmarks below.

The router gate [num_experts, hidden] and shared expert gate [num_shared, hidden] weight matrices are fused into a single [num_experts + num_shared, hidden] matrix at init. One F.linear call produces combined logits, and the topk_softmax kernel applies routing softmax and shared expert activation (sigmoid) in a single launch; no extra kernel launches for the shared expert gate projection, activation, or buffer copy.
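The gate fusion described above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the helper names (`linear`, `fused_route`) are hypothetical, plain lists stand in for tensors, a single token is routed, and no top-k renormalization is applied. It is not the AITER kernel or the vLLM implementation.

```python
import math


def linear(weight, x):
    """Matrix-vector product: one logit per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fused_route(router_w, shared_gate_w, x, top_k):
    """Route one token using a single fused gate matrix (hypothetical helper)."""
    num_experts = len(router_w)
    # Stack the two gate matrices: [num_experts + num_shared, hidden].
    # The PR does this once at init; it is rebuilt here for brevity.
    fused_w = router_w + shared_gate_w
    logits = linear(fused_w, x)  # one projection covers both gates
    # Routing softmax is taken over the routed experts only.
    probs = softmax(logits[:num_experts])
    topk_ids = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
    topk_weights = [probs[i] for i in topk_ids]
    # Shared experts occupy the extra slots and are always selected,
    # weighted by the sigmoid of their own gate logit.
    for j, logit in enumerate(logits[num_experts:]):
        topk_ids.append(num_experts + j)
        topk_weights.append(sigmoid(logit))
    return topk_weights, topk_ids


router_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 routed experts, hidden = 2
shared_gate_w = [[0.0, 0.0]]                     # 1 shared expert gate
weights, ids = fused_route(router_w, shared_gate_w, [1.0, 2.0], top_k=2)
# ids == [2, 1, 3]: two routed experts plus the shared slot at index 3
```

The downstream expert computation only sees `(topk_weights, topk_ids)`, so the shared expert looks like one more selected expert.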

Changes:

qwen3_next.py: Model-level FSE wiring (init, weight loading, expert mapping, forward tuple unpack for SharedFusedMoE compatibility)
qwen3_next_mtp.py: MTP weight loading for fused expert slot
moe_runner_base.py: Lazy gate weight fusion in forward_dispatch(); thread num_fused_shared_experts through routing
_aiter_ops.py: Extend topk_softmax with num_shared_experts and shared_expert_scoring_func params; add runtime version check for graceful fallback with older AITER
fused_topk_router.py: Fused kernel dispatch path + non-fused fallback (separate softmax, sigmoid, inject)
base_router.py + router subclasses: Add num_fused_shared_experts param to _compute_routing() interface
rocm_aiter_fused_moe.py: inject_shared_expert_weights() for merging routed topk results with the shared expert buffer
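As a rough picture of the non-fused fallback (separate softmax, sigmoid, then inject), the merge step might look like the sketch below. The function name, argument order, and list-based layout are assumptions made for illustration; the actual signature of inject_shared_expert_weights() is not shown in this PR description.

```python
def inject_shared_expert_weights_sketch(topk_weights, topk_ids, shared_weights, num_experts):
    """Append shared-expert slots to per-token routed top-k results.

    topk_weights / topk_ids: one row per token with the routed selections.
    shared_weights: one row per token with one sigmoid weight per shared expert.
    Shared expert j is assigned expert id num_experts + j (assumed layout).
    """
    out_weights, out_ids = [], []
    for w_row, id_row, s_row in zip(topk_weights, topk_ids, shared_weights):
        shared_ids = [num_experts + j for j in range(len(s_row))]
        out_weights.append(list(w_row) + list(s_row))
        out_ids.append(list(id_row) + shared_ids)
    return out_weights, out_ids


merged_w, merged_ids = inject_shared_expert_weights_sketch(
    topk_weights=[[0.7, 0.3], [0.6, 0.4]],
    topk_ids=[[5, 1], [0, 2]],
    shared_weights=[[0.9], [0.2]],
    num_experts=8,
)
# merged_ids == [[5, 1, 8], [0, 2, 8]]
```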

Test Plan

Model: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
Container: vllm/vllm-openai-rocm:v0.19.0
Hardware: MI355X, ROCm 7.2.1
AITER: 02d8af55e (with 7-arg topk_softmax support); the stock version, which uses the non-fused topk + sigmoid path, was also tested
Accuracy: GSM8K 8-shot flexible-extract (FSE=0 baseline vs FSE=1)
Throughput: vllm bench serve, random 1k input / 1k output at c4/c8/c16/c32
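Because the test plan covers both a 7-arg topk_softmax and a stock AITER build without it, a runtime capability check is needed somewhere. One possible shape for such a check is sketched below using signature inspection on plain Python stand-ins. The stand-in functions and parameter names are hypothetical, and the PR's real check may work differently (for example, compiled ops may not expose a Python signature).

```python
import inspect


def supports_fused_shared_scoring(topk_softmax_fn, required_params=7):
    """Return True if the callable accepts at least `required_params` parameters."""
    try:
        params = inspect.signature(topk_softmax_fn).parameters
    except (TypeError, ValueError):
        # No introspectable signature: be conservative and use the fallback path.
        return False
    return len(params) >= required_params


# Hypothetical stand-ins for an older and a newer kernel binding.
def old_topk_softmax(weights, ids, token_expert_ids, gating_output, renormalize):
    pass


def new_topk_softmax(weights, ids, token_expert_ids, gating_output, renormalize,
                     num_shared_experts, shared_expert_scoring_func):
    pass


use_fused_old = supports_fused_shared_scoring(old_topk_softmax)  # False
use_fused_new = supports_fused_shared_scoring(new_topk_softmax)  # True
```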

Sample commands

docker run --name fse-test -d \
  --device /dev/dri --device /dev/kfd \
  --group-add video --ipc host --network host \
  --security-opt seccomp=unconfined --shm-size 64G \
  --entrypoint "" \
  -e HIP_VISIBLE_DEVICES=0 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:v0.19.0 sleep infinity
# Open a shell in the container, then install this branch
docker exec -it fse-test bash
pip install git+https://github.com/nholmber/vllm.git@pr/fse-qwen3next-v2 --no-build-isolation
pip install "lm_eval[api]"

# Start the server
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=<0|1> \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 16384 \
  --max-num-seqs 256 \
  --attention-backend ROCM_AITER_FA \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}'

Test Result

Accuracy (lm_eval GSM8K 8-shot, flexible-extract)

Config	Score	Stderr
TP1 FSE=0	0.8537	±0.0097
TP1 FSE=1	0.8567	±0.0097
TP2 FSE=0	0.8484	±0.0099
TP2 FSE=1	0.8522	±0.0098

Verdict: All deltas within standard error. No accuracy regression.

Throughput (output tok/s, 1k input / 1k output)

TP1:

Concurrency	FSE=0 (tok/s)	FSE=1 (tok/s)	Speedup
4	458.5	557.0	+21.5%
8	854.9	1037.4	+21.3%
16	1535.8	1855.6	+20.8%
32	2632.4	3060.3	+16.3%

TP2:

Concurrency	FSE=0 (tok/s)	FSE=1 (tok/s)	Speedup
4	463.5	568.0	+22.5%
8	855.6	1048.8	+22.6%
16	1638.0	2026.8	+23.7%
32	2930.0	3557.1	+21.4%

Verdict: FSE provides +16–24% output throughput improvement across
concurrency levels and TP configurations.

gemini-code-assist

Code Review

This pull request implements Fused Shared Expert (FSE) support for ROCm AITER MoE kernels, specifically targeting Qwen3Next models. The changes include a new weight injection mechanism in the MoE runner and logic to remap shared expert weights to fused expert slots during model loading. Review feedback highlights two critical issues in qwen3_next.py: a crash-inducing logic error when unpacking the output of SharedFusedMoE in the default case, and a potential TypeError caused by passing None instead of 0 for the number of shared experts.

gemini-code-assist · 2026-04-08T08:34:49Z

        if self.shared_expert is not None:
            final_hidden_states = final_hidden_states[0] + final_hidden_states[1]
+        elif self.is_fse_enabled:
+            _, final_hidden_states = final_hidden_states


The current logic for unpacking the result from SharedFusedMoE is broken for the default case where both shared_expert and is_fse_enabled are False. Since SharedFusedMoE.forward always returns a tuple (either (shared, fused) or (None, fused)), final_hidden_states will remain a tuple if both conditions are False, causing a crash in the subsequent .view() call or all_gather operation. The logic should be simplified to always unpack the second element when shared_expert is None.

Suggested change

-        if self.shared_expert is not None:
-            final_hidden_states = final_hidden_states[0] + final_hidden_states[1]
-        elif self.is_fse_enabled:
-            _, final_hidden_states = final_hidden_states
+        if self.shared_expert is not None:
+            final_hidden_states = final_hidden_states[0] + final_hidden_states[1]
+        else:
+            _, final_hidden_states = final_hidden_states

Thanks, it's fixed

gemini-code-assist · 2026-04-08T08:34:49Z

            enable_eplb=self.enable_eplb,
            num_redundant_experts=self.n_redundant_experts,
            is_sequence_parallel=self.is_sequence_parallel,
+            n_shared_experts=1 if self.is_fse_enabled else None,


Passing None for n_shared_experts when FSE is disabled will cause a TypeError in moe_runner_base.py during the comparison if num_fused_shared > 0:. It should default to 0 instead of None to ensure compatibility with the runner's logic and the AITER metadata initialization.

Suggested change

-            n_shared_experts=1 if self.is_fse_enabled else None,
+            n_shared_experts=1 if self.is_fse_enabled else 0,
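The general Python behaviour behind this comment is that ordering comparisons between None and an int raise TypeError. The small sketch below shows that, along with one way to normalize an optional count before comparing. The helper names are hypothetical and not taken from the PR.

```python
def comparison_raises_type_error(value):
    """Return True if `value > 0` raises TypeError (as it does for None)."""
    try:
        value > 0
    except TypeError:
        return True
    return False


def normalize_num_shared(n_shared_experts):
    """Map an optional count to an int so later `> 0` checks are safe."""
    return 0 if n_shared_experts is None else n_shared_experts


none_fails = comparison_raises_type_error(None)  # True
zero_fails = comparison_raises_type_error(0)     # False
```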

tjtanaa · 2026-04-09T10:03:41Z

            assert shared_experts_input is not None
            self._shared_experts.apply(shared_experts_input, order)

+    def _inject_fse_weights(


I'd rather not use injection. Fusing the shared expert is also not a new feature; it was first introduced for DeepSeekV3.

Can you try implementing this following the approach taken by DeepSeek?
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/deepseek_v2.py

I would also like @robertgshaw2-redhat's feedback on this PR.

Thanks for the feedback. Looking into refactoring this to use the same approach taken in DeepSeek.

I'd vote for the DeepSeek approach as well.

nholmber · 2026-04-13T11:10:23Z

@tjtanaa the PR has been revised and description updated. Could you review it again? @ChuanLi1101 @dllehr-amd could you also take a look?

The PR now covers:

Fuse gate projection for shared and routed experts
Fuse shared expert scoring function into routed expert topk-softmax (new AITER kernel with fallback)
Fuse shared expert into routed experts for MoE

Coming back to your question about re-using the DeepSeekV3.2 shared expert fusion, the main difference is that Qwen3-Next has a learned shared_expert_gate (a per-token sigmoid gate on the shared expert output), whereas DeepSeek always includes the shared expert with weight 1.0. This gate is why we need the first two optimizations: fusing the gate projection into the router matmul and fusing the sigmoid activation into the topk kernel.
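The difference can be made concrete with a small sketch of how the expert outputs are combined for one token. The function names and list-based "tensors" are illustrative assumptions, not vLLM code; the only point is that DeepSeek uses a fixed weight of 1.0 for the shared expert while Qwen3-Next uses a per-token sigmoid of a learned gate logit.

```python
import math


def combine(routed_outputs, routed_weights, shared_output, shared_weight):
    """Weighted sum of routed expert outputs plus a weighted shared expert output."""
    out = [0.0] * len(shared_output)
    for w, y in zip(routed_weights, routed_outputs):
        for i, v in enumerate(y):
            out[i] += w * v
    for i, v in enumerate(shared_output):
        out[i] += shared_weight * v
    return out


def deepseek_style(routed_outputs, routed_weights, shared_output):
    # Shared expert is always added with weight 1.0.
    return combine(routed_outputs, routed_weights, shared_output, 1.0)


def qwen3_next_style(routed_outputs, routed_weights, shared_output, gate_logit):
    # Shared expert is scaled by sigmoid(shared_expert_gate(x)) for this token.
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return combine(routed_outputs, routed_weights, shared_output, gate)


routed = [[1.0, 2.0]]
ungated = deepseek_style(routed, [0.5], [4.0, 4.0])                 # [4.5, 5.0]
gated = qwen3_next_style(routed, [0.5], [4.0, 4.0], gate_logit=0.0)  # [2.5, 3.0]
```

Because the gate weight depends on the token, it has to be produced alongside the routing weights, which is why the gate projection and sigmoid are fused into the router path.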

Note on code placement. The changes follow the existing runner/router separation rather than living in FusedMoE.apply():

Gate fusion → runner (moe_runner_base.py): the runner already owns the gate modules
Fused scoring → router (fused_topk_router.py): routing/expert selection is the router's responsibility
Expert computation (apply()) is untouched: it receives the same (topk_weights, topk_ids) interface regardless of whether they came from fused or separate kernels

mergify · 2026-04-23T14:42:28Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @nholmber.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tjtanaa · 2026-05-04T00:08:02Z

@nholmber can you help rebase the PR? Thanks.

nholmber · 2026-05-04T15:45:55Z

Rebased

Signed-off-by: Doug Lehr <douglehr@amd.com>

mergify · 2026-05-08T07:44:08Z

Hi @nholmber, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…outer Move the import of aiter_topK_meta_data from module level into the _compute_routing method body. The module-level import captured the initial None value and never saw the reassignment by init_aiter_topK_meta_data, causing shared expert weights to be silently dropped and a ~33 point accuracy regression on gsm8k. Also remove unused fse_fuse_gate variable in layer.py and fix E501 line length in router_factory.py. Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
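The stale-import problem described in this commit message is a general Python pitfall: `from module import name` binds the value the name has at import time, so a later reassignment inside the module is not seen. The sketch below reproduces it with a throwaway in-memory module. The module and attribute names are made up for the demonstration and are not the AITER ones.

```python
import sys
import types

# Build a stand-in module whose global starts as None and is set later.
meta = types.ModuleType("fake_aiter_meta")
meta.topk_meta_data = None


def _init_topk_meta_data():
    # Rebinds the module attribute, as an init function would.
    meta.topk_meta_data = {"ready": True}


meta.init_topk_meta_data = _init_topk_meta_data
sys.modules["fake_aiter_meta"] = meta

# Module-level import: captures the current value (None) under a local name.
from fake_aiter_meta import topk_meta_data as captured_at_import


def read_stale():
    return captured_at_import


def read_fresh():
    # Importing inside the function looks the attribute up at call time.
    from fake_aiter_meta import topk_meta_data
    return topk_meta_data


meta.init_topk_meta_data()
stale = read_stale()  # still None
fresh = read_fresh()  # {"ready": True}
```

Accessing the attribute through the module object (`fake_aiter_meta.topk_meta_data`) at call time would avoid the problem as well.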

Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

dllehr-amd · 2026-05-08T12:27:02Z

            enable_eplb=self.enable_eplb,
            num_redundant_experts=self.n_redundant_experts,
            is_sequence_parallel=self.is_sequence_parallel,
+            n_shared_experts=1,


@tpopp We are now binding n_shared_experts and shared_expert_gate to FusedMoE without any checks. Could this be the cause of the failure?

When FSE is disabled (non-ROCm or env var off), the shared expert is handled by the model's own MLP. Passing shared_expert_gate to FusedMoE in that case caused _fse_fuse_gate to activate, fusing gate weights into [num_experts+1, hidden] and corrupting routing. Set shared_expert_gate=None and n_shared_experts=None in the non-FSE path so FusedMoE does not attempt gate fusion. Fixes test_hybrid[tiny-random/qwen3-next-moe] regression. Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

robertgshaw2-redhat · 2026-05-08T14:21:13Z

            if n_shared_experts is not None and self.aiter_fmoe_shared_expert_enabled
            else 0
        )
+        self.shared_expert_gate = shared_expert_gate


seems unnecessary to have this attribute?

I'm going to wait for CI to finish before pushing anything else. I'm happy to remove it. This is consistent with some other attributes that aren't used elsewhere and that was the reason for this. I thought there might be debugging or other reasons that most construction args are saved as attributes.

I know, I hate all those old attrs since they make it hard to tell what "owns" the object

robertgshaw2-redhat · 2026-05-08T14:23:55Z

+            )
+
+            shared_weights = torch.sigmoid(shared_logits)
+            topk_weights, topk_ids = inject_shared_expert_weights(


seems to me this inject_shared_expert_weights function should be defined in this file

robertgshaw2-redhat · 2026-05-08T14:24:50Z

        )

+    if (
+        num_fused_shared_experts > 0


what happens if num_fused_shared_experts > 0 and either scoring_func != softmax or is not aiter?

should we just reject?

currently we take FusedTopKRouter. Which is what happened prior as well. So I think we're okay on that front. It's not a change in behavior in the router unless the specific 3 conditions here are met

please open a github issue to audit and guard this for future so we have a clear view of what does and does not work

tpopp · 2026-05-08T15:46:08Z

@robertgshaw2-redhat I've created #42088. Can you or @dllehr-amd assign it to me?

robertgshaw2-redhat · 2026-05-08T19:37:55Z

test failures unrelated. passes all key moe tests

…lm-project#39280) Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Tres Popp <tres.popp@amd.com> Signed-off-by: Doug Lehr <douglehr@amd.com> Co-authored-by: nholmber <nholmber@users.noreply.github.com> Co-authored-by: Tres <tpopp@users.noreply.github.com> Co-authored-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Doug Lehr <douglehr@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>

…lm-project#39280) Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Tres Popp <tres.popp@amd.com> Signed-off-by: Doug Lehr <douglehr@amd.com> Co-authored-by: nholmber <nholmber@users.noreply.github.com> Co-authored-by: Tres <tpopp@users.noreply.github.com> Co-authored-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Doug Lehr <douglehr@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

## Purpose

Extend the AITER Fused Shared Expert (FSE) path - originally added for DeepSeek-V2/V3 (vllm-project#28540) and Qwen3-Next (vllm-project#39280) - to the GLM-4 MoE family (GLM-4.5, GLM-4.6, GLM-4.7). When `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1` the shared expert is folded into the AITER FusedMoE kernel as `n_shared_experts` extra expert slots, eliminating the separate shared-expert MLP forward pass at low/medium concurrency.

## Changes

Single-file model wiring in `vllm/model_executor/models/glm4_moe.py`, mirroring the canonical `deepseek_v2.py` FSE pattern:

* `Glm4MoE.__init__`
  - Cache `is_rocm_aiter_moe_enabled` and `is_fusion_moe_shared_experts_enabled` from `rocm_aiter_ops`.
  - When FSE is enabled, skip building the separate `shared_experts` MLP and pass `n_shared_experts=config.n_shared_experts` to `FusedMoE` so the AITER kernel routes the shared expert(s) as extra slots in the routed tensor.
  - Switch `apply_routed_scale_to_output` to `not self.is_rocm_aiter_moe_enabled`. AITER applies `routed_scaling_factor` internally, per routed slot; applying it again post-fusion would also scale the FSE shared-expert slot (which the kernel inserts with unit weight), producing a structural magnitude error in every MoE layer. This matches `deepseek_v2.py`. (`routed_scaling_factor=2.5` for GLM-4.7, so the unfixed path showed a ~48 pp gsm8k regression.)
* `Glm4MoeModel.get_expert_mapping`
  - Widen `num_experts` by `config.n_shared_experts` when FSE is on so the weight loader enumerates the appended slots.
* `Glm4MoeModel.load_weights`
  - Treat `mlp.shared_experts.{gate,up,down}_proj.*` as expert-style tensors when FSE is on (skip the stacked QKV/gate_up linear path).
  - Split each widened shared-expert tensor into `n_shared_experts` chunks along the intermediate-size axis (dim 0 for ColumnParallel gate/up_proj, dim 1 for RowParallel down_proj) and route each chunk to `mlp.experts.{n_routed_experts + j}.*` via the FusedMoE expert-aware weight loader.

No changes to FusedMoE / AITER plumbing - all of that landed earlier with vllm-project#39280 (Qwen3-Next FSE).

## Test Plan

* Model: `zai-org/GLM-4.7-FP8`
* Hardware: 1x MI355X node, TP=4
* Container: ROCm vLLM image (AITER >= v0.1.13.post1, PR vllm-project#44265)
* Accuracy: `lm_eval --tasks gsm8k --num_fewshot 5`
* Throughput: `vllm bench serve --dataset-name random` sweep over (ISL, OSL, MC) in {1000/100, 5000/500, 10000/1000} x {4, 16, 64}

Server launch:

```
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=<0|1> \
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 256
```

## Test Result

### Accuracy (gsm8k, 5-shot, exact_match)

| Config              | flexible-extract |     strict-match |
|---------------------|-----------------:|-----------------:|
| FSE=0 (baseline)    | 0.9469 ± 0.0062  | 0.9439 ± 0.0063  |
| FSE=1               | 0.9439 ± 0.0063  | 0.9416 ± 0.0065  |

All deltas within standard error. No accuracy regression.

### Throughput (`vllm bench serve`, random)

| ISL | OSL | MC | TPOT mean (ms) FSE=0 -> FSE=1 (Δ) | TPOT p99 (ms) FSE=0 -> FSE=1 (Δ) | Output tok/s FSE=0 -> FSE=1 (Δ) | Total tok/s FSE=0 -> FSE=1 (Δ) |
|-----:|-----:|---:|----------------------------------:|---------------------------------:|--------------------------------:|-------------------------------:|
| 1000| 100| 4| 17.76 -> 14.36 (**-19.2%**) | 19.43 -> 15.93 (**-18.0%**) | 199.4 -> 243.6 (**+22.1%**) | 2193.7 -> 2679.1 (**+22.1%**) |
| 1000| 100| 16| 20.96 -> 18.48 (**-11.9%**) | 24.29 -> 22.77 (-6.3%) | 631.0 -> 673.4 (**+6.7%**) | 6940.6 -> 7407.9 (**+6.7%**) |
| 1000| 100| 64| 30.74 -> 30.23 (-1.7%) | 42.85 -> 43.44 (+1.4%) | 1452.7 -> 1424.3 (-2.0%) | 15980.1 -> 15667.6 (-2.0%) |
| 5000| 500| 4| 17.82 -> 14.50 (**-18.7%**) | 18.63 -> 15.50 (**-16.8%**) | 211.5 -> 253.5 (**+19.9%**) | 2326.1 -> 2788.7 (**+19.9%**) |
| 5000| 500| 16| 22.73 -> 20.76 (**-8.7%**) | 25.38 -> 23.07 (**-9.1%**) | 619.1 -> 657.7 (**+6.2%**) | 6810.4 -> 7234.6 (**+6.2%**) |
| 5000| 500| 64| 39.79 -> 40.15 (+0.9%) | 46.15 -> 46.78 (+1.4%) | 1363.8 -> 1339.1 (-1.8%) | 15001.9 -> 14730.4 (-1.8%) |
| 10000| 1000| 4| 18.00 -> 14.70 (**-18.3%**) | 18.68 -> 15.50 (**-17.0%**) | 210.3 -> 251.8 (**+19.7%**) | 2313.5 -> 2769.4 (**+19.7%**) |
| 10000| 1000| 16| 24.47 -> 22.87 (-6.5%) | 26.66 -> 25.56 (-4.1%) | 589.6 -> 615.1 (**+4.3%**) | 6485.6 -> 6766.2 (**+4.3%**) |
| 10000| 1000| 64| 46.37 -> 46.33 (-0.1%) | 51.14 -> 51.78 (+1.3%) | 1233.6 -> 1211.9 (-1.8%) | 13570.0 -> 13330.7 (-1.8%) |

Verdict: FSE delivers +20-22% output throughput and -18-19% TPOT at low concurrency (MC=4), modest gains at MC=16, and is roughly break-even (<2% regression) at MC=64. No accuracy regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Olga Miroshnichenko <olga.miroshnichenko@amd.com>
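The shared-expert weight remapping described for `Glm4MoeModel.load_weights` in the GLM-4 commit message could be sketched as follows. Nested lists stand in for tensors, the helper names are hypothetical, and the `.weight` key suffix and the expert count used in the example are assumptions. Only the split axes and the `mlp.experts.{n_routed_experts + j}` target pattern come from the commit message.

```python
def split_dim0(matrix, n_chunks):
    """Split a 2-D list into equal row blocks (dim 0)."""
    size = len(matrix) // n_chunks
    return [matrix[j * size:(j + 1) * size] for j in range(n_chunks)]


def split_dim1(matrix, n_chunks):
    """Split a 2-D list into equal column blocks (dim 1)."""
    size = len(matrix[0]) // n_chunks
    return [[row[j * size:(j + 1) * size] for row in matrix] for j in range(n_chunks)]


def remap_shared_expert(proj_name, weight, n_routed_experts, n_shared_experts):
    """Map one widened shared-expert tensor onto the appended expert slots."""
    # gate_proj / up_proj: intermediate size is dim 0; down_proj: it is dim 1.
    splitter = split_dim1 if proj_name == "down_proj" else split_dim0
    chunks = splitter(weight, n_shared_experts)
    return {
        f"mlp.experts.{n_routed_experts + j}.{proj_name}.weight": chunk
        for j, chunk in enumerate(chunks)
    }


# Toy shapes: hidden = 2, per-expert intermediate = 2, two shared experts.
gate = [[1, 2], [3, 4], [5, 6], [7, 8]]  # [2 * intermediate, hidden]
down = [[1, 2, 3, 4], [5, 6, 7, 8]]      # [hidden, 2 * intermediate]
gate_map = remap_shared_expert("gate_proj", gate, n_routed_experts=160, n_shared_experts=2)
down_map = remap_shared_expert("down_proj", down, n_routed_experts=160, n_shared_experts=2)
```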

nholmber requested review from mgoin, pavanimajety, sighingnow, tjtanaa and vadiklyutiy as code owners April 8, 2026 08:28

mergify Bot added qwen Related to Qwen models rocm Related to AMD ROCm labels Apr 8, 2026

github-project-automation Bot added this to AMD Apr 8, 2026

github-project-automation Bot moved this to Todo in AMD Apr 8, 2026

vadiklyutiy removed their request for review April 8, 2026 08:33

gemini-code-assist Bot reviewed Apr 8, 2026

tjtanaa reviewed Apr 9, 2026

nholmber force-pushed the pr/fse-qwen3next-v2 branch 2 times, most recently from 6493060 to 554600d Compare April 13, 2026 10:56

nholmber requested a review from tjtanaa April 13, 2026 11:10

nholmber force-pushed the pr/fse-qwen3next-v2 branch from 554600d to c49851e Compare April 23, 2026 14:41

mergify Bot added the needs-rebase label Apr 23, 2026

nholmber force-pushed the pr/fse-qwen3next-v2 branch from c49851e to 9110fd9 Compare April 23, 2026 21:50

mergify Bot removed the needs-rebase label Apr 23, 2026

nholmber force-pushed the pr/fse-qwen3next-v2 branch from 9110fd9 to 3344962 Compare April 23, 2026 21:52

tjtanaa requested a review from robertgshaw2-redhat May 4, 2026 00:07

mergify Bot added the needs-rebase label May 4, 2026

nholmber force-pushed the pr/fse-qwen3next-v2 branch from 3344962 to 5e26cf4 Compare May 4, 2026 14:18

mergify Bot removed the needs-rebase label May 4, 2026

nholmber force-pushed the pr/fse-qwen3next-v2 branch from 2293eae to 66fe572 Compare May 4, 2026 22:01

Refactor attempt to introduce AiterSharedRoutedFusedMoERouter

95f11c1

Signed-off-by: Doug Lehr <douglehr@amd.com>

Format _fse_fuse_gate as single-line expression

6349f10

Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

dllehr-amd reviewed May 8, 2026

robertgshaw2-redhat reviewed May 8, 2026

tpopp mentioned this pull request May 8, 2026

[aiter] Qwen3Next shared expert fusion improvements #42088

Open

5 tasks

Merge branch 'main' into pr/fse-qwen3next-v2

096c8fe

robertgshaw2-redhat approved these changes May 8, 2026

robertgshaw2-redhat merged commit 2c6b59b into vllm-project:main May 8, 2026
28 of 80 checks passed

github-project-automation Bot moved this from Todo to Done in AMD May 8, 2026

This was referenced May 20, 2026

[Performance]: Triton fusion for Qwen2/3-MoE shared-expert gate (Qwen2MoeMLP/Qwen3MoeMLP) #43187

Open

[Kernel] Fuse Qwen2/3-MoE shared-expert sigmoid gate into a Triton kernel #43190

Open

peymanr mentioned this pull request May 20, 2026

[RFC] [Kernel] Fuse Qwen2/3-MoE shared-expert sigmoid gate into a Triton kernel peymanr/vllm#25

Open

This was referenced May 28, 2026

[Bug]: [ROCm] MoE inference crashes with older aiter: topk_softmax() expected at most 5 argument(s) but received 7 #43873

Closed

[ROCm][Bugfix] Fix _rocm_aiter_topk_softmax_impl crash on older aiter (5-arg vs 7-arg) #43875

Closed

omirosh mentioned this pull request Jun 2, 2026

[ROCm][Perf] Add Fused Shared Expert (FSE) support for GLM-4.5/6/7 #44313

Open

5 tasks

nholmber mentioned this pull request Jun 3, 2026

[ROCm][Bugfix][Perf] enable shared expert fusion for Qwen3.5 #44434

Open
