
[MoE Refactor][15/N] Apply Refactor to Fp8#31415

Merged

mgoin merged 213 commits into main from apply-refactor-to-ct on Jan 8, 2026
Conversation

@robertgshaw2-redhat (Collaborator) commented Dec 27, 2025

Purpose

SUMMARY:

  • continue refactoring the fp8 MoE path
  • CUTLASS MoE: make CutlassExpertsFp8 manage the strides data
  • CUTLASS MoE: remove the cutlass_moe_fp8 entrypoint
  • factor MoE kernel selection out into a new module
  • factor out process_weights_after_loading utilities so they are shared by fp8 and compressed-tensors
  • apply the modular-kernel (mk) refactor to compressed-tensors
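For context, the modular-kernel ("mk") pattern referenced above composes a prepare/finalize stage with an experts compute stage. The sketch below is illustrative only: the class names PrepareAndFinalize, Experts, and ModularKernel are stand-ins for the corresponding vLLM abstractions, and the compute is a dummy.

```python
# Illustrative sketch of the modular-kernel composition pattern:
# a prepare/finalize stage wraps an experts compute stage.
from dataclasses import dataclass


class PrepareAndFinalize:
    """Handles dispatch/quantization before the experts run and
    un-permutation/reduction afterwards (illustrative no-ops here)."""

    def prepare(self, hidden_states):
        return hidden_states  # e.g. quantize + permute tokens per expert

    def finalize(self, expert_out):
        return expert_out  # e.g. unpermute + weighted reduce


class Experts:
    """Stand-in for the grouped expert compute (e.g. CUTLASS fp8 GEMMs)."""

    def apply(self, x):
        return [v * 2 for v in x]  # dummy compute for illustration


@dataclass
class ModularKernel:
    prepare_finalize: PrepareAndFinalize
    experts: Experts

    def forward(self, hidden_states):
        x = self.prepare_finalize.prepare(hidden_states)
        y = self.experts.apply(x)
        return self.prepare_finalize.finalize(y)


mk = ModularKernel(PrepareAndFinalize(), Experts())
print(mk.forward([1, 2, 3]))  # [2, 4, 6]
```

The value of the split is that one experts implementation (Triton, CUTLASS, Marlin, ...) can be paired with different prepare/finalize strategies (no-EP, all-to-all, etc.) without duplicating either side.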

Test Plan

  • unit tests

```shell
pytest -v -x tests/kernels/moe/test_cutlass_moe.py
```

  • benchmark scripts (note: these don't work on main)

```shell
python3 ./benchmark_grouped_gemm_cutlass.py --model "deepseek-ai/DeepSeek-V2-Lite" --tp-sizes 4 --batch-sizes 2 4

python benchmark_cutlass_moe_fp8.py \
    --model "Llama-4-Maverick-17B-128E-Instruct-FP8" \
    --tp-sizes 8 \
    --batch-size 2 4 8 \
    --per-act-token-opts false \
    --per-out-ch-opts false
```

  • moe refactor integration tests

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Robert Shaw and others added 30 commits December 22, 2025 18:03
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Robert Shaw added 3 commits January 7, 2026 18:02
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Robert Shaw and others added 2 commits January 7, 2026 19:35
Signed-off-by: Robert Shaw <robshaw@redhat.com>
```diff
 num_questions: 1319
 num_fewshot: 5
-server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
+server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2 --disable-uvicorn-access-log"
```
Member
What is this? Do you just want to add it to the eval script here?

```python
# Add standard server arguments
server_args.extend(
    [
        "--trust-remote-code",
    ]
)
```

Collaborator (Author)

Yeah, that's a good idea. It makes the logs much easier to read by not logging /completions on every request.

Comment on lines +273 to +294
```python
# Delayed import is required since the oracle is imported
# by CPU backends which cannot import all of these experts.
# TODO: update the experts to make this not happen.
from vllm.model_executor.layers.fused_moe import (
    TritonExperts,
    TritonOrDeepGemmExperts,
)
from vllm.model_executor.layers.fused_moe.cutlass_moe import (
    CutlassExpertsFp8,
)
from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
    FlashInferExperts,
)
from vllm.model_executor.layers.fused_moe.fused_marlin_moe import (
    MarlinExperts,
)
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
    MoEPrepareAndFinalizeNoEP,
)
from vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe import (
    AiterExperts,
)
```
Member
I think we should just put each import within each conditional, rather than importing all
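The shape of that suggestion — pushing each delayed import into the branch that actually needs it, so a backend never imports experts it cannot load — might look like the sketch below. It is illustrative, not vLLM code: the function name, the backend strings, and the stdlib stand-in modules are all placeholders (real code would import the experts classes shown in the snippet above).

```python
# Hedged sketch: per-branch delayed imports instead of importing all
# experts implementations up front. The stdlib modules stand in for
# the real experts classes so this sketch runs anywhere.
def select_experts(backend: str):
    if backend == "cutlass":
        # In vLLM this branch would do:
        # from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8
        import math as impl  # stand-in module for illustration
        return impl
    if backend == "marlin":
        # from vllm.model_executor.layers.fused_moe.fused_marlin_moe import MarlinExperts
        import json as impl  # stand-in module for illustration
        return impl
    # Default branch, e.g. TritonExperts.
    import re as impl  # stand-in module for illustration
    return impl


print(select_experts("cutlass").__name__)  # math
```

With this layout, a CPU backend that only ever hits the default branch never executes the CUTLASS or Marlin imports at all, which is the property the delayed-import comment is after.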

Collaborator

@bnellnm left a comment

Nice refactor!

Robert Shaw added 4 commits January 7, 2026 20:49
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Member

@zyongye left a comment

The overall structure LGTM. But we will need to revisit the fallback experts abstraction in the future as discussed.

```python
assert w2_input_scale is not None

rotate_weights_for_fi_trtllm_fp8_per_tensor_moe(w13, w2)
register_scales_for_trtllm_fp8_per_tensor_moe(
```
Member

Should we register this inside the FusedMoEMethod instead of in the utils?

Collaborator (Author)

There is some work underway to make TRTLLM a modular kernel. Once that is done, we can revisit this.

Signed-off-by: Robert Shaw <robshaw@redhat.com>
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 8, 2026
@mgoin (Member) commented Jan 8, 2026

Great work!

@mgoin mgoin merged commit 5dcd7ef into main Jan 8, 2026
65 checks passed
@mgoin mgoin deleted the apply-refactor-to-ct branch January 8, 2026 00:42
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 8, 2026
@github-project-automation github-project-automation bot moved this from In review to Done in MoE Refactor Jan 8, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

  • ci/build
  • documentation: Improvements or additions to documentation
  • llama: Related to Llama models
  • nvidia
  • performance: Performance-related issues
  • ready: ONLY add when PR is ready to merge / full CI is needed

Projects

Status: Done

Development


4 participants