[MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)

robertgshaw2-redhat merged 76 commits into main.
Conversation
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Code Review
This pull request refactors the NVFP4 MoE implementation by introducing an "oracle" to centralize backend selection and preparation logic. This significantly cleans up ModelOptNvFp4FusedMoE. However, I've found two critical issues in the refactored code that will cause runtime errors. One is an UnboundLocalError in process_weights_after_loading for non-FlashInfer backends, and the other is an incorrect enum comparison in the apply method which will lead to incorrect backend dispatching. I've provided suggestions to fix both issues.
vllm/model_executor/layers/quantization/modelopt.py (1582-1617)
The variables w13, w13_scale, etc. are only assigned within the if self.nvfp4_backend in FLASHINFER_NVFP4_BACKENDS: block. For other backends like MARLIN or VLLM_CUTLASS, these variables will be undefined, leading to an UnboundLocalError when replace_parameter is called. This will cause a crash when using those backends.
To fix this, the replace_parameter calls should be moved inside the if block where the variables are defined. The other backend paths seem to modify the layer in-place or are yet to be implemented (as per the TODOs), so they shouldn't be calling replace_parameter with undefined variables.
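A minimal, standalone repro of this failure mode (hypothetical names for illustration, not the vLLM code): a variable bound only inside one branch raises `UnboundLocalError` when a later statement reads it on any other path.

```python
def process_weights(backend):
    # w13 is only bound on the flashinfer path
    if backend == "flashinfer":
        w13 = "prepared"
    elif backend == "marlin":
        pass  # this path modifies the layer in place; w13 is never assigned
    # reading w13 here crashes for any non-flashinfer backend
    return w13

try:
    process_weights("marlin")
except UnboundLocalError as e:
    print("crash:", e)
```

Moving the reads inside the branch that binds the variable, as the suggestion below does, removes the crash without changing the other backend paths.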
```python
if self.nvfp4_backend in FLASHINFER_NVFP4_BACKENDS:
    (
        w13,
        w13_scale,
        w13_scale_2,
        a13_scale,
        w2,
        w2_scale,
        a2_scale,
    ) = prepare_nvfp4_moe_layer_for_fi(
        backend=self.nvfp4_backend,
        layer=layer,
        w13=layer.w13_weight,
        w13_scale=layer.w13_weight_scale,
        w13_scale_2=layer.w13_weight_scale_2,
        a13_scale=layer.w13_input_scale,
        w2=layer.w2_weight,
        w2_scale=layer.w2_weight_scale,
        a2_scale=layer.w2_input_scale,
        is_act_and_mul=self.moe.is_act_and_mul,
        is_global_sf=self.use_global_sf,
    )
    replace_parameter(layer, "w13_weight", w13)
    replace_parameter(layer, "w13_weight_scale", w13_scale)
    replace_parameter(layer, "w13_weight_scale_2", w13_scale_2)
    replace_parameter(layer, "w2_weight", w2)
    replace_parameter(layer, "w2_weight_scale", w2_scale)
    replace_parameter(layer, "w13_input_scale", a13_scale)
    replace_parameter(layer, "w2_input_scale", a2_scale)
elif self.nvfp4_backend == NvFp4MoeBackend.MARLIN:
    # TODO(rob): update marlin prepare to match fp8 moe.
    prepare_moe_fp4_layer_for_marlin(layer)
else:
    # TODO(rob): need to do the swizzling here.
    pass
```

vllm/model_executor/layers/quantization/modelopt.py (1726-1755)
There's an incorrect enum comparison here. self.nvfp4_backend is of type NvFp4MoeBackend, but it's being compared with members of FlashinferMoeBackend. This will cause the conditions for CUTLASS and CUTEDSL backends to always evaluate to false, leading to incorrect kernel dispatch.
You should compare against the correct enum members from NvFp4MoeBackend.
```python
elif self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTLASS:
    from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (  # noqa: E501
        flashinfer_cutlass_moe_fp4,
    )
    assert self.moe_quant_config is not None
    return flashinfer_cutlass_moe_fp4(
        hidden_states=x,
        w1=layer.w13_weight,
        w2=layer.w2_weight,
        topk_weights=topk_weights,
        topk_ids=topk_ids,
        quant_config=self.moe_quant_config,
        inplace=False,
        activation=layer.activation,
        global_num_experts=layer.global_num_experts,
        expert_map=layer.expert_map,
        apply_router_weight_on_input=layer.apply_router_weight_on_input,
    )
elif self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTEDSL:
    from vllm.model_executor.layers.fused_moe.flashinfer_cutedsl_moe import (  # noqa: E501
        flashinfer_cutedsl_moe_fp4,
    )
    assert self.moe_quant_config is not None
    return flashinfer_cutedsl_moe_fp4(
        hidden_states=x,
        w1=layer.w13_weight,
        w2=layer.w2_weight,
```
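The dispatch bug comes from a general Python pitfall: members of two different `Enum` classes never compare equal, even when their values match, so the misused comparison is always `False` and the branch can never be taken. A standalone sketch (the enum values here are invented for illustration):

```python
from enum import Enum

class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"

class FlashinferMoeBackend(Enum):
    CUTLASS = "flashinfer_cutlass"

backend = NvFp4MoeBackend.FLASHINFER_CUTLASS

# Cross-enum comparison: always False, even with identical values
print(backend == FlashinferMoeBackend.CUTLASS)        # False
# Comparing within the same enum class works as intended
print(backend == NvFp4MoeBackend.FLASHINFER_CUTLASS)  # True
```

This is also why such bugs pass silently at runtime: the comparison is legal, it just never matches.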
```python
if (
    not self.moe.is_act_and_mul
    and not self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTLASS
):
    raise NotImplementedError(
        "Non-gated activations are only supported by FlashInfer "
        "CUTLASS NvFP4 MoE backend."
    )
```
Shouldn't we put this check into select_nvfp4_moe_backend itself?
Yes, in the end state the oracle will hold all of this logic.
I added a TODO; this is the next phase of work.
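A hypothetical sketch of what folding this check into the oracle could look like; the function body and signature are invented for illustration, and only the error message comes from the diff above.

```python
from enum import Enum

class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    FLASHINFER_CUTEDSL = "flashinfer_cutedsl"
    MARLIN = "marlin"
    VLLM_CUTLASS = "vllm_cutlass"

def select_nvfp4_moe_backend(preferred, is_act_and_mul):
    # Hypothetical oracle: validate backend constraints once, at
    # selection time, instead of scattering checks across apply().
    if not is_act_and_mul and preferred != NvFp4MoeBackend.FLASHINFER_CUTLASS:
        raise NotImplementedError(
            "Non-gated activations are only supported by FlashInfer "
            "CUTLASS NvFP4 MoE backend."
        )
    return preferred
```

Centralizing validation this way means callers get an unambiguous failure at model load rather than a kernel-dispatch surprise later.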
This pull request has merge conflicts that must be resolved before it can be merged.
mgoin left a comment:

LGTM! Appreciate the comments left in places for future work; I think this is a clear improvement now.
Thanks, will rebase and merge.
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
The MoE refactor (vllm-project#31692) changed expanded input scale tensors from intermediates to stored parameters. torch.expand() creates non-contiguous stride-0 views, which causes EPLB's get_expert_weights() contiguity assertion to fail. Add .contiguous() to the expand() calls since this only copies ~144 bytes per layer at model load time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com>
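The stride-0 behavior behind that fix can be seen directly; this is a standalone illustration of `torch.expand()` semantics, not the vLLM code (the tensor shape here is invented).

```python
import torch

# expand() returns a view that repeats data by setting stride to 0:
# no memory is copied, so the result is not contiguous.
scale = torch.ones(1, dtype=torch.float32)
expanded = scale.expand(8)

print(expanded.stride())         # (0,)
print(expanded.is_contiguous())  # False

# .contiguous() materializes a real copy (tiny: 8 * 4 bytes here),
# which satisfies downstream contiguity assertions.
fixed = expanded.contiguous()
print(fixed.is_contiguous())     # True
```

Because the copy happens once at model load and the scale tensors are only a few elements per layer, the extra memory is negligible.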
Purpose

Apply the refactor to the NVFP4 integrations. Key steps:

- `MarlinExperts` for NVFP4
- `trtllm` kernels

Test Plan