
[MoE Refactor][16/N] Apply Refactor to NVFP4 #31692

Merged
robertgshaw2-redhat merged 76 commits into main from nvfp4-refactor
Jan 8, 2026

Conversation

@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat commented Jan 5, 2026

Purpose

Apply the refactor to the NVFP4 integrations. Key steps:

  • support NVFP4 in MarlinExperts
  • use modular kernels (mks) for all kernels except the TRTLLM kernels
  • create an oracle for centralized kernel selection
  • factor out process_weights_after_loading for sharing between compressed-tensors (ct) and ModelOpt
  • create the kernel in process_weights_after_loading and call the modular kernel in apply
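The "oracle" pattern above can be sketched as follows. This is a minimal illustration of centralized kernel selection, not vLLM's actual logic: the enum member names mirror the PR, but the function signature and the selection criteria (`has_flashinfer`, `prefer_marlin`) are illustrative placeholders.

```python
# Minimal sketch of the "oracle" pattern: one function owns backend
# selection, so call sites never duplicate dispatch logic.
from enum import Enum


class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    FLASHINFER_CUTEDSL = "flashinfer_cutedsl"
    MARLIN = "marlin"
    VLLM_CUTLASS = "vllm_cutlass"


def select_nvfp4_moe_backend(
    has_flashinfer: bool, prefer_marlin: bool
) -> NvFp4MoeBackend:
    """Centralized kernel selection: every caller asks the oracle."""
    if has_flashinfer:
        return NvFp4MoeBackend.FLASHINFER_CUTLASS
    if prefer_marlin:
        return NvFp4MoeBackend.MARLIN
    return NvFp4MoeBackend.VLLM_CUTLASS
```

Because both process_weights_after_loading and apply consult the same oracle, the backend chosen at weight-preparation time is guaranteed to match the backend dispatched at forward time.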

Test Plan

  • CI (see the MoE refactor jobs, which run through all permutations)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Robert Shaw added 2 commits January 4, 2026 16:48
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
nit
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the NVFP4 MoE implementation by introducing an "oracle" to centralize backend selection and preparation logic. This significantly cleans up ModelOptNvFp4FusedMoE. However, I've found two critical issues in the refactored code that will cause runtime errors. One is an UnboundLocalError in process_weights_after_loading for non-FlashInfer backends, and the other is an incorrect enum comparison in the apply method which will lead to incorrect backend dispatching. I've provided suggestions to fix both issues.

I am having trouble creating individual review comments, so my feedback is inline below.

vllm/model_executor/layers/quantization/modelopt.py (1582-1617)

critical

The variables w13, w13_scale, etc. are only assigned within the if self.nvfp4_backend in FLASHINFER_NVFP4_BACKENDS: block. For other backends like MARLIN or VLLM_CUTLASS, these variables will be undefined, leading to an UnboundLocalError when replace_parameter is called. This will cause a crash when using those backends.

To fix this, the replace_parameter calls should be moved inside the if block where the variables are defined. The other backend paths seem to modify the layer in-place or are yet to be implemented (as per the TODOs), so they shouldn't be calling replace_parameter with undefined variables.

        if self.nvfp4_backend in FLASHINFER_NVFP4_BACKENDS:
            (
                w13,
                w13_scale,
                w13_scale_2,
                a13_scale,
                w2,
                w2_scale,
                a2_scale,
            ) = prepare_nvfp4_moe_layer_for_fi(
                backend=self.nvfp4_backend,
                layer=layer,
                w13=layer.w13_weight,
                w13_scale=layer.w13_weight_scale,
                w13_scale_2=layer.w13_weight_scale_2,
                a13_scale=layer.w13_input_scale,
                w2=layer.w2_weight,
                w2_scale=layer.w2_weight_scale,
                a2_scale=layer.w2_input_scale,
                is_act_and_mul=self.moe.is_act_and_mul,
                is_global_sf=self.use_global_sf,
            )
            replace_parameter(layer, "w13_weight", w13)
            replace_parameter(layer, "w13_weight_scale", w13_scale)
            replace_parameter(layer, "w13_weight_scale_2", w13_scale_2)
            replace_parameter(layer, "w2_weight", w2)
            replace_parameter(layer, "w2_weight_scale", w2_scale)
            replace_parameter(layer, "w13_input_scale", a13_scale)
            replace_parameter(layer, "w2_input_scale", a2_scale)
        elif self.nvfp4_backend == NvFp4MoeBackend.MARLIN:
            # TODO(rob): update marlin prepare to match fp8 moe.
            prepare_moe_fp4_layer_for_marlin(layer)
        else:
            # TODO(rob): need to do the swizzling here.
            pass
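The bug class flagged above is easy to reproduce in isolation. This toy sketch (with a simplified string backend and a stand-in name for the layer parameters) shows how a variable assigned only on one branch raises UnboundLocalError on every other path:

```python
# Toy reproduction of the UnboundLocalError: `w13` is only bound on the
# flashinfer branch, so returning it on any other path crashes.
def process_weights(backend: str):
    if backend == "flashinfer":
        w13 = "prepared-for-flashinfer"
    # BUG: for any other backend, w13 was never bound.
    return w13


try:
    process_weights("marlin")
    crashed = False
except UnboundLocalError:
    crashed = True
```

Moving the use of `w13` inside the branch that binds it (as the suggestion above does with the replace_parameter calls) removes the crash.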

vllm/model_executor/layers/quantization/modelopt.py (1726-1755)

critical

There's an incorrect enum comparison here. self.nvfp4_backend is of type NvFp4MoeBackend, but it's being compared with members of FlashinferMoeBackend. This will cause the conditions for CUTLASS and CUTEDSL backends to always evaluate to false, leading to incorrect kernel dispatch.

You should compare against the correct enum members from NvFp4MoeBackend.

        elif self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTLASS:
            from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (  # noqa: E501
                flashinfer_cutlass_moe_fp4,
            )

            assert self.moe_quant_config is not None
            return flashinfer_cutlass_moe_fp4(
                hidden_states=x,
                w1=layer.w13_weight,
                w2=layer.w2_weight,
                topk_weights=topk_weights,
                topk_ids=topk_ids,
                quant_config=self.moe_quant_config,
                inplace=False,
                activation=layer.activation,
                global_num_experts=layer.global_num_experts,
                expert_map=layer.expert_map,
                apply_router_weight_on_input=layer.apply_router_weight_on_input,
            )

        elif self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTEDSL:
            from vllm.model_executor.layers.fused_moe.flashinfer_cutedsl_moe import (  # noqa: E501
                flashinfer_cutedsl_moe_fp4,
            )

            assert self.moe_quant_config is not None
            return flashinfer_cutedsl_moe_fp4(
                hidden_states=x,
                w1=layer.w13_weight,
                w2=layer.w2_weight,

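The enum-mismatch pitfall flagged above can be demonstrated in a few lines: members of two distinct Enum classes never compare equal, even when their names and values coincide, so the affected branches silently never fire. The class names mirror the PR; the member and value here are illustrative.

```python
# Members of different Enum classes compare unequal even with identical
# names and values, so a cross-enum `==` branch can never be taken.
from enum import Enum


class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = 1


class FlashinferMoeBackend(Enum):
    FLASHINFER_CUTLASS = 1  # same name and value, different class


wrong = NvFp4MoeBackend.FLASHINFER_CUTLASS == FlashinferMoeBackend.FLASHINFER_CUTLASS
right = NvFp4MoeBackend.FLASHINFER_CUTLASS == NvFp4MoeBackend.FLASHINFER_CUTLASS
```

This is why the fix is to compare self.nvfp4_backend only against NvFp4MoeBackend members.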
Robert Shaw added 24 commits January 4, 2026 19:27
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
…(it only does batched)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 7, 2026
@robertgshaw2-redhat robertgshaw2-redhat moved this from In progress to In review in MoE Refactor Jan 7, 2026
Member

@zyongye zyongye left a comment


Nice and clean!

Comment on lines +1453 to 1460
if (
not self.moe.is_act_and_mul
and not self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTLASS
):
raise NotImplementedError(
"Non-gated activations are only supported by FlashInfer "
"CUTLASS NvFP4 MoE backend."
)
Member


Shouldn't we put this check into select_nvfp4_moe_backend itself?

Collaborator Author


Yes, in the end state the oracle will hold all this logic.

I added a TODO; this is the next phase of work.
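The follow-up the reviewer suggests can be sketched like this: fold the gated-activation check into the selection oracle, so invalid combinations fail at one choke point instead of at every call site. The function and enum names mirror the PR; the signature and selection criteria are illustrative placeholders, not vLLM's code.

```python
# Sketch: validation lives with selection, so every invalid combination
# is rejected in one place. Criteria are illustrative placeholders.
from enum import Enum


class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    MARLIN = "marlin"


def select_nvfp4_moe_backend(
    is_act_and_mul: bool, has_flashinfer: bool
) -> NvFp4MoeBackend:
    backend = (
        NvFp4MoeBackend.FLASHINFER_CUTLASS
        if has_flashinfer
        else NvFp4MoeBackend.MARLIN
    )
    # The check from lines 1453-1460, moved into the oracle itself.
    if not is_act_and_mul and backend is not NvFp4MoeBackend.FLASHINFER_CUTLASS:
        raise NotImplementedError(
            "Non-gated activations are only supported by FlashInfer "
            "CUTLASS NvFP4 MoE backend."
        )
    return backend
```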

@mergify

mergify bot commented Jan 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 8, 2026
Member

@mgoin mgoin left a comment


LGTM! Appreciate the comments left in places for future work, I think this is a clear improvement now

@robertgshaw2-redhat
Collaborator Author

thanks, will rebase and merge

Robert Shaw and others added 2 commits January 7, 2026 20:48
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Jan 8, 2026
Robert Shaw and others added 4 commits January 7, 2026 20:50
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) January 8, 2026 02:03
@robertgshaw2-redhat robertgshaw2-redhat merged commit 9f6dcb7 into main Jan 8, 2026
67 checks passed
@robertgshaw2-redhat robertgshaw2-redhat deleted the nvfp4-refactor branch January 8, 2026 03:46
@github-project-automation github-project-automation bot moved this from In review to Done in MoE Refactor Jan 8, 2026
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 8, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
@jiahanc jiahanc mentioned this pull request Jan 14, 2026
5 tasks
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
jasonlizhengjian added a commit to jasonlizhengjian/vllm that referenced this pull request Feb 25, 2026
The MoE refactor (vllm-project#31692) changed expanded input scale tensors from
intermediates to stored parameters. torch.expand() creates non-contiguous
stride-0 views, which causes EPLB's get_expert_weights() contiguity
assertion to fail. Add .contiguous() to the expand() calls since this
only copies ~144 bytes per layer at model load time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Jason Li <jasonlizhengjian@gmail.com>
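The contiguity issue that commit describes can be illustrated with NumPy as a stand-in for torch (np.broadcast_to, like torch.expand, returns a stride-0 view rather than copying; np.ascontiguousarray plays the role of .contiguous()). This is an analogous sketch, not the actual EPLB code path:

```python
# A broadcast/expand of a stored scale produces a stride-0 view that
# fails contiguity checks; forcing a copy makes it contiguous.
import numpy as np

scale = np.array([0.5], dtype=np.float32)   # one stored scalar scale
expanded = np.broadcast_to(scale, (4, 8))   # view over the same byte
assert expanded.strides == (0, 0)           # stride-0: no data was copied
assert not expanded.flags["C_CONTIGUOUS"]   # trips a contiguity assertion

fixed = np.ascontiguousarray(expanded)      # real, contiguous copy
```

The copy is tiny (here 4 x 8 x 4 bytes), which is why paying it once at model load time is cheap.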

Labels

  • documentation: Improvements or additions to documentation
  • llama: Related to Llama models
  • nvidia
  • performance: Performance-related issues
  • ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants