[MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)

robertgshaw2-redhat merged 76 commits into main.
Conversation
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Code Review
This pull request refactors the NVFP4 MoE implementation by introducing an "oracle" to centralize backend selection and preparation logic. This significantly cleans up ModelOptNvFp4FusedMoE. However, I've found two critical issues in the refactored code that will cause runtime errors. One is an UnboundLocalError in process_weights_after_loading for non-FlashInfer backends, and the other is an incorrect enum comparison in the apply method which will lead to incorrect backend dispatching. I've provided suggestions to fix both issues.
vllm/model_executor/layers/quantization/modelopt.py (1582-1617)
The variables w13, w13_scale, etc. are only assigned within the if self.nvfp4_backend in FLASHINFER_NVFP4_BACKENDS: block. For other backends like MARLIN or VLLM_CUTLASS, these variables will be undefined, leading to an UnboundLocalError when replace_parameter is called. This will cause a crash when using those backends.
To fix this, the replace_parameter calls should be moved inside the if block where the variables are defined. The other backend paths seem to modify the layer in-place or are yet to be implemented (as per the TODOs), so they shouldn't be calling replace_parameter with undefined variables.
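A minimal, standalone repro of this failure mode (hypothetical names for illustration, not the vLLM code): a variable bound only inside one branch raises `UnboundLocalError` when a later statement reads it on any other path.

```python
def process_weights(backend):
    # w13 is only bound on the flashinfer path
    if backend == "flashinfer":
        w13 = "prepared"
    elif backend == "marlin":
        pass  # this path modifies the layer in place; w13 is never assigned
    # reading w13 here crashes for any non-flashinfer backend
    return w13

try:
    process_weights("marlin")
except UnboundLocalError as e:
    print("crash:", e)
```

Moving the reads inside the branch that binds the variable, as the suggestion below does, removes the crash without changing the other backend paths.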
```python
if self.nvfp4_backend in FLASHINFER_NVFP4_BACKENDS:
    (
        w13,
        w13_scale,
        w13_scale_2,
        a13_scale,
        w2,
        w2_scale,
        a2_scale,
    ) = prepare_nvfp4_moe_layer_for_fi(
        backend=self.nvfp4_backend,
        layer=layer,
        w13=layer.w13_weight,
        w13_scale=layer.w13_weight_scale,
        w13_scale_2=layer.w13_weight_scale_2,
        a13_scale=layer.w13_input_scale,
        w2=layer.w2_weight,
        w2_scale=layer.w2_weight_scale,
        a2_scale=layer.w2_input_scale,
        is_act_and_mul=self.moe.is_act_and_mul,
        is_global_sf=self.use_global_sf,
    )
    replace_parameter(layer, "w13_weight", w13)
    replace_parameter(layer, "w13_weight_scale", w13_scale)
    replace_parameter(layer, "w13_weight_scale_2", w13_scale_2)
    replace_parameter(layer, "w2_weight", w2)
    replace_parameter(layer, "w2_weight_scale", w2_scale)
    replace_parameter(layer, "w13_input_scale", a13_scale)
    replace_parameter(layer, "w2_input_scale", a2_scale)
elif self.nvfp4_backend == NvFp4MoeBackend.MARLIN:
    # TODO(rob): update marlin prepare to match fp8 moe.
    prepare_moe_fp4_layer_for_marlin(layer)
else:
    # TODO(rob): need to do the swizzling here.
    pass
```

vllm/model_executor/layers/quantization/modelopt.py (1726-1755)
There's an incorrect enum comparison here. self.nvfp4_backend is of type NvFp4MoeBackend, but it's being compared with members of FlashinferMoeBackend. This will cause the conditions for CUTLASS and CUTEDSL backends to always evaluate to false, leading to incorrect kernel dispatch.
You should compare against the correct enum members from NvFp4MoeBackend.
```python
elif self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTLASS:
    from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (  # noqa: E501
        flashinfer_cutlass_moe_fp4,
    )
    assert self.moe_quant_config is not None
    return flashinfer_cutlass_moe_fp4(
        hidden_states=x,
        w1=layer.w13_weight,
        w2=layer.w2_weight,
        topk_weights=topk_weights,
        topk_ids=topk_ids,
        quant_config=self.moe_quant_config,
        inplace=False,
        activation=layer.activation,
        global_num_experts=layer.global_num_experts,
        expert_map=layer.expert_map,
        apply_router_weight_on_input=layer.apply_router_weight_on_input,
    )
elif self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTEDSL:
    from vllm.model_executor.layers.fused_moe.flashinfer_cutedsl_moe import (  # noqa: E501
        flashinfer_cutedsl_moe_fp4,
    )
    assert self.moe_quant_config is not None
    return flashinfer_cutedsl_moe_fp4(
        hidden_states=x,
        w1=layer.w13_weight,
        w2=layer.w2_weight,
```
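The dispatch bug comes from a general Python pitfall: members of two different `Enum` classes never compare equal, even when their values match, so the misused comparison is always `False` and the branch can never be taken. A standalone sketch (the enum values here are invented for illustration):

```python
from enum import Enum

class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"

class FlashinferMoeBackend(Enum):
    CUTLASS = "flashinfer_cutlass"

backend = NvFp4MoeBackend.FLASHINFER_CUTLASS

# Cross-enum comparison: always False, even with identical values
print(backend == FlashinferMoeBackend.CUTLASS)        # False
# Comparing within the same enum class works as intended
print(backend == NvFp4MoeBackend.FLASHINFER_CUTLASS)  # True
```

This is also why such bugs pass silently at runtime: the comparison is legal, it just never matches.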
```python
if (
    not self.moe.is_act_and_mul
    and not self.nvfp4_backend == NvFp4MoeBackend.FLASHINFER_CUTLASS
):
    raise NotImplementedError(
        "Non-gated activations are only supported by FlashInfer "
        "CUTLASS NvFP4 MoE backend."
    )
```
Shouldn't we put this check into select_nvfp4_moe_backend itself?
Yes, in the end state the oracle will hold all of this logic.
I added a TODO; this is the next phase of work.
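A hypothetical sketch of what folding this check into the oracle could look like; the function body and signature are invented for illustration, and only the error message comes from the diff above.

```python
from enum import Enum

class NvFp4MoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    FLASHINFER_CUTEDSL = "flashinfer_cutedsl"
    MARLIN = "marlin"
    VLLM_CUTLASS = "vllm_cutlass"

def select_nvfp4_moe_backend(preferred, is_act_and_mul):
    # Hypothetical oracle: validate backend constraints once, at
    # selection time, instead of scattering checks across apply().
    if not is_act_and_mul and preferred != NvFp4MoeBackend.FLASHINFER_CUTLASS:
        raise NotImplementedError(
            "Non-gated activations are only supported by FlashInfer "
            "CUTLASS NvFP4 MoE backend."
        )
    return preferred
```

Centralizing validation this way means callers get an unambiguous failure at model load rather than a kernel-dispatch surprise later.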
This pull request has merge conflicts that must be resolved before it can be merged.
mgoin left a comment:

LGTM! Appreciate the comments left in places for future work; I think this is a clear improvement now.
Thanks, will rebase and merge.
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
The MoE refactor (vllm-project#31692) changed expanded input scale tensors from intermediates to stored parameters. torch.expand() creates non-contiguous stride-0 views, which causes EPLB's get_expert_weights() contiguity assertion to fail. Add .contiguous() to the expand() calls since this only copies ~144 bytes per layer at model load time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com>
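The stride-0 behavior behind that fix can be seen directly; this is a standalone illustration of `torch.expand()` semantics, not the vLLM code (the tensor shape here is invented).

```python
import torch

# expand() returns a view that repeats data by setting stride to 0:
# no memory is copied, so the result is not contiguous.
scale = torch.ones(1, dtype=torch.float32)
expanded = scale.expand(8)

print(expanded.stride())         # (0,)
print(expanded.is_contiguous())  # False

# .contiguous() materializes a real copy (tiny: 8 * 4 bytes here),
# which satisfies downstream contiguity assertions.
fixed = expanded.contiguous()
print(fixed.is_contiguous())     # True
```

Because the copy happens once at model load and the scale tensors are only a few elements per layer, the extra memory is negligible.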
Purpose

Apply the refactor to the NVFP4 integrations. Key steps:

- `MarlinExperts` for NVFP4
- `trtllm` kernels

Test Plan