[Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) #31983
danisereb wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a fix for non-gated Mixture-of-Experts (MoE) models, specifically for the ModelOptFp8MoEMethod. The changes correctly propagate the is_act_and_mul flag from the FusedMoE layer configuration down to the quantization configuration and backend selection logic. This ensures that models like Nemotron, which use non-fused activations, are handled correctly by the FP8 MoE kernels, particularly when using the Triton backend. The changes are well-contained and maintain backward compatibility by defaulting is_act_and_mul to True. The implementation appears correct and addresses the reported issue.
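The backward-compatible defaulting described above can be sketched as follows. The class and function names here are illustrative only, not vLLM's actual API: the point is that the flag defaults to True so existing gated-MoE call sites need no changes.

```python
from dataclasses import dataclass

# Illustrative sketch (not vLLM's real classes): threading an
# is_act_and_mul flag from the layer config into the quant config,
# defaulting to True so existing gated-MoE callers are unaffected.
@dataclass
class MoEQuantConfig:
    block_quant: bool = False
    # Gated MoE fuses the gate and up projections ("act and mul");
    # non-gated models such as Nemotron use a single projection.
    is_act_and_mul: bool = True

def make_quant_config(is_act_and_mul: bool = True) -> MoEQuantConfig:
    return MoEQuantConfig(is_act_and_mul=is_act_and_mul)

# Existing gated callers keep working with no code changes:
gated = make_quant_config()
# The non-gated path passes the layer's flag through explicitly:
non_gated = make_quant_config(is_act_and_mul=False)
print(gated.is_act_and_mul, non_gated.is_act_and_mul)  # True False
```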
Can you please add this model to the CI/CD? For example:
Hi @danisereb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Manual update about the pre-commit failure:
Force-pushed 4533a69 to 85de95e
Force-pushed 85de95e to 360a708
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Force-pushed 0110619 to 7d08280
@@ -0,0 +1,5 @@
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
This looks like a BF16 model, not FP8?
This was discussed with us in the Slack channel.
This PR can be closed.
This pull request has merge conflicts that must be resolved before it can be merged.
block_quant=False,
tp_size=moe_config.moe_parallel_config.tp_size,
with_lora_support=self.moe.is_lora_enabled,
is_act_and_mul=self.moe.is_act_and_mul,
Purpose
vLLM serve command that fails:
The following backend should be used:
But vLLM fails:
Note

The backend VLLM_USE_FLASHINFER_MOE_FP8=1 was fixed in this PR: #31960
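As a rough sketch of the intended behavior, backend selection should see the gating layout so that a backend without a non-gated kernel is skipped. The function and enum names below are illustrative, not vLLM's actual select_fp8_moe_backend, and the assumption that the FlashInfer path only handles the fused layout is made purely for this sketch.

```python
from enum import Enum

class Fp8MoeBackend(Enum):
    FLASHINFER = "flashinfer"
    TRITON = "triton"

def select_backend(flashinfer_available: bool, is_act_and_mul: bool) -> Fp8MoeBackend:
    # Assumption for this sketch: the FlashInfer path only handles the
    # fused gate-up ("act and mul") layout, so non-gated layers fall
    # back to the Triton kernels instead of crashing at runtime.
    if flashinfer_available and is_act_and_mul:
        return Fp8MoeBackend.FLASHINFER
    return Fp8MoeBackend.TRITON

print(select_backend(True, False).value)  # triton
```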
Test Plan
Run a basic lm_eval test with two configs.

Config based on the recipe (https://docs.vllm.ai/projects/recipes/en/latest/NVIDIA/Nemotron-3-Nano-30B-A3B.html#launch-the-vllm-server):

And with Triton:

Results should be similar to an older commit that did not fail/crash (1ab055e).
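The "results should be similar" criterion can be made concrete with a small helper that compares accuracies within a tolerance. The tolerance value and the scores below are arbitrary choices for illustration, not numbers from this PR.

```python
def scores_match(new_acc: float, baseline_acc: float, tol: float = 0.02) -> bool:
    # Treat the run as passing if the GSM8K accuracy is within `tol`
    # of the accuracy measured at the known-good baseline commit.
    return abs(new_acc - baseline_acc) <= tol

# Hypothetical numbers purely for illustration:
print(scores_match(0.84, 0.85))  # True
print(scores_match(0.70, 0.85))  # False
```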
Test Result
Test results are OK.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.

Note
Ensures FP8 MoE works for non-gated (no gate-up fusion) paths and fixes Triton execution shape assumptions.

- Passes is_act_and_mul into fp8_w8a8_moe_quant_config, make_fp8_moe_quant_config, and select_fp8_moe_backend so kernels/quant config align with gated vs non-gated layouts
- Threads is_act_and_mul into backend selection and quant-config creation to avoid mismatched dimensions at runtime
- Adds NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-triton.yaml and references it in config-b200.txt for GSM8K evals

Written by Cursor Bugbot for commit ed92436863414a4572046962938013b858a8b51e. This will update automatically on new commits.
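The "mismatched dimensions" failure mode comes from the packed layout of the first expert projection. A minimal sketch, with illustrative names and sizes (not vLLM's actual weight shapes):

```python
def w13_rows(intermediate_size: int, is_act_and_mul: bool) -> int:
    # Gated ("act and mul") MoE packs the gate and up projections into
    # one weight, doubling the row count. A non-gated MoE has a single
    # projection, so a kernel that assumes 2x rows would index out of
    # the actual tensor shape.
    return 2 * intermediate_size if is_act_and_mul else intermediate_size

print(w13_rows(128, True))   # 256
print(w13_rows(128, False))  # 128
```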