[Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) #31983
danisereb wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a fix for non-gated Mixture-of-Experts (MoE) models, specifically for the ModelOptFp8MoEMethod. The changes correctly propagate the is_act_and_mul flag from the FusedMoE layer configuration down to the quantization configuration and backend selection logic. This ensures that models like Nemotron, which use non-fused activations, are handled correctly by the FP8 MoE kernels, particularly when using the Triton backend. The changes are well-contained and maintain backward compatibility by defaulting is_act_and_mul to True. The implementation appears correct and addresses the reported issue.
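The backward-compatible defaulting described above can be sketched as follows. The class and function names here are illustrative only, not vLLM's actual API: the point is that the flag defaults to True so existing gated-MoE call sites need no changes.

```python
from dataclasses import dataclass

# Illustrative sketch (not vLLM's real classes): threading an
# is_act_and_mul flag from the layer config into the quant config,
# defaulting to True so existing gated-MoE callers are unaffected.
@dataclass
class MoEQuantConfig:
    block_quant: bool = False
    # Gated MoE fuses the gate and up projections ("act and mul");
    # non-gated models such as Nemotron use a single projection.
    is_act_and_mul: bool = True

def make_quant_config(is_act_and_mul: bool = True) -> MoEQuantConfig:
    return MoEQuantConfig(is_act_and_mul=is_act_and_mul)

# Existing gated callers keep working with no code changes:
gated = make_quant_config()
# The non-gated path passes the layer's flag through explicitly:
non_gated = make_quant_config(is_act_and_mul=False)
print(gated.is_act_and_mul, non_gated.is_act_and_mul)  # True False
```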
Can you please add this model to the CI/CD? For example:
Hi @danisereb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Manual update about the pre-commit failure:
Force-pushed 4533a69 to 85de95e
Force-pushed 85de95e to 360a708
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Force-pushed 0110619 to 7d08280
@@ -0,0 +1,5 @@
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
This looks like a BF16 model, not FP8?
This was discussed with us in the Slack channel.
This PR can be closed.
This pull request has merge conflicts that must be resolved before it can be merged.
block_quant=False,
tp_size=moe_config.moe_parallel_config.tp_size,
with_lora_support=self.moe.is_lora_enabled,
is_act_and_mul=self.moe.is_act_and_mul,
Purpose
vLLM serve command that fails:
The following backend should be used:
But vLLM fails:
Note

The backend VLLM_USE_FLASHINFER_MOE_FP8=1 was fixed in this PR: #31960
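As a rough sketch of the intended behavior, backend selection should see the gating layout so that a backend without a non-gated kernel is skipped. The function and enum names below are illustrative, not vLLM's actual select_fp8_moe_backend, and the assumption that the FlashInfer path only handles the fused layout is made purely for this sketch.

```python
from enum import Enum

class Fp8MoeBackend(Enum):
    FLASHINFER = "flashinfer"
    TRITON = "triton"

def select_backend(flashinfer_available: bool, is_act_and_mul: bool) -> Fp8MoeBackend:
    # Assumption for this sketch: the FlashInfer path only handles the
    # fused gate-up ("act and mul") layout, so non-gated layers fall
    # back to the Triton kernels instead of crashing at runtime.
    if flashinfer_available and is_act_and_mul:
        return Fp8MoeBackend.FLASHINFER
    return Fp8MoeBackend.TRITON

print(select_backend(True, False).value)  # triton
```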
Test Plan
Run a basic lm_eval test with two configs.

Config based on the recipe (https://docs.vllm.ai/projects/recipes/en/latest/NVIDIA/Nemotron-3-Nano-30B-A3B.html#launch-the-vllm-server):

And with Triton:

Results should be similar to an older commit that did not fail/crash (1ab055e).
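The "results should be similar" criterion can be made concrete with a small helper that compares accuracies within a tolerance. The tolerance value and the scores below are arbitrary choices for illustration, not numbers from this PR.

```python
def scores_match(new_acc: float, baseline_acc: float, tol: float = 0.02) -> bool:
    # Treat the run as passing if the GSM8K accuracy is within `tol`
    # of the accuracy measured at the known-good baseline commit.
    return abs(new_acc - baseline_acc) <= tol

# Hypothetical numbers purely for illustration:
print(scores_match(0.84, 0.85))  # True
print(scores_match(0.70, 0.85))  # False
```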
Test Result
Test results are OK.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.

Note
Ensures FP8 MoE works for non-gated (no gate-up fusion) paths and fixes Triton execution shape assumptions.

- Passes is_act_and_mul into fp8_w8a8_moe_quant_config, make_fp8_moe_quant_config, and select_fp8_moe_backend so kernels/quant config align with gated vs non-gated layouts
- Threads is_act_and_mul into backend selection and quant-config creation to avoid mismatched dimensions at runtime
- Adds NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-triton.yaml and references it in config-b200.txt for GSM8K evals

Written by Cursor Bugbot for commit ed92436863414a4572046962938013b858a8b51e. This will update automatically on new commits.
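The "mismatched dimensions" failure mode comes from the packed layout of the first expert projection. A minimal sketch, with illustrative names and sizes (not vLLM's actual weight shapes):

```python
def w13_rows(intermediate_size: int, is_act_and_mul: bool) -> int:
    # Gated ("act and mul") MoE packs the gate and up projections into
    # one weight, doubling the row count. A non-gated MoE has a single
    # projection, so a kernel that assumes 2x rows would index out of
    # the actual tensor shape.
    return 2 * intermediate_size if is_act_and_mul else intermediate_size

print(w13_rows(128, True))   # 256
print(w13_rows(128, False))  # 128
```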