
[Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron)#31983

Closed
danisereb wants to merge 2 commits into vllm-project:main from danisereb:fix_nemo_bug

Conversation

@danisereb
Contributor

@danisereb danisereb commented Jan 8, 2026

Purpose

vLLM serve command that fails:

export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/

export VLLM_USE_FLASHINFER_MOE_FP8=0

vllm serve $MODEL_PATH --served-model-name my_model \
--trust-remote-code --async-scheduling --kv-cache-dtype fp8 --tensor-parallel-size 1

The following backend should be used:

Detected ModelOpt fp8 checkpoint (quant_algo=FP8).
Using Triton backend for FP8 MoE

But vLLM fails:

(EngineCore_DP0 pid=49841)   File "/my_home/workspace/my_vllm/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 603, in activation
(EngineCore_DP0 pid=49841)     assert output.size(-1) == input.size(-1)
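For context, here is a toy sketch (making no assumptions about vLLM's actual kernel internals) of why the assertion trips: a gated ("act-and-mul") MoE activation consumes a fused gate/up projection of width 2*d and produces width d, while a non-gated activation preserves the width. If the output buffer is sized for one layout and the model uses the other, the size check on the last dimension fails. The activation choices below are illustrative, not Nemotron's real ones.

```python
import math

def gated_activation(x):
    # Gated ("act-and-mul") MoE: the w1/w3 projections are fused, so the
    # activation consumes a 2*d vector and produces a d vector.
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    silu = lambda v: v / (1.0 + math.exp(-v))
    return [silu(g) * u for g, u in zip(gate, up)]

def non_gated_activation(x):
    # Non-gated MoE (as in Nemotron): the activation is elementwise and
    # preserves the vector length. Squared-ReLU here is a hypothetical choice.
    return [max(v, 0.0) ** 2 for v in x]

x = [0.5, -1.0, 2.0, 0.1, -0.3, 1.5, 0.0, 0.7]
print(len(gated_activation(x)))      # half the input length
print(len(non_gated_activation(x)))  # same as the input length
```

A kernel path that assumes the gated (halving) layout for a non-gated model ends up comparing mismatched last dimensions, which is what the `assert output.size(-1) == input.size(-1)` failure above reflects.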

Note

The VLLM_USE_FLASHINFER_MOE_FP8=1 backend was fixed in this PR:
#31960

Test Plan

Run basic lm_eval test with two configs:

Config based on recipe (https://docs.vllm.ai/projects/recipes/en/latest/NVIDIA/Nemotron-3-Nano-30B-A3B.html#launch-the-vllm-server):

VLLM_USE_FLASHINFER_MOE_FP8=1
VLLM_FLASHINFER_MOE_BACKEND=throughput

And with Triton:

VLLM_USE_FLASHINFER_MOE_FP8=0

Results should be similar to an older commit that did not fail/crash (1ab055e).
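A plausible shape for the Triton-path run of this test plan, combining the serve command from the Purpose section with an lm-eval GSM8K scoring pass. The lm_eval flags and task choice are assumptions for illustration, not taken from the PR; this requires a GPU host with the model checkpoint available.

```shell
# Force the Triton FP8 MoE path (the configuration this PR fixes).
export VLLM_USE_FLASHINFER_MOE_FP8=0

vllm serve "$MODEL_PATH" --served-model-name my_model \
  --trust-remote-code --async-scheduling --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 &

# Score GSM8K against the running server via lm-eval's
# OpenAI-compatible "local-completions" backend (hypothetical invocation).
lm_eval --model local-completions \
  --model_args model=my_model,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k --num_fewshot 5
```

The FlashInfer run would be the same with VLLM_USE_FLASHINFER_MOE_FP8=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput, per the recipe linked below.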

Test Result

Test results are OK.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Ensures FP8 MoE works for non-gated (no gate-up fusion) paths and fixes Triton execution shape assumptions.

  • Plumbs is_act_and_mul into fp8_w8a8_moe_quant_config, make_fp8_moe_quant_config, and select_fp8_moe_backend so kernels/quant config align with gated vs non-gated layouts
  • ModelOpt FP8 MoE: passes is_act_and_mul into backend selection and quant-config creation to avoid mismatched dimensions at runtime
  • Adds NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-triton.yaml and references it in config-b200.txt for GSM8K evals

Written by Cursor Bugbot for commit ed92436863414a4572046962938013b858a8b51e. This will update automatically on new commits. Configure here.
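The plumbing the summary describes can be sketched as follows. This is a simplified, hypothetical rendering: the names mirror the PR description (is_act_and_mul, make_fp8_moe_quant_config, select_fp8_moe_backend), but the signatures, fields, and return values are assumptions, not vLLM's actual API.

```python
from dataclasses import dataclass

@dataclass
class Fp8MoEQuantConfig:
    # True: fused gate/up projection (2*d intermediate width);
    # False: plain non-gated MLP expert (width preserved).
    is_act_and_mul: bool

def make_fp8_moe_quant_config(is_act_and_mul: bool = True) -> Fp8MoEQuantConfig:
    # Defaulting to True keeps backward compatibility for existing
    # gated models; non-gated models must pass False explicitly.
    return Fp8MoEQuantConfig(is_act_and_mul=is_act_and_mul)

def select_fp8_moe_backend(cfg: Fp8MoEQuantConfig) -> str:
    # A backend that assumes gate/up fusion must not be selected for
    # non-gated models such as Nemotron (backend names are illustrative).
    return "triton_gated" if cfg.is_act_and_mul else "triton_non_gated"

cfg = make_fp8_moe_quant_config(is_act_and_mul=False)
print(select_fp8_moe_backend(cfg))
```

The point of the fix is simply that both the quant config and the backend choice see the same layout flag, so the kernel's shape assumptions match the weights.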



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for non-gated Mixture-of-Experts (MoE) models, specifically for the ModelOptFp8MoEMethod. The changes correctly propagate the is_act_and_mul flag from the FusedMoE layer configuration down to the quantization configuration and backend selection logic. This ensures that models like Nemotron, which use non-fused activations, are handled correctly by the FP8 MoE kernels, particularly when using the Triton backend. The changes are well-contained and maintain backward compatibility by defaulting is_act_and_mul to True. The implementation appears correct and addresses the reported issue.

@danisereb danisereb marked this pull request as ready for review January 8, 2026 17:14
@robertgshaw2-redhat robertgshaw2-redhat changed the title Fix ModelOptFp8MoEMethod for non-gated MoE (Nemotron) [Bugfix] Fix ModelOptFp8MoEMethod for non-gated MoE (Nemotron) Jan 8, 2026
@robertgshaw2-redhat robertgshaw2-redhat changed the title [Bugfix] Fix ModelOptFp8MoEMethod for non-gated MoE (Nemotron) [Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) Jan 8, 2026
@robertgshaw2-redhat
Collaborator

Can you please add this model to the CI/CD? For example:

LucasWilkinson

This comment was marked as outdated.

@LucasWilkinson LucasWilkinson dismissed their stale review January 10, 2026 02:12

missed Rob's comment

@mergify

mergify bot commented Jan 10, 2026

Hi @danisereb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@danisereb
Contributor Author

danisereb commented Jan 10, 2026

Manual pre-commit run --all-files does not fail.
The new eval test also works (tested only the new model in config-b200.txt).

Update about the pre-commit failure:
Found the issue; it's related to a recently reverted change.
I will rebase and fix.

@danisereb danisereb force-pushed the fix_nemo_bug branch 2 times, most recently from 4533a69 to 85de95e Compare January 11, 2026 15:48
@mergify

mergify bot commented Jan 11, 2026

Hi @danisereb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.


Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@@ -0,0 +1,5 @@
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
Member


This looks like a BF16 model, not FP8?

Contributor Author


This was discussed with us in the Slack channel.
This PR can be closed.

@mergify

mergify bot commented Jan 16, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @danisereb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 16, 2026
block_quant=False,
tp_size=moe_config.moe_parallel_config.tp_size,
with_lora_support=self.moe.is_lora_enabled,
is_act_and_mul=self.moe.is_act_and_mul,
Contributor Author


This was fixed in another PR:
#32257

@danisereb danisereb closed this Jan 18, 2026

Labels

bug Something isn't working needs-rebase
