
[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 #34476

Merged
vllm-bot merged 2 commits into vllm-project:main from CentML:vadim/nemotron-fix on Feb 15, 2026

Conversation

@vadiklyutiy (Collaborator) commented Feb 13, 2026

Purpose

Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1.

Alternative to #34151.

What happened

PR #33257 (a372f3f40) was a squash-merge of two logical changes:

  1. Adding TP support for quantized Mamba models with n_groups=1 (e.g., Falcon-H1R-7B with FP8) — correct and needed.
  2. Unifying both code paths (n_groups % tp_size == 0 and n_groups == 1) to always use ColumnParallelLinear with a custom mamba_v2_sharded_weight_loader, which is incorrect for quantized models.

The second change (d5d6d0b88) replaced MergedColumnParallelLinear with ColumnParallelLinear + custom weight loader for ALL cases, but only overrode the weight_loader on in_proj.weight. For quantized models (NVFP4, FP8), there are also scale parameters (weight_scale, input_scale, weight_scale_2) that still use ColumnParallelLinear's default contiguous sharding.

The mamba in_proj has a composite weight layout [gate, intermediate, B_groups, C_groups, dt_heads]. The custom mamba_loader shards each component separately, but the scale parameters do simple contiguous sharding, causing weight-scale misalignment — the dequantization scale at row N corresponds to a different model component than the weight at row N.

MergedColumnParallelLinear handles this correctly because it knows the per-component output sizes and shards ALL parameters (weights and scales) accordingly.
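The mismatch can be illustrated with a small self-contained sketch. The component sizes and tp_size below are made up for illustration (the real in_proj dimensions come from the model config), but the two sharding schemes are the ones described above: per-component sharding for the weight versus contiguous sharding for the scales.

```python
import numpy as np

# Hypothetical per-component output sizes for a composite in_proj weight
# laid out as [gate, intermediate, B_groups, C_groups, dt_heads].
# These numbers are illustrative, not the model's real dimensions.
sizes = {"gate": 8, "intermediate": 8, "B_groups": 4, "C_groups": 4, "dt_heads": 2}
total = sum(sizes.values())  # 26 output rows
tp_size = 2

def component_shard(rank: int) -> np.ndarray:
    """Shard each component separately, as the custom mamba weight
    loader does: every rank gets its slice of *every* component."""
    rows = []
    offset = 0
    for n in sizes.values():
        per_rank = n // tp_size
        start = offset + rank * per_rank
        rows.extend(range(start, start + per_rank))
        offset += n
    return np.array(rows)

def contiguous_shard(rank: int) -> np.ndarray:
    """Simple contiguous sharding, as ColumnParallelLinear's default
    loader applies to the scale parameters."""
    per_rank = total // tp_size
    return np.arange(rank * per_rank, (rank + 1) * per_rank)

# The two schemes select different rows for the same rank, so a
# dequantization scale loaded contiguously no longer lines up with
# the weight row it belongs to.
print(component_shard(0))   # rows drawn from every component
print(contiguous_shard(0))  # the first half of all rows, crossing components
print(np.array_equal(component_shard(0), contiguous_shard(0)))  # False
```

For rank 0, the weight loader picks rows from each of the five components, while the scale loader takes the first 13 rows straight through, so from row 4 onward every scale is paired with the wrong component's weight.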

Fix

Revert d5d6d0b88 ("Unify MambaMixer2 TP sharding to use custom weight loader"), the last commit from #33257. This restores MergedColumnParallelLinear for the n_groups % tp_size == 0 case while preserving the n_groups == 1 quantized TP support.
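In rough terms, the restored dispatch looks like the sketch below. The function name and return values are illustrative, not vLLM's actual code; the branch conditions are the two cases described in this PR.

```python
# Illustrative sketch of the restored layer selection for mamba in_proj.
# make_in_proj is a hypothetical name; vLLM's real construction differs.
def make_in_proj(n_groups: int, tp_size: int, output_sizes: list[int]):
    if n_groups % tp_size == 0:
        # Restored path: MergedColumnParallelLinear knows the per-component
        # output sizes, so it shards weights AND quantization scales
        # (weight_scale, input_scale, weight_scale_2) consistently.
        return ("MergedColumnParallelLinear", output_sizes)
    elif n_groups == 1:
        # Preserved path from #33257: ColumnParallelLinear with the custom
        # mamba_v2_sharded_weight_loader overriding in_proj.weight.
        return ("ColumnParallelLinear", "mamba_v2_sharded_weight_loader")
    raise ValueError("n_groups must be 1 or divisible by tp_size")
```

The key point is that the custom loader is now confined to the n_groups == 1 case it was written for, instead of replacing the merged layer everywhere.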

Test Results

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4, TP=8, gsm8k 5-shot (limit=500), 8×B200

After fix (3 runs):

| Run | flexible-extract | strict-match | Status |
|-----|------------------|--------------|--------|
| 1   | 0.430            | 0.760        | GOOD   |
| 2   | 0.416            | 0.746        | GOOD   |
| 3   | 0.424            | 0.730        | GOOD   |

3/3 runs correct.

@mergify mergify bot added nvidia bug Something isn't working labels Feb 13, 2026
@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request addresses an accuracy regression for quantized Mamba models with tensor parallelism greater than one. The root cause was a previous change that incorrectly unified weight loading logic, leading to a misalignment between weights and their quantization scales. This fix correctly reverts the problematic part of that change by restoring the use of MergedColumnParallelLinear when the number of groups is divisible by the tensor-parallel size. This is the right approach, as MergedColumnParallelLinear correctly handles the sharding of all parameters, including quantization scales, resolving the issue. The special handling for n_groups=1, which was already correct, is preserved. The change is well-justified, clearly explained, and effectively fixes the bug.

@vadiklyutiy vadiklyutiy self-assigned this Feb 13, 2026
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 13, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 13, 2026
@vllm-bot vllm-bot merged commit 604b9ea into vllm-project:main Feb 15, 2026
52 of 56 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 15, 2026
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Feb 16, 2026
…VFP4 with TP>1 (vllm-project#34476)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com>
wzhao18 pushed a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
…VFP4 with TP>1 (vllm-project#34476)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
…VFP4 with TP>1 (vllm-project#34476)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…VFP4 with TP>1 (vllm-project#34476)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…VFP4 with TP>1 (vllm-project#34476)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…VFP4 with TP>1 (vllm-project#34476)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy deleted the vadim/nemotron-fix branch March 11, 2026 08:00

Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


3 participants