[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 #34476
Conversation
Code Review
This pull request addresses an accuracy regression for quantized Mamba models with tensor parallelism greater than one. The root cause was a previous change that incorrectly unified weight loading logic, leading to a misalignment between weights and their quantization scales. This fix correctly reverts the problematic part of that change by restoring the use of MergedColumnParallelLinear when the number of groups is divisible by the tensor-parallel size. This is the right approach, as MergedColumnParallelLinear correctly handles the sharding of all parameters, including quantization scales, resolving the issue. The special handling for n_groups=1, which was already correct, is preserved. The change is well-justified, clearly explained, and effectively fixes the bug.
…VFP4 with TP>1 (vllm-project#34476) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com>
Purpose
Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1.
Alternative to #34151.
What happened
PR #33257 (a372f3f40) was a squash-merge of two logical changes:

1. Quantized TP support for `n_groups == 1` (e.g., Falcon-H1R-7B with FP8) — correct and needed.
2. Unifying MambaMixer2 TP sharding (both `n_groups % tp_size == 0` and `n_groups == 1`) to always use `ColumnParallelLinear` with a custom `mamba_v2_sharded_weight_loader` — incorrect for quantized models.

The second change (d5d6d0b88) replaced `MergedColumnParallelLinear` with `ColumnParallelLinear` + custom weight loader for ALL cases, but only overrode the weight loader on `in_proj.weight`. For quantized models (NVFP4, FP8), there are also scale parameters (`weight_scale`, `input_scale`, `weight_scale_2`) that still use `ColumnParallelLinear`'s default contiguous sharding.

The mamba `in_proj` has a composite weight layout `[gate, intermediate, B_groups, C_groups, dt_heads]`. The custom mamba loader shards each component separately, but the scale parameters get simple contiguous sharding, causing weight-scale misalignment: the dequantization scale at row N corresponds to a different model component than the weight at row N. `MergedColumnParallelLinear` handles this correctly because it knows the per-component output sizes and shards ALL parameters (weights and scales) accordingly.
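The misalignment can be shown with a toy sketch (not vLLM code; the component sizes below are made up for illustration) that shards a composite layout per component for the weight but contiguously for its scales:

```python
# Toy illustration (not vLLM code; component sizes are made up) of why
# sharding in_proj.weight per component while its quantization scales
# fall back to contiguous sharding misaligns rows under TP.
sizes = {"gate": 8, "intermediate": 8, "B_groups": 4, "C_groups": 4, "dt_heads": 2}
tp_size = 2
total = sum(sizes.values())
# Label every output row of the composite layout with its component.
labels = [name for name, n in sizes.items() for _ in range(n)]

def component_shard(rank):
    """Per-component sharding: each rank takes its slice of EVERY component
    (what the custom mamba loader does for the weight)."""
    rows, offset = [], 0
    for n in sizes.values():
        part = n // tp_size
        rows.extend(range(offset + rank * part, offset + (rank + 1) * part))
        offset += n
    return rows

def contiguous_shard(rank):
    """Default contiguous sharding: each rank takes one contiguous block
    (what the scale parameters got)."""
    part = total // tp_size
    return list(range(rank * part, (rank + 1) * part))

# On rank 1, count local rows whose weight and scale come from
# different components.
w_rows, s_rows = component_shard(1), contiguous_shard(1)
mismatched = sum(labels[w] != labels[s] for w, s in zip(w_rows, s_rows))
print(f"rank 1: {mismatched}/{len(w_rows)} rows pair a weight with a scale "
      f"from a different component")
```

With these toy sizes, most local rows on rank 1 end up dequantized with a scale belonging to a different component, which is exactly the kind of silent accuracy corruption seen in the regression.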
Fix

Revert d5d6d0b88 ("Unify MambaMixer2 TP sharding to use custom weight loader"), the last commit from #33257. This restores `MergedColumnParallelLinear` for the `n_groups % tp_size == 0` case while preserving the `n_groups == 1` quantized TP support.
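The invariant this restores can be sketched as a hedged toy model (not the actual vLLM class, and the output sizes are illustrative): when every parameter is sharded with the same per-component splits, each weight row stays paired with its own scale and dequantization recovers the original rows.

```python
import random

# Hedged toy sketch (not the actual vLLM class): the key property of a
# merged column-parallel layer is that it knows the per-component output
# sizes and applies the SAME per-component sharding to every parameter,
# quantization scales included.
output_sizes = [8, 8, 4, 4, 2]  # illustrative [gate, intermediate, B, C, dt]
tp_size = 2

def shard(rows, rank):
    """Shard a per-row parameter (weight rows or their scales) per
    component along the output dimension."""
    out, offset = [], 0
    for n in output_sizes:
        part = n // tp_size
        out.extend(rows[offset + rank * part: offset + (rank + 1) * part])
        offset += n
    return out

random.seed(0)
weight = [random.uniform(-1.0, 1.0) for _ in range(sum(output_sizes))]
scale = [abs(w) / 7.0 + 1e-6 for w in weight]            # toy per-row scale
qweight = [round(w / s) for w, s in zip(weight, scale)]  # toy quantization

for rank in range(tp_size):
    deq = [q * s for q, s in zip(shard(qweight, rank), shard(scale, rank))]
    ref = shard(weight, rank)
    # Identical sharding keeps every weight row paired with its own scale,
    # so dequantization recovers each row up to quantization error.
    assert all(abs(d - r) <= s for d, r, s in zip(deq, ref, shard(scale, rank)))
```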
Test Results

Model: `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`, TP=8, gsm8k 5-shot (limit=500), 8×B200.

After fix (3 runs): 3/3 runs correct.