[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1#33257

Merged
tlrmchlsmth merged 3 commits into vllm-project:main from CentML:vadim/fix-falcon-fp8-tp
Feb 3, 2026

Conversation

@vadiklyutiy
Collaborator

@vadiklyutiy vadiklyutiy commented Jan 28, 2026

Summary

Enable tensor parallelism (TP > 1) for quantized hybrid Mamba models (e.g., Falcon-H1R-7B with FP8) when n_groups=1.

Root Cause

Custom weight loaders for group replication were only implemented for non-quantized layers. This PR extends support to quantized layers by leveraging the weight_loader property on ModelWeightParameter (which extends BasevLLMParameter).
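A minimal, self-contained sketch of the dispatch described above. These are plain-Python stand-ins, not the actual vLLM classes (the real ones are torch.nn.Parameter and vllm's BasevLLMParameter hierarchy), and `attach_sharded_loader`/`sharded_loader` are hypothetical names standing in for the PR's mamba_v2_sharded_weight_loader wiring:

```python
class PlainParameter:
    """Stand-in for torch.nn.Parameter on non-quantized layers."""
    def __init__(self, data):
        self.data = list(data)


class BasevLLMParameter(PlainParameter):
    """Stand-in: quantized-weight parameters expose a weight_loader property."""
    def __init__(self, data):
        super().__init__(data)
        self._weight_loader = None

    @property
    def weight_loader(self):
        return self._weight_loader

    @weight_loader.setter
    def weight_loader(self, loader):
        self._weight_loader = loader


def sharded_loader(param, loaded_weight):
    """Placeholder for the group-replication loader: with n_groups=1,
    the single group's weights are replicated to every TP rank."""
    param.data = list(loaded_weight)


def attach_sharded_loader(param):
    # The gist of the fix: cover both parameter kinds, not only the plain one.
    if isinstance(param, BasevLLMParameter):
        param.weight_loader = sharded_loader          # quantized path (this PR)
    else:
        setattr(param, "weight_loader", sharded_loader)  # pre-existing path
```

With this, both a quantized and a non-quantized parameter end up with a callable `weight_loader`, so checkpoint loading no longer hits the unsupported-quantization assertion.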

Test

vllm serve tiiuae/Falcon-H1R-7B -tp 2 --quantization fp8

This previously failed; with this PR it works.

Validation (lm_eval)

lm_eval --model local-chat-completions \
  --model_args model=tiiuae/Falcon-H1R-7B,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=250 \
  --tasks gsm8k --apply_chat_template --num_fewshot 5

Results with TP=2 (this PR):

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.3283 | ± | 0.0129 |
|       |         | strict-match     |      5 | exact_match | 0.0963 | ± | 0.0081 |

Results with TP=1 (baseline):

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.3154 | ± | 0.0128 |
|       |         | strict-match     |      5 | exact_match | 0.1031 | ± | 0.0084 |

Results are consistent within statistical error margins, confirming correctness.
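As a back-of-the-envelope check (not part of the PR), the TP=2 and TP=1 scores can be compared against their combined standard errors; both filters agree well within two combined standard errors:

```python
import math


def consistent(v1, se1, v2, se2, z=2.0):
    """True if |v1 - v2| is within z combined standard errors."""
    return abs(v1 - v2) <= z * math.sqrt(se1 ** 2 + se2 ** 2)


# gsm8k flexible-extract: TP=2 (this PR) vs TP=1 (baseline)
print(consistent(0.3283, 0.0129, 0.3154, 0.0128))  # True
# gsm8k strict-match
print(consistent(0.0963, 0.0081, 0.1031, 0.0084))  # True
```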

Related Issues

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy requested a review from tdoublep as a code owner January 28, 2026 13:24
@vadiklyutiy vadiklyutiy self-assigned this Jan 28, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables tensor parallelism for quantized Mamba models with n_groups=1 by extending custom weight loader support to quantized layers. The changes are well-structured and address the issue effectively. The removal of the restrictive assertion and the new logic for applying the mamba_v2_sharded_weight_loader to both quantized and non-quantized weights are correct. The implementation correctly distinguishes between BasevLLMParameter subclasses and standard torch.nn.Parameter to set the weight loader. The changes look good and improve support for quantized Mamba models.

@vadiklyutiy
Collaborator Author

cc @tomeras91


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Contributor

@tomeras91 tomeras91 left a comment


Nice!

@vadiklyutiy vadiklyutiy force-pushed the vadim/fix-falcon-fp8-tp branch 2 times, most recently from 84f3cc8 to b3878ef Compare February 1, 2026 20:52
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy
Collaborator Author

I want to cover this with wider testing, so I'm adding the ready label for that.

@vadiklyutiy vadiklyutiy added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 1, 2026
@vadiklyutiy
Collaborator Author

@tomeras91 @tdoublep
CI passed, the review comment is addressed, and the previously failing Falcon-H1R-7B case is fixed.
Could you take a look?

@tomeras91
Contributor

LGTM
As a final validation, can you double check that NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 still works and produces similar GSM8K results on main and on this PR?

@vadiklyutiy
Collaborator Author

vadiklyutiy commented Feb 2, 2026

> LGTM As a final validation, can you double check that NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 still works and produces similar GSM8K results on main and on this PR?

For NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

Baseline

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.3169 | ± | 0.0128 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.4792 | ± | 0.0138 |

With PR

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.3017 | ± | 0.0126 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.4693 | ± | 0.0137 |

@mgoin mgoin added the bug Something isn't working label Feb 3, 2026
@tlrmchlsmth tlrmchlsmth merged commit a372f3f into vllm-project:main Feb 3, 2026
45 checks passed
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request Feb 5, 2026
…s=1 (vllm-project#33257)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
amitz-nv added a commit to amitz-nv/vllm that referenced this pull request Feb 9, 2026
amitz-nv added a commit to amitz-nv/vllm that referenced this pull request Feb 10, 2026
… n_groups=1 (vllm-project#33257)"

This reverts commit a372f3f.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…s=1 (vllm-project#33257)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy deleted the vadim/fix-falcon-fp8-tp branch March 11, 2026 08:00

Labels

bug: Something isn't working
ready: ONLY add when PR is ready to merge/full CI is needed


Development

Successfully merging this pull request may close these issues.

[Bug]: Quantization In MambaMixer2 Not Supported when Tensor Parallel is enabled

4 participants