[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 #33257
tlrmchlsmth merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request enables tensor parallelism for quantized Mamba models with n_groups=1 by extending custom weight-loader support to quantized layers. The changes are well structured and address the issue effectively: removing the restrictive assertion and applying the mamba_v2_sharded_weight_loader to both quantized and non-quantized weights is correct. The implementation properly distinguishes between BasevLLMParameter subclasses and standard torch.nn.Parameter when setting the weight loader, improving support for quantized Mamba models.
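A minimal sketch of the dispatch the review describes: attach the sharded weight loader to both quantized parameters (subclasses of a BasevLLMParameter-like class) and plain torch.nn.Parameter. The BasevLLMParameter stand-in, the set_weight_loader helper, and the loader body below are illustrative assumptions, not vLLM's actual implementation.

```python
import torch
import torch.nn as nn


class BasevLLMParameter(nn.Parameter):
    """Stand-in for vLLM's parameter base class carrying a weight_loader slot."""

    def __new__(cls, data, weight_loader=None):
        obj = super().__new__(cls, data, requires_grad=False)
        obj.weight_loader = weight_loader
        return obj


def mamba_v2_sharded_weight_loader(param, loaded_weight):
    # Illustrative loader: with n_groups=1 the single group's weights are
    # replicated to every tensor-parallel rank rather than sharded.
    param.data.copy_(loaded_weight)


def set_weight_loader(param):
    if isinstance(param, BasevLLMParameter):
        # Quantized path: the parameter class exposes a weight_loader slot.
        param.weight_loader = mamba_v2_sharded_weight_loader
    else:
        # Non-quantized path: attach the loader as a plain attribute,
        # as is conventional for torch.nn.Parameter in model loading code.
        param.weight_loader = mamba_v2_sharded_weight_loader


quant_param = BasevLLMParameter(torch.zeros(4))
plain_param = nn.Parameter(torch.zeros(4), requires_grad=False)
for p in (quant_param, plain_param):
    set_weight_loader(p)
    p.weight_loader(p, torch.ones(4))
print(quant_param.sum().item(), plain_param.sum().item())  # 4.0 4.0
```

Both parameter kinds end up with the same loader attached, which is the behavior the review credits for fixing the quantized path.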
cc @tomeras91
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Force-pushed 84f3cc8 to b3878ef
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
want to cover with wider testing. Add
@tomeras91 @tdoublep |
LGTM |
Comparison table: For / Baseline / With PR (values not captured)
…s=1 (vllm-project#33257)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>

… n_groups=1 (vllm-project#33257)"
This reverts commit a372f3f.
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>

…s=1 (vllm-project#33257)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Summary
Enable tensor parallelism (TP > 1) for quantized hybrid Mamba models (e.g., Falcon-H1R-7B with FP8) when n_groups=1.

Root Cause
Custom weight loaders for group replication were only implemented for non-quantized layers. This PR extends support to quantized layers by leveraging the weight_loader property on ModelWeightParameter (which extends BasevLLMParameter).

Test
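The group-replication idea can be sketched as follows: when n_groups == 1 and the tensor-parallel size is greater than 1, the single group's weights cannot be split across ranks, so each rank receives a full copy. The shard_or_replicate function and its signature are hypothetical, standing in for what the custom loader does along the group dimension; it is not vLLM's real API.

```python
import torch


def shard_or_replicate(full_weight, n_groups, tp_size, tp_rank):
    """Return this rank's slice of a group-structured weight (illustrative)."""
    if n_groups % tp_size == 0:
        # Groups divide evenly across ranks: take a contiguous shard.
        shard = full_weight.shape[0] // tp_size
        return full_weight[tp_rank * shard:(tp_rank + 1) * shard]
    if n_groups == 1:
        # Single group: replicate the whole weight on every rank.
        return full_weight.clone()
    raise ValueError("n_groups must be 1 or divisible by tp_size")


w = torch.arange(8.0)
print(shard_or_replicate(w, n_groups=4, tp_size=2, tp_rank=1).tolist())  # [4.0, 5.0, 6.0, 7.0]
print(shard_or_replicate(w, n_groups=1, tp_size=2, tp_rank=1).shape)  # torch.Size([8])
```

The second call shows the n_groups=1 case: every rank sees the full tensor, which is why the restrictive `n_groups % tp_size == 0` assertion could be relaxed.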
Previously failed, now works
Validation (lm_eval)
Results with TP=2 (this PR):
Results with TP=1 (baseline):
Results are consistent within statistical error margins, confirming correctness.
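A small sketch of what "consistent within statistical error margins" means here: two accuracy estimates agree if their difference falls within the combined standard error. The scores and the 1.96-sigma criterion below are illustrative assumptions, not the PR's actual lm_eval numbers.

```python
import math


def consistent(acc_a, se_a, acc_b, se_b, z=1.96):
    # Two-sided z-style comparison: the difference of the two accuracy
    # estimates is compared against z times the combined standard error.
    return abs(acc_a - acc_b) <= z * math.sqrt(se_a**2 + se_b**2)


# Made-up example scores: TP=2 vs TP=1 baseline.
print(consistent(0.712, 0.012, 0.705, 0.013))  # True
```

With this criterion, a 0.7-point accuracy gap is well within the roughly 3.5-point margin implied by the two standard errors, so the runs would be judged consistent.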
Related Issues
n_groups % tp_size == 0 #24593