[Bugfix] Fix Qwen3.5 LoRA IndexError in packed_modules_mapping#36825
hallerite wants to merge 5 commits into vllm-project:main
Conversation
Code Review
This pull request addresses an IndexError that occurs when using LoRA with Qwen3.5 models. The fix is two-fold: first, it correctly aligns the packed_modules_mapping for in_proj_qkvz in Qwen3.5 models to have four entries, which now matches the layer's four output_sizes. Second, it generalizes the layer replacement logic in MergedColumnParallelLinearWithLoRA to be more robust by dynamically checking against the number of output sizes instead of a hardcoded value. These changes are well-reasoned, directly fix the bug, and improve the code's maintainability.
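The mismatch described above can be reproduced in miniature. The sketch below uses made-up sizes and plain lists (not the actual vLLM classes) to show why deriving `n_slices` from `output_sizes` while creating only `len(packed_modules)` adapters must crash:

```python
# Hypothetical stand-in values, not the real Qwen3.5 dimensions.
output_sizes = [128, 128, 256, 256]          # key_dim, key_dim, value_dim, value_dim
packed_modules = ["in_proj_qkv", "in_proj_z"]  # the old 2-entry mapping

n_slices = len(output_sizes)                 # 4 slices expected
lora_a = [None] * len(packed_modules)        # but only 2 adapter slots created

try:
    for i in range(n_slices):
        _ = lora_a[i]                        # fails once i reaches 2
except IndexError:
    print("IndexError at slice", i)
```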
hallerite force-pushed from d179859 to e38dbbd
Hi @hallerite, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
alvinttang left a comment
This is a correct two-part fix: the packed_modules_mapping was misrepresenting in_proj_qkvz as 2 sub-modules when it actually has 4 (matching the 4 output_sizes in create_qkvz_proj), and can_replace_layer was hardcoded to len == 2 instead of dynamically checking against the layer's actual output_sizes. The dynamic check in can_replace_layer is the more important improvement since it makes the validation self-consistent and prevents future regressions if the packing changes again. One thing worth double-checking: are there any serialized/saved LoRA adapters in the wild that used the old 2-module mapping that would now silently fail to load against this updated definition? Overall this is a well-reasoned fix and both changes are necessary together.
hallerite force-pushed from 27fb13a to 67a242a
This has unblocked the seq len error, but it stopped working for me when I load LoRA with 4-bit quant bitsandbytes.
Could you give me a stack trace and, ideally, a way to reproduce this?
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/lora_with_quantization_inference.py — you can run this example with the base model set to Qwen 3.5 9B, any weight size.
Signed-off-by: hallerite <git@hallerite.com>
Revert `packed_modules_mapping` to the real HF weight names (`in_proj_qkv`, `in_proj_z`) to fix bitsandbytes quant-state stacking, and extend `MergedColumnParallelLinearVariableSliceWithLoRA` to handle the mismatch between the packed module count (2) and the `output_sizes` count (4) in GDN layers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hallerite <git@hallerite.com>
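The 2-weights-to-4-slices handling this commit describes can be sketched as a greedy match of a LoRA weight's rows against consecutive `output_sizes`. The helper below is a hypothetical illustration (names and signature are assumptions, not the actual vLLM code); it also includes the alignment assertion added later in this PR:

```python
def match_output_sizes(num_rows, output_sizes, start=0):
    """Greedily consume output_sizes from `start` until they sum to
    num_rows; returns the list of matched slice sizes."""
    slices, consumed, i = [], 0, start
    while consumed < num_rows and i < len(output_sizes):
        slices.append(output_sizes[i])
        consumed += output_sizes[i]
        i += 1
    # Guard against silent data corruption if dimensions don't align.
    assert consumed == num_rows, "rows don't align with output_sizes"
    return slices

# With hypothetical sizes [key, key, value, value] = [128, 128, 256, 256]:
# the fused in_proj_qkv weight spans the first three slices,
# and in_proj_z spans the last one.
print(match_output_sizes(128 + 128 + 256, [128, 128, 256, 256]))   # [128, 128, 256]
print(match_output_sizes(256, [128, 128, 256, 256], start=3))      # [256]
```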
hallerite force-pushed from 40cf2b0 to eb7439f
Pushed a potential fix; it at least worked locally for me. The issue was that the … Feel free to try again and tell me if it works now.
Signed-off-by: hallerite <git@hallerite.com>
Assert that consumed dimensions exactly match lora_b's shape after greedily matching output_sizes. Prevents silent data corruption if dimensions don't align.
Signed-off-by: hallerite <git@hallerite.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closing in favor of #37019, which is a superset of this PR — it includes all the changes here (VariableSlice
Summary
Fixes `IndexError: list index out of range` when enabling LoRA with Qwen3.5 models (`Qwen3_5ForCausalLMBase` and `Qwen3_5ForConditionalGeneration`).

Root cause: Qwen3.5's `create_qkvz_proj` overrides the parent (Qwen3Next) to use 4 `output_sizes` `[key_dim, key_dim, value_dim, value_dim]` for correct per-slice TP sharding. However, `packed_modules_mapping` only lists 2 entries, `["in_proj_qkv", "in_proj_z"]`. During LoRA initialization, `MergedColumnParallelLinearWithLoRA` sets `n_slices = len(output_sizes)` (4) but only creates `len(packed_modules)` (2) adapters, so accessing `lora_a[2]`/`lora_a[3]` crashes.

Fix:
1. Expand `packed_modules_mapping` for `in_proj_qkvz` from 2 to 4 entries, `["in_proj_q", "in_proj_k", "in_proj_v", "in_proj_z"]`, matching the 4 `output_sizes`.
2. Generalize `MergedColumnParallelLinearWithLoRA.can_replace_layer` from `len(packed_modules_list) == 2` to `len(packed_modules_list) == len(source_layer.output_sizes)`, so it works for any N-way merged column parallel linear, not just 2-way.

This works for any TP size because each of the 4 packed modules maps to one output_size, preserving correct per-slice sharding.
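The two changes can be sketched as follows. This is a minimal illustration built from the identifiers in this PR description; the `FakeLayer` class and the surrounding scaffolding are assumptions, not the actual vLLM source:

```python
class FakeLayer:
    # Hypothetical stand-in for a merged column-parallel linear layer
    # with 4 output slices (key_dim, key_dim, value_dim, value_dim).
    output_sizes = [128, 128, 256, 256]

packed_modules_mapping = {
    # before: ["in_proj_qkv", "in_proj_z"]  (2 entries — caused the crash)
    "in_proj_qkvz": ["in_proj_q", "in_proj_k", "in_proj_v", "in_proj_z"],
}

def can_replace_layer(source_layer, packed_modules_list):
    # before: len(packed_modules_list) == 2  (hardcoded)
    # after: validate against the layer's actual number of output slices
    return len(packed_modules_list) == len(source_layer.output_sizes)

print(can_replace_layer(FakeLayer(), packed_modules_mapping["in_proj_qkvz"]))  # True
print(can_replace_layer(FakeLayer(), ["in_proj_qkv", "in_proj_z"]))            # False
```

The dynamic check keeps the validation self-consistent: if the packing ever changes again, the mapping and the layer must agree or replacement is refused up front.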
Note: The parent class `Qwen3Next` doesn't have this issue because it uses `output_sizes=[sum([key_dim, key_dim, value_dim, value_dim])]` (1 entry) with `packed_modules=["in_proj_qkvz"]` (1 entry) — they match.

Note: This may not be the globally optimal solution. The 4 packed module names (`in_proj_q`, `in_proj_k`, `in_proj_v`, `in_proj_z`) are synthetic — the actual HF weight names are `in_proj_qkv` (fused Q+K+V) and `in_proj_z`. This means LoRA adapter weights targeting the GDN projections by their real HF names wouldn't be found during loading. In practice this isn't an issue today because nobody LoRAs the GDN layers — only standard attention and MLP layers are targeted. A more complete fix would be to support M packed modules mapping to N output sizes (2 weights → 4 sharding slices) in `MergedColumnParallelLinearWithLoRA`, but that's a larger refactor.

Related: #36372, #36478
Test plan