[BugFix] Fix Qwen3.5 LoRA IndexError in GDN fused projections#36309
JWriter20 wants to merge 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request addresses an IndexError that occurs when using LoRA with Qwen3.5 models. The fix correctly aligns the output_sizes of the GDN fused projection layers with the corresponding packed_modules_mapping by changing the number of output slices from four to two. The stacked_params_mapping for weight loading is also updated to match the new slice configuration. Additionally, a comprehensive set of regression tests has been added to prevent this issue from recurring. The changes appear correct and effectively resolve the bug.
Hi @JWriter20, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from cdae214 to c959452
Fix IndexError: list index out of range when using LoRA adapters with Qwen3.5 models (dense, MoE, and multimodal variants).

The root cause was a mismatch between the number of output_sizes slices in the MergedColumnParallelLinear for in_proj_qkvz (4 slices) and the number of entries in packed_modules_mapping (2 entries). When LoRA's set_lora iterated over n_slices=4, it only had 2 LoRA weights available, causing an IndexError at index 2.

Changes:
- Change create_qkvz_proj output_sizes from [key_dim, key_dim, value_dim, value_dim] to [key_dim * 2 + value_dim, value_dim] to match the 2-entry packed_modules_mapping ["in_proj_qkv", "in_proj_z"]
- Update stacked_params_mapping shard IDs from (0,1,2)/3 to 0/1 to match the new 2-slice layout
- Add regression tests validating the alignment between output_sizes and packed_modules_mapping for all Qwen3.5 variants

The fix preserves the total projection dimension (key_dim*2 + value_dim*2) and matches the HuggingFace checkpoint structure, where in_proj_qkv and in_proj_z are stored as separate weight tensors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Force-pushed from c959452 to e421959
This PR should be merged ASAP. We're already using it in production, and it's absolutely necessary for any reasonable usage of Qwen3.5.
We also need this fix. cc: @jeejeelee |
jeejeelee left a comment
Have you tested this PR with the real LoRA adapter?
I have tested this with a LoRA adapter, with Qwen/Qwen3.5-397B-A17B-FP8 as the base, but got garbage output (random characters), unfortunately. (The adapter works if --enforce-eager is enabled.)
@jeejeelee @musab-mk Interesting; we have been able to fine-tune LoRA adapters and get improved performance on Qwen3.5 (dense) as expected. We've deployed this PR with both our batched inference API and online inference; both work very well on our side. In both cases, enforce-eager was false.
@musab-mk @1dividedby0 could you please test this PR: #36976 |
@jeejeelee Yes, I just tested that PR with Qwen/Qwen3.5-397B-A17B-FP8 as the base; it worked perfectly.
Head branch was pushed to by a user without write access
…with LoRA

The `gdn_in_proj` custom op (introduced in f174000 / PR vllm-project#36795) uses `self.in_proj_qkvz.weight.shape[0]` to communicate the output tensor size to torch.compile's fake implementation. With LoRA + AWQ/GPTQ quantization, `.weight` returns the quantized `qweight`, whose shape is packed (e.g. input_size // 8 for 4-bit), causing a dimension mismatch in the subsequent `.split()` call.

Fix: compute output sizes analytically from model dimensions (key_dim, value_dim, num_v_heads, tp_size) instead of reading them from the weight tensor shape. These computed values are identical to weight.shape[0] for non-quantized models, so there is no regression.

Tested with:
- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Force-pushed from f47c8b7 to ce82f3d
Add two tests to prevent future regressions:

1. test_qwen3_5_forward_does_not_use_weight_shape_for_gdn_in_proj: verifies the forward method computes gdn_in_proj output sizes from model dimensions instead of .weight.shape[0], which returns wrong values for quantized models (AWQ/GPTQ) with LoRA.
2. test_qwen3_5_gdn_output_sizes_match_model_dims: validates the computed output-size formulas against known Qwen3.5-9B dimensions, including TP sharding correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Force-pushed from d399f88 to 08d7b09
This pull request has merge conflicts that must be resolved before it can be merged.
this is moving a bit slow, just wanted to flag @jeejeelee |
Since we already have #36976, I think this PR should no longer be necessary.
Got it, closed. Thank you! |
Fixes #35286
Purpose
Fixes an `IndexError: list index out of range` crash when using LoRA adapters with Qwen3.5 models (dense, MoE, and multimodal variants).

The crash occurs in `MergedColumnParallelLinearWithLoRA.set_lora` because there is a mismatch between `output_sizes` (4 slices) and `packed_modules_mapping` (2 entries) for the GDN (Gated Delta Network) fused projection layers (`in_proj_qkvz` and `in_proj_ba`).

Root Cause (Original Fix)
In `Qwen3_5GatedDeltaNet.create_qkvz_proj`, the `MergedColumnParallelLinear` was created with `output_sizes = [key_dim, key_dim, value_dim, value_dim]`, but `packed_modules_mapping` defined only 2 entries: `["in_proj_qkv", "in_proj_z"]`. The LoRA system requires these to match: `set_lora` iterates over `n_slices = len(output_sizes)` and indexes into `lora_a[i]`, but only 2 LoRA weight tensors are created (one per `packed_modules_mapping` entry), causing an `IndexError` at index 2.

Fix (Original)
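Both the failure and the 2-slice fix can be reproduced with a short sketch (toy Python stand-ins for the LoRA tensors; the names and sizes follow this description, not vLLM's actual code):

```python
# Toy reproduction of the slice-count mismatch; names mirror the PR
# description (set_lora, output_sizes), but the objects are plain
# Python stand-ins, not vLLM classes.

def set_lora_sketch(output_sizes, lora_a):
    # set_lora iterates n_slices = len(output_sizes) and indexes lora_a[i]
    for i in range(len(output_sizes)):
        _ = lora_a[i]  # IndexError once i exceeds len(lora_a) - 1

key_dim, value_dim = 2048, 4096
lora_a = ["lora_for_in_proj_qkv", "lora_for_in_proj_z"]  # one per mapping entry

broken = [key_dim, key_dim, value_dim, value_dim]  # 4 slices (before the fix)
fixed = [key_dim * 2 + value_dim, value_dim]       # 2 slices (after the fix)

try:
    set_lora_sketch(broken, lora_a)
except IndexError:
    print("IndexError at slice index 2, as reported")

set_lora_sketch(fixed, lora_a)    # 2 slices: no error
print(sum(broken) == sum(fixed))  # True: total projection dim preserved
```

The key invariant is the last line: the fix only regroups the slices, so the total fused projection dimension is unchanged.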
- `output_sizes`: changed from `[key_dim, key_dim, value_dim, value_dim]` (4 slices) to `[key_dim * 2 + value_dim, value_dim]` (2 slices). This correctly reflects the HuggingFace checkpoint structure, where `in_proj_qkv` is a single fused tensor of shape `[key_dim * 2 + value_dim, hidden_size]` and `in_proj_z` is `[value_dim, hidden_size]`.
- `stacked_params_mapping` shard IDs: changed from the tuple shard IDs `(0, 1, 2)` and index `3` to simple integer shard IDs `0` and `1`, matching the 2-slice `output_sizes`.

The formula `key_dim * 2 + value_dim` was verified against the HuggingFace checkpoint for Qwen3.5-9B: `in_proj_qkv.weight` has shape `[8192, 4096]` where `key_dim = 2048` and `value_dim = 4096`, so `2048 * 2 + 4096 = 8192`.

Additional Fix: `gdn_in_proj` output size for quantized models

After merging with upstream `main`, commit `f1740006e` ([Perf] Enable dual stream execution of input projection for Qwen3 #36795) refactored the GDN forward pass to use a `gdn_in_proj` custom op for `torch.compile` compatibility. This introduced a new bug for quantized models (AWQ/GPTQ) with LoRA.

What broke
The new forward method passes `self.in_proj_qkvz.weight.shape[0]` to `gdn_in_proj` as the output tensor size for `torch.compile`'s fake implementation. With LoRA + AWQ, `self.in_proj_qkvz` is a `MergedColumnParallelLinearWithLoRA` wrapper. Its `.weight` property (`BaseLinearLayerWithLoRA.weight`) returns `self.base_layer.qweight` for AWQ models. The AWQ `qweight` is a packed 4-bit representation with `shape[0] = input_size // 8` (e.g., `2048 // 8 = 256`), not the actual output dimension (12288). This causes `torch.compile` to trace the graph with wrong tensor shapes, and the subsequent `.split([8192, 4096])` fails.

Fix
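To make concrete why reading the weight shape had to go: with 4-bit packing, `weight.shape[0]` reports a packed dimension, nothing like the output dimension the split expects. This is a plain-arithmetic illustration using the numbers quoted above; the variables are hypothetical, not vLLM's tensors:

```python
# With 4-bit AWQ packing, eight values share one int32, so the stored
# qweight's leading dimension is a packed input dimension, not the
# output dimension. Numbers below are the ones quoted in this PR.
input_size = 2048                 # per-partition input size
actual_output_dim = 12288         # real fused projection output dimension
qweight_shape0 = input_size // 8  # what .weight.shape[0] reports under AWQ

print(qweight_shape0)                       # 256 — far from the 12288 needed
print(qweight_shape0 == actual_output_dim)  # False: hence the bad trace
```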
Compute output sizes analytically from model dimensions instead of reading from the weight tensor:
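The original code block here did not survive extraction; a hedged sketch of the analytic computation follows. The parameter names (`key_dim`, `value_dim`, `num_v_heads`, `tp_size`) come from the commit message; the `in_proj_ba` formula (two scalars per v head) and the sample `num_v_heads` value are assumptions, and the exact vLLM code may differ:

```python
# Hedged sketch: compute the gdn_in_proj output sizes from model
# dimensions rather than from .weight.shape[0]. Assumes even
# divisibility by tp_size; ba_size (2 scalars per v head) is an
# assumption based on the commit message, not confirmed vLLM code.

def gdn_in_proj_sizes(key_dim: int, value_dim: int, num_v_heads: int, tp_size: int):
    qkvz_size = (key_dim * 2 + value_dim * 2) // tp_size  # in_proj_qkvz rows per rank
    ba_size = (num_v_heads * 2) // tp_size                # in_proj_ba rows per rank
    return qkvz_size, ba_size

# Qwen3.5-9B numbers quoted in this PR: key_dim=2048, value_dim=4096;
# num_v_heads=32 is a placeholder for illustration.
qkvz, ba = gdn_in_proj_sizes(2048, 4096, 32, 1)
print((qkvz, ba))           # (12288, 64) with tp_size=1
print(qkvz == 8192 + 4096)  # True: matches the .split([8192, 4096]) sizes
print(gdn_in_proj_sizes(2048, 4096, 32, 2))  # (6144, 32) with tp_size=2
```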
These computed values are identical to `weight.shape[0]` for non-quantized models (`MergedColumnParallelLinear` stores the weight as `(output_size_per_partition, input_size)`), so there is no regression. For quantized models, this bypasses the packed weight shape entirely.

Tested with

- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

AI assistance was used for this fix. All changes reviewed and tested manually.