[BugFix] Fix Qwen3.5 LoRA IndexError in GDN fused projections#36309
JWriter20 wants to merge 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request addresses an IndexError that occurs when using LoRA with Qwen3.5 models. The fix correctly aligns the output_sizes of the GDN fused projection layers with the corresponding packed_modules_mapping by changing the number of output slices from four to two. The stacked_params_mapping for weight loading is also updated to match the new slice configuration. Additionally, a comprehensive set of regression tests has been added to prevent this issue from recurring. The changes appear correct and effectively resolve the bug.
Hi @JWriter20, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from cdae214 to c959452
Fix IndexError: list index out of range when using LoRA adapters with Qwen3.5 models (dense, MoE, and multimodal variants).

The root cause was a mismatch between the number of output_sizes slices in the MergedColumnParallelLinear for in_proj_qkvz (4 slices) and the number of entries in packed_modules_mapping (2 entries). When LoRA's set_lora iterated over n_slices=4, it only had 2 LoRA weights available, causing an IndexError at index 2.

Changes:
- Change create_qkvz_proj output_sizes from [key_dim, key_dim, value_dim, value_dim] to [key_dim * 2 + value_dim, value_dim] to match the 2-entry packed_modules_mapping ["in_proj_qkv", "in_proj_z"]
- Update stacked_params_mapping shard IDs from (0,1,2)/3 to 0/1 to match the new 2-slice layout
- Add regression tests validating the alignment between output_sizes and packed_modules_mapping for all Qwen3.5 variants

The fix preserves the total projection dimension (key_dim*2 + value_dim*2) and matches the HuggingFace checkpoint structure, where in_proj_qkv and in_proj_z are stored as separate weight tensors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Force-pushed from c959452 to e421959
This PR should be merged ASAP. We're already using it in production, and it's absolutely necessary for any reasonable usage of Qwen3.5.
We also need this fix. cc: @jeejeelee |
jeejeelee left a comment
Have you tested this PR with the real LoRA adapter?
I have tested this with a LoRA adapter, with Qwen/Qwen3.5-397B-A17B-FP8 as the base, but got garbage output (random characters), unfortunately. (The adapter works if --enforce-eager is enabled.)
@jeejeelee @musab-mk Interesting; we have been able to fine-tune LoRA adapters and get improved performance on Qwen3.5 (dense) as expected. We've deployed this PR with both our batched inference API and online inference; both work very well on our side. In both cases, enforce-eager was false.
@musab-mk @1dividedby0 could you please test this PR: #36976 |
@jeejeelee Yes, I just tested that PR with Qwen/Qwen3.5-397B-A17B-FP8 as the base; it worked perfectly.
Head branch was pushed to by a user without write access
…with LoRA

The `gdn_in_proj` custom op (introduced in f174000 / PR vllm-project#36795) uses `self.in_proj_qkvz.weight.shape[0]` to communicate the output tensor size to torch.compile's fake implementation. With LoRA + AWQ/GPTQ quantization, `.weight` returns the quantized `qweight`, whose shape is packed (e.g. input_size // 8 for 4-bit), causing a dimension mismatch in the subsequent `.split()` call.

Fix: compute output sizes analytically from model dimensions (key_dim, value_dim, num_v_heads, tp_size) instead of reading them from the weight tensor shape. These computed values are identical to weight.shape[0] for non-quantized models, so there is no regression.

Tested with:
- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Force-pushed from f47c8b7 to ce82f3d
Add two tests to prevent future regressions:

1. test_qwen3_5_forward_does_not_use_weight_shape_for_gdn_in_proj: verifies the forward method computes gdn_in_proj output sizes from model dimensions instead of .weight.shape[0], which returns wrong values for quantized models (AWQ/GPTQ) with LoRA.
2. test_qwen3_5_gdn_output_sizes_match_model_dims: validates the computed output-size formulas against known Qwen3.5-9B dimensions, including TP sharding correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Force-pushed from d399f88 to 08d7b09
This pull request has merge conflicts that must be resolved before it can be merged.
this is moving a bit slow, just wanted to flag @jeejeelee |
Since we already have #36976, I think this PR should no longer be necessary.
Got it, closed. Thank you! |
Fixes #35286
Purpose
Fixes an `IndexError: list index out of range` crash when using LoRA adapters with Qwen3.5 models (dense, MoE, and multimodal variants).

The crash occurs in `MergedColumnParallelLinearWithLoRA.set_lora` because there is a mismatch between `output_sizes` (4 slices) and `packed_modules_mapping` (2 entries) for the GDN (Gated Delta Network) fused projection layers (`in_proj_qkvz` and `in_proj_ba`).

Root Cause (Original Fix)
In `Qwen3_5GatedDeltaNet.create_qkvz_proj`, the `MergedColumnParallelLinear` was created with `output_sizes = [key_dim, key_dim, value_dim, value_dim]`, but `packed_modules_mapping` defined only 2 entries: `["in_proj_qkv", "in_proj_z"]`. The LoRA system requires these to match: `set_lora` iterates over `n_slices = len(output_sizes)` and indexes into `lora_a[i]`, but only 2 LoRA weight tensors are created (one per `packed_modules_mapping` entry), causing an `IndexError` at index 2.

Fix (Original)
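Both the failure and the 2-slice fix can be reproduced with a short sketch (toy Python stand-ins for the LoRA tensors; the names and sizes follow this description, not vLLM's actual code):

```python
# Toy reproduction of the slice-count mismatch; names mirror the PR
# description (set_lora, output_sizes), but the objects are plain
# Python stand-ins, not vLLM classes.

def set_lora_sketch(output_sizes, lora_a):
    # set_lora iterates n_slices = len(output_sizes) and indexes lora_a[i]
    for i in range(len(output_sizes)):
        _ = lora_a[i]  # IndexError once i exceeds len(lora_a) - 1

key_dim, value_dim = 2048, 4096
lora_a = ["lora_for_in_proj_qkv", "lora_for_in_proj_z"]  # one per mapping entry

broken = [key_dim, key_dim, value_dim, value_dim]  # 4 slices (before the fix)
fixed = [key_dim * 2 + value_dim, value_dim]       # 2 slices (after the fix)

try:
    set_lora_sketch(broken, lora_a)
except IndexError:
    print("IndexError at slice index 2, as reported")

set_lora_sketch(fixed, lora_a)    # 2 slices: no error
print(sum(broken) == sum(fixed))  # True: total projection dim preserved
```

The key invariant is the last line: the fix only regroups the slices, so the total fused projection dimension is unchanged.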
- `output_sizes`: changed from `[key_dim, key_dim, value_dim, value_dim]` (4 slices) to `[key_dim * 2 + value_dim, value_dim]` (2 slices). This correctly reflects the HuggingFace checkpoint structure, where `in_proj_qkv` is a single fused tensor of shape `[key_dim * 2 + value_dim, hidden_size]` and `in_proj_z` is `[value_dim, hidden_size]`.
- `stacked_params_mapping` shard IDs: changed from the tuple shard IDs `(0, 1, 2)` and index `3` to simple integer shard IDs `0` and `1`, matching the 2-slice `output_sizes`.

The formula `key_dim * 2 + value_dim` was verified against the HuggingFace checkpoint for Qwen3.5-9B: `in_proj_qkv.weight` has shape `[8192, 4096]` where `key_dim = 2048` and `value_dim = 4096`, so `2048 * 2 + 4096 = 8192`.

Additional Fix: `gdn_in_proj` output size for quantized models

After merging with upstream `main`, commit `f1740006e` ([Perf] Enable dual stream execution of input projection for Qwen3 #36795) refactored the GDN forward pass to use a `gdn_in_proj` custom op for `torch.compile` compatibility. This introduced a new bug for quantized models (AWQ/GPTQ) with LoRA.

What broke
The new forward method passes `self.in_proj_qkvz.weight.shape[0]` to `gdn_in_proj` as the output tensor size for `torch.compile`'s fake implementation. With LoRA + AWQ, `self.in_proj_qkvz` is a `MergedColumnParallelLinearWithLoRA` wrapper. Its `.weight` property (`BaseLinearLayerWithLoRA.weight`) returns `self.base_layer.qweight` for AWQ models. The AWQ `qweight` is a packed 4-bit representation with `shape[0] = input_size // 8` (e.g., `2048 // 8 = 256`), not the actual output dimension (12288). This causes `torch.compile` to trace the graph with wrong tensor shapes, and the subsequent `.split([8192, 4096])` fails.

Fix
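To make concrete why reading the weight shape had to go: with 4-bit packing, `weight.shape[0]` reports a packed dimension, nothing like the output dimension the split expects. This is a plain-arithmetic illustration using the numbers quoted above; the variables are hypothetical, not vLLM's tensors:

```python
# With 4-bit AWQ packing, eight values share one int32, so the stored
# qweight's leading dimension is a packed input dimension, not the
# output dimension. Numbers below are the ones quoted in this PR.
input_size = 2048                 # per-partition input size
actual_output_dim = 12288         # real fused projection output dimension
qweight_shape0 = input_size // 8  # what .weight.shape[0] reports under AWQ

print(qweight_shape0)                       # 256 — far from the 12288 needed
print(qweight_shape0 == actual_output_dim)  # False: hence the bad trace
```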
Compute output sizes analytically from model dimensions instead of reading from the weight tensor:
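The original code block here did not survive extraction; a hedged sketch of the analytic computation follows. The parameter names (`key_dim`, `value_dim`, `num_v_heads`, `tp_size`) come from the commit message; the `in_proj_ba` formula (two scalars per v head) and the sample `num_v_heads` value are assumptions, and the exact vLLM code may differ:

```python
# Hedged sketch: compute the gdn_in_proj output sizes from model
# dimensions rather than from .weight.shape[0]. Assumes even
# divisibility by tp_size; ba_size (2 scalars per v head) is an
# assumption based on the commit message, not confirmed vLLM code.

def gdn_in_proj_sizes(key_dim: int, value_dim: int, num_v_heads: int, tp_size: int):
    qkvz_size = (key_dim * 2 + value_dim * 2) // tp_size  # in_proj_qkvz rows per rank
    ba_size = (num_v_heads * 2) // tp_size                # in_proj_ba rows per rank
    return qkvz_size, ba_size

# Qwen3.5-9B numbers quoted in this PR: key_dim=2048, value_dim=4096;
# num_v_heads=32 is a placeholder for illustration.
qkvz, ba = gdn_in_proj_sizes(2048, 4096, 32, 1)
print((qkvz, ba))           # (12288, 64) with tp_size=1
print(qkvz == 8192 + 4096)  # True: matches the .split([8192, 4096]) sizes
print(gdn_in_proj_sizes(2048, 4096, 32, 2))  # (6144, 32) with tp_size=2
```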
These computed values are identical to `weight.shape[0]` for non-quantized models (`MergedColumnParallelLinear` stores the weight as `(output_size_per_partition, input_size)`), so there is no regression. For quantized models, this bypasses the packed weight shape entirely.

Tested with

- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

AI assistance was used for this fix. All changes reviewed and tested manually.