
[BugFix] Fix Qwen3.5 LoRA IndexError in GDN fused projections#36309

Closed
JWriter20 wants to merge 7 commits into vllm-project:main from JWriter20:qwen3_5_lora_fix

Conversation


@JWriter20 JWriter20 commented Mar 7, 2026

Fixes #35286

Purpose

Fixes an IndexError: list index out of range crash when using LoRA adapters with Qwen3.5 models (dense, MoE, and multimodal variants).

The crash occurs in MergedColumnParallelLinearWithLoRA.set_lora because there is a mismatch between output_sizes (4 slices) and packed_modules_mapping (2 entries) for the GDN (Gated Delta Network) fused projection layers (in_proj_qkvz and in_proj_ba).

Root Cause (Original Fix)

In Qwen3_5GatedDeltaNet.create_qkvz_proj, the MergedColumnParallelLinear was created with:

output_sizes=[key_dim, key_dim, value_dim, value_dim]  # 4 slices

But packed_modules_mapping defined only 2 entries:

"in_proj_qkvz": ["in_proj_qkv", "in_proj_z"]  # 2 entries

The LoRA system requires these to match: set_lora iterates over n_slices = len(output_sizes) and indexes into lora_a[i], but only 2 LoRA weight tensors are created (one per packed_modules_mapping entry), causing IndexError at index 2.
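The failure mode can be reproduced with a minimal standalone sketch (plain Python, not vLLM code; the sizes mirror the Qwen3.5-9B dims quoted below):

```python
# Minimal repro of the slice-count mismatch: set_lora-style iteration over
# 4 output slices, while only 2 LoRA weight tensors (one per
# packed_modules_mapping entry) were allocated.
output_sizes = [2048, 2048, 4096, 4096]                 # 4 slices (buggy)
packed_modules_mapping = ["in_proj_qkv", "in_proj_z"]   # 2 entries
lora_a = [f"lora_a_{name}" for name in packed_modules_mapping]

try:
    for i in range(len(output_sizes)):  # n_slices = 4
        _ = lora_a[i]                   # raises IndexError at i == 2
except IndexError:
    print(f"IndexError at slice {i}")   # → IndexError at slice 2
```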

Fix (Original)

  1. output_sizes: Changed from [key_dim, key_dim, value_dim, value_dim] (4 slices) to [key_dim * 2 + value_dim, value_dim] (2 slices). This correctly reflects the HuggingFace checkpoint structure where in_proj_qkv is a single fused tensor of shape [key_dim * 2 + value_dim, hidden_size] and in_proj_z is [value_dim, hidden_size].

  2. stacked_params_mapping shard IDs: Changed from tuple shard ID (0, 1, 2) and index 3 to simple integer shard IDs 0 and 1, matching the 2-slice output_sizes.

The formula key_dim * 2 + value_dim was verified against the HuggingFace checkpoint for Qwen3.5-9B: in_proj_qkv.weight has shape [8192, 4096] where key_dim=2048 and value_dim=4096, so 2048*2 + 4096 = 8192.
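As a quick arithmetic check (a standalone sketch using the dimensions stated above, not vLLM code), the 2-slice layout preserves the total projection width and matches the checkpoint shape:

```python
key_dim, value_dim = 2048, 4096  # Qwen3.5-9B dims from the checkpoint check

old_sizes = [key_dim, key_dim, value_dim, value_dim]   # buggy 4-slice layout
new_sizes = [key_dim * 2 + value_dim, value_dim]       # fixed 2-slice layout

assert sum(old_sizes) == sum(new_sizes)   # total output width preserved
assert len(new_sizes) == 2                # matches 2-entry packed_modules_mapping
assert new_sizes[0] == 8192               # in_proj_qkv.weight is [8192, 4096]
```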


Additional Fix: gdn_in_proj output size for quantized models

After merging with upstream main, commit f1740006e ([Perf] Enable dual stream execution of input projection for Qwen3 #36795) refactored the GDN forward pass to use a gdn_in_proj custom op for torch.compile compatibility. This introduced a new bug for quantized models (AWQ/GPTQ) with LoRA.

What broke

The new forward method passes self.in_proj_qkvz.weight.shape[0] to gdn_in_proj as the output tensor size for torch.compile's fake implementation:

mixed_qkvz, ba = torch.ops.vllm.gdn_in_proj(
    hidden_states,
    self.in_proj_qkvz.weight.shape[0],  # BUG: wrong for AWQ/GPTQ + LoRA
    self.in_proj_ba.weight.shape[0],
    self.prefix,
)

With LoRA + AWQ, self.in_proj_qkvz is a MergedColumnParallelLinearWithLoRA wrapper. Its .weight property (BaseLinearLayerWithLoRA.weight) returns self.base_layer.qweight for AWQ models. The AWQ qweight is a packed 4-bit representation with shape[0] = input_size // 8 (e.g., 2048 // 8 = 256), not the actual output dimension (12288).

This causes torch.compile to trace the graph with wrong tensor shapes, and the subsequent .split([8192, 4096]) fails:

TorchRuntimeError: Split sizes add up to 12288 but got the tensor's size of 256
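The packed-shape arithmetic can be illustrated without vLLM (a sketch; the pack factor of 8 for 4-bit values in 32-bit storage matches the input_size // 8 figure above):

```python
bits = 4
pack_factor = 32 // bits                   # 8 int4 values packed per int32
input_size = 2048
qweight_dim0 = input_size // pack_factor   # 256: what .weight.shape[0] returns

split_sizes = [8192, 4096]                 # the .split() the forward pass does
assert qweight_dim0 == 256
assert sum(split_sizes) == 12288           # the real output dimension
assert sum(split_sizes) != qweight_dim0    # hence the TorchRuntimeError above
```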

Fix

Compute output sizes analytically from model dimensions instead of reading from the weight tensor:

qkv_size = (self.key_dim * 2 + self.value_dim) // self.tp_size
z_size = self.value_dim // self.tp_size
ba_size = (self.num_v_heads * 2) // self.tp_size
mixed_qkvz, ba = torch.ops.vllm.gdn_in_proj(
    hidden_states,
    qkv_size + z_size,
    ba_size,
    self.prefix,
)

These computed values are identical to weight.shape[0] for non-quantized models (MergedColumnParallelLinear stores weight as (output_size_per_partition, input_size)), so there is no regression. For quantized models, this bypasses the packed weight shape entirely.
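A sketch validating the analytic formulas against the Qwen3.5-9B dims quoted in this PR (num_v_heads = 32 is an illustrative assumption, not taken from the model config):

```python
key_dim, value_dim = 2048, 4096   # from the PR's checkpoint verification
num_v_heads = 32                  # assumed for illustration only

for tp_size in (1, 2, 4):
    qkv_size = (key_dim * 2 + value_dim) // tp_size
    z_size = value_dim // tp_size
    ba_size = (num_v_heads * 2) // tp_size
    # For an unquantized MergedColumnParallelLinear, weight.shape[0] equals
    # the per-partition output width, i.e. the full qkvz width sharded by TP:
    assert qkv_size + z_size == (key_dim * 2 + value_dim * 2) // tp_size

print(qkv_size + z_size)  # per-partition in_proj_qkvz width at tp_size == 4 → 3072
```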

Tested with

| Model | Quantization | LoRA | torch.compile | Result |
|-------|--------------|------|---------------|--------|
| cyankiwi/Qwen3.5-9B-AWQ-4bit | AWQ 4-bit | ✅ 3 adapters | ✅ | ✅ Pass |
| Qwen/Qwen3.5-9B | None | — | ✅ | ✅ Pass |
| Qwen/Qwen3.5-9B | None | ✅ 3 adapters | ❌ (eager) | ✅ Pass |
| Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 | GPTQ 4-bit | — | ✅ | ✅ Pass |

AI assistance was used for this fix. All changes reviewed and tested manually.


github-actions bot commented Mar 7, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added qwen Related to Qwen models bug Something isn't working labels Mar 7, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an IndexError that occurs when using LoRA with Qwen3.5 models. The fix correctly aligns the output_sizes of the GDN fused projection layers with the corresponding packed_modules_mapping by changing the number of output slices from four to two. The stacked_params_mapping for weight loading is also updated to match the new slice configuration. Additionally, a comprehensive set of regression tests has been added to prevent this issue from recurring. The changes appear correct and effectively resolve the bug.


mergify bot commented Mar 7, 2026

Hi @JWriter20, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Fix IndexError: list index out of range when using LoRA adapters with
Qwen3.5 models (dense, MoE, and multimodal variants).

The root cause was a mismatch between the number of output_sizes slices
in the MergedColumnParallelLinear for in_proj_qkvz (4 slices) and the
number of entries in packed_modules_mapping (2 entries). When LoRA's
set_lora iterated over n_slices=4, it only had 2 LoRA weights available,
causing an IndexError at index 2.

Changes:
- Change create_qkvz_proj output_sizes from [key_dim, key_dim, value_dim,
  value_dim] to [key_dim * 2 + value_dim, value_dim] to match the 2-entry
  packed_modules_mapping ["in_proj_qkv", "in_proj_z"]
- Update stacked_params_mapping shard IDs from (0,1,2)/3 to 0/1 to match
  the new 2-slice layout
- Add regression tests validating the alignment between output_sizes and
  packed_modules_mapping for all Qwen3.5 variants

The fix preserves the total projection dimension (key_dim*2 + value_dim*2)
and matches the HuggingFace checkpoint structure where in_proj_qkv and
in_proj_z are stored as separate weight tensors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
@1dividedby0

This PR should be merged ASAP. We're already using it in production, and it's absolutely necessary for any reasonable usage of Qwen3.5.

@dcmaddix

This PR should be merged ASAP. We're already using it in production, and it's absolutely necessary for any reasonable usage of Qwen3.5.

We also need this fix. cc: @jeejeelee

@simon-mo simon-mo enabled auto-merge (squash) March 11, 2026 18:51
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 11, 2026

@jeejeelee jeejeelee left a comment


Have you tested this PR with the real LoRA adapter?


musab-mk commented Mar 12, 2026

I have tested this with a LoRA adapter, with Qwen/Qwen3.5-397B-A17B-FP8 as the base, but unfortunately got garbage output (random characters). (The adapter works if --enforce-eager is enabled.)


1dividedby0 commented Mar 12, 2026

@jeejeelee @musab-mk Interesting, we have been able to fine-tune LoRA adapters and get improved performance on Qwen3.5 (dense) as expected. We've deployed this PR with both our batched inference API and online inference, and both work very well on our side. In both cases, enforce_eager was false.

@jeejeelee

@musab-mk @1dividedby0 could you please test this PR: #36976

@musab-mk

@jeejeelee Yes, I just tested that PR with Qwen/Qwen3.5-397B-A17B-FP8 as the base. It worked perfectly.

auto-merge was automatically disabled March 19, 2026 04:22

Head branch was pushed to by a user without write access

…with LoRA

The `gdn_in_proj` custom op (introduced in f174000 / PR vllm-project#36795) uses
`self.in_proj_qkvz.weight.shape[0]` to communicate the output tensor
size to torch.compile's fake implementation. With LoRA + AWQ/GPTQ
quantization, `.weight` returns the quantized `qweight` whose shape is
packed (e.g. input_size // 8 for 4-bit), causing a dimension mismatch
in the subsequent `.split()` call.

Fix: compute output sizes analytically from model dimensions
(key_dim, value_dim, num_v_heads, tp_size) instead of reading from
the weight tensor shape. These computed values are identical to
weight.shape[0] for non-quantized models, so there is no regression.

Tested with:
- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
Add two tests to prevent future regressions:

1. test_qwen3_5_forward_does_not_use_weight_shape_for_gdn_in_proj:
   Verifies the forward method computes gdn_in_proj output sizes from
   model dimensions instead of .weight.shape[0], which returns wrong
   values for quantized models (AWQ/GPTQ) with LoRA.

2. test_qwen3_5_gdn_output_sizes_match_model_dims:
   Validates the computed output size formulas against known Qwen3.5-9B
   dimensions, including TP sharding correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>

mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @JWriter20.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
@1dividedby0

This is moving a bit slowly; just wanted to flag @jeejeelee.

@jeejeelee

Since we already have #36976, I think this PR should no longer be necessary.

@JWriter20 JWriter20 closed this Mar 23, 2026
@JWriter20

Since we already have #36976, I think this PR should no longer be necessary.

Got it, closed. Thank you!


Labels

bug Something isn't working needs-rebase qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Qwen3.5-MoE failed with enable_lora

7 participants