
Fix lora buffer sizing for tp>num_kv cases #24239

Closed
opherlieber wants to merge 1 commit into sgl-project:main from opherlieber:lora-buffer-sizing-tp-gt-kv

Conversation

@opherlieber
Contributor

Motivation

  • LoRA adapter loading failed (or silently dropped the V projection) at TP sizes that require KV-head replication. This PR fixes both the buffer allocation and the slice-time offset arithmetic so adapters load correctly at any TP size.
  • Concrete trigger: Qwen3.5-35B-A3B (num_kv_heads=2) at TP=8 errored with "LoRA buffer shape [1152, 32] does not match weight shape [1280, 32]"; at TP=4 it loaded, but the V slice was empty (0 rows), silently producing wrong logprobs. The sketch after this list walks through the sizing arithmetic.
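
To make the trigger concrete, here is a minimal sizing sketch. The query-head count and head_dim are hypothetical stand-ins (chosen so the per-rank row count reproduces the [1280, 32] weight shape above), not the actual Qwen3.5-35B-A3B config.

```python
def qkv_rows_per_rank(total_q_heads: int, total_kv_heads: int,
                      head_dim: int, tp_size: int) -> int:
    """Rows each rank's fused QKV weight (and its LoRA-B buffer) must hold."""
    q_rows = (total_q_heads // tp_size) * head_dim
    # When tp_size > total_kv_heads, KV heads are replicated so that every
    # rank still holds at least one K head and one V head.
    kv_heads_per_rank = max(1, total_kv_heads // tp_size)
    return q_rows + 2 * kv_heads_per_rank * head_dim

# Hypothetical config: 64 query heads, 2 KV heads, head_dim=128.
print(qkv_rows_per_rank(64, 2, 128, tp_size=8))  # 1280
```

A buffer sized from the pre-replication KV count reserves fewer rows per rank, which is the shape mismatch reported above.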

Modifications

  • QKVParallelLinearWithLoRA.slice_lora_b_weights: compute q_size / k_size from the pre-replication head counts (base_layer.total_num_heads * head_size and base_layer.total_num_kv_heads * head_size) instead of from output_sizes.
  • New helper lora.utils.get_qkv_lora_kv_total(num_key_value_heads): returns the KV-head count the qkv LoRA buffer must reserve once replication is accounted for. It asserts attn_tp_size == tp_size, since the rest of the LoRA path (especially mem_pool.py) sizes buffers by the global tp_size; DP-attention and context-parallel are not yet supported with LoRA.
  • Use get_qkv_lora_kv_total(...) in the qkv_proj branch of get_hidden_dim in lora/utils.py, models/nemotron_h.py, and models/qwen3_5.py. Both changes are sketched after this list.
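
A rough sketch of both changes, under stated assumptions: the free-function signatures below (explicit tp_size / tp_rank / attn_tp_size arguments, a plain tensor instead of self.base_layer) are illustrative only; in the actual PR the helper presumably reads the TP sizes from the parallel state, and the slicing logic is a method on QKVParallelLinearWithLoRA.

```python
import torch


def get_qkv_lora_kv_total(num_key_value_heads: int, tp_size: int,
                          attn_tp_size: int) -> int:
    """KV-head count the fused-QKV LoRA buffer must reserve.

    When tp_size > num_key_value_heads, each rank replicates one KV head,
    so globally the buffer must budget for tp_size KV heads.
    """
    # The rest of the LoRA path (esp. mem_pool.py) sizes buffers by the
    # global tp_size, so DP-attention / context-parallel are unsupported.
    assert attn_tp_size == tp_size
    return max(num_key_value_heads, tp_size)


def slice_qkv_lora_b(B: torch.Tensor, total_num_heads: int,
                     total_num_kv_heads: int, head_size: int,
                     tp_size: int, tp_rank: int) -> torch.Tensor:
    """Slice a global LoRA-B matrix of shape [q + k + v, r] for one rank,
    computing offsets from pre-replication head counts."""
    q_total = total_num_heads * head_size
    kv_total = total_num_kv_heads * head_size
    q_shard = q_total // tp_size
    if tp_size > total_num_kv_heads:
        # Replication: groups of tp_size // total_num_kv_heads ranks share
        # one KV head, so each rank slices exactly one head of K and of V.
        kv_shard = head_size
        kv_rank = tp_rank // (tp_size // total_num_kv_heads)
    else:
        kv_shard = kv_total // tp_size
        kv_rank = tp_rank
    q = B[tp_rank * q_shard : (tp_rank + 1) * q_shard]
    k = B[q_total + kv_rank * kv_shard : q_total + (kv_rank + 1) * kv_shard]
    v_off = q_total + kv_total
    v = B[v_off + kv_rank * kv_shard : v_off + (kv_rank + 1) * kv_shard]
    return torch.cat([q, k, v], dim=0)


# Smoke test on the hypothetical config (64 Q heads, 2 KV heads, head_dim=128,
# LoRA rank 32): every rank now gets the full 1280-row QKV slice.
B = torch.zeros(64 * 128 + 2 * 2 * 128, 32)
print(slice_qkv_lora_b(B, 64, 2, 128, tp_size=8, tp_rank=5).shape)  # [1280, 32]
print(get_qkv_lora_kv_total(2, tp_size=8, attn_tp_size=8))          # 8
```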

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses issues with LoRA weight slicing and buffer sizing when tensor parallelism size exceeds the number of KV heads. It introduces a utility function get_qkv_lora_kv_total to correctly calculate the replicated KV head count and updates the get_hidden_dim logic across several model implementations and utility layers to ensure consistent memory allocation and weight slicing. I have no feedback to provide.

@yushengsu-thu self-assigned this May 1, 2026
@jybsuper
Collaborator

jybsuper commented May 6, 2026

Fixed in #24420

@jybsuper closed this May 6, 2026