
Fix lora buffer sizing for tp>num_kv cases #24239

Closed
opherlieber wants to merge 1 commit into sgl-project:main from opherlieber:lora-buffer-sizing-tp-gt-kv

Conversation

@opherlieber
Contributor

Motivation

  • LoRA adapter loading failed (or silently dropped the V projection) at TP sizes that require KV-head replication. This PR fixes both the buffer allocation and the slice-time offset arithmetic so adapters load correctly at any TP size.
  • Concrete trigger: Qwen3.5-35B-A3B (num_kv_heads=2) at TP=8 errored with "LoRA buffer shape [1152, 32] does not match weight shape [1280, 32]"; at TP=4 it loaded, but the V slice was empty (0 rows), silently producing wrong logprobs. The sketch after this list walks through the sizing arithmetic.
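
To make the trigger concrete, here is a minimal sizing sketch. The query-head count and head_dim are hypothetical stand-ins (chosen so the per-rank row count reproduces the [1280, 32] weight shape above), not the actual Qwen3.5-35B-A3B config.

```python
def qkv_rows_per_rank(total_q_heads: int, total_kv_heads: int,
                      head_dim: int, tp_size: int) -> int:
    """Rows each rank's fused QKV weight (and its LoRA-B buffer) must hold."""
    q_rows = (total_q_heads // tp_size) * head_dim
    # When tp_size > total_kv_heads, KV heads are replicated so that every
    # rank still holds at least one K head and one V head.
    kv_heads_per_rank = max(1, total_kv_heads // tp_size)
    return q_rows + 2 * kv_heads_per_rank * head_dim

# Hypothetical config: 64 query heads, 2 KV heads, head_dim=128.
print(qkv_rows_per_rank(64, 2, 128, tp_size=8))  # 1280
```

A buffer sized from the pre-replication KV count reserves fewer rows per rank, which is the shape mismatch reported above.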

Modifications

  • QKVParallelLinearWithLoRA.slice_lora_b_weights: compute q_size / k_size from the pre-replication head counts (base_layer.total_num_heads * head_size and base_layer.total_num_kv_heads * head_size) instead of from output_sizes.
  • New helper lora.utils.get_qkv_lora_kv_total(num_key_value_heads): returns the KV-head count the qkv LoRA buffer must reserve once replication is accounted for. It asserts attn_tp_size == tp_size, since the rest of the LoRA path (especially mem_pool.py) sizes buffers by the global tp_size; DP-attention and context-parallel are not yet supported with LoRA.
  • Use get_qkv_lora_kv_total(...) in the qkv_proj branch of get_hidden_dim in lora/utils.py, models/nemotron_h.py, and models/qwen3_5.py. Both changes are sketched after this list.
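
A rough sketch of both changes, under stated assumptions: the free-function signatures below (explicit tp_size / tp_rank / attn_tp_size arguments, a plain tensor instead of self.base_layer) are illustrative only; in the actual PR the helper presumably reads the TP sizes from the parallel state, and the slicing logic is a method on QKVParallelLinearWithLoRA.

```python
import torch


def get_qkv_lora_kv_total(num_key_value_heads: int, tp_size: int,
                          attn_tp_size: int) -> int:
    """KV-head count the fused-QKV LoRA buffer must reserve.

    When tp_size > num_key_value_heads, each rank replicates one KV head,
    so globally the buffer must budget for tp_size KV heads.
    """
    # The rest of the LoRA path (esp. mem_pool.py) sizes buffers by the
    # global tp_size, so DP-attention / context-parallel are unsupported.
    assert attn_tp_size == tp_size
    return max(num_key_value_heads, tp_size)


def slice_qkv_lora_b(B: torch.Tensor, total_num_heads: int,
                     total_num_kv_heads: int, head_size: int,
                     tp_size: int, tp_rank: int) -> torch.Tensor:
    """Slice a global LoRA-B matrix of shape [q + k + v, r] for one rank,
    computing offsets from pre-replication head counts."""
    q_total = total_num_heads * head_size
    kv_total = total_num_kv_heads * head_size
    q_shard = q_total // tp_size
    if tp_size > total_num_kv_heads:
        # Replication: groups of tp_size // total_num_kv_heads ranks share
        # one KV head, so each rank slices exactly one head of K and of V.
        kv_shard = head_size
        kv_rank = tp_rank // (tp_size // total_num_kv_heads)
    else:
        kv_shard = kv_total // tp_size
        kv_rank = tp_rank
    q = B[tp_rank * q_shard : (tp_rank + 1) * q_shard]
    k = B[q_total + kv_rank * kv_shard : q_total + (kv_rank + 1) * kv_shard]
    v_off = q_total + kv_total
    v = B[v_off + kv_rank * kv_shard : v_off + (kv_rank + 1) * kv_shard]
    return torch.cat([q, k, v], dim=0)


# Smoke test on the hypothetical config (64 Q heads, 2 KV heads, head_dim=128,
# LoRA rank 32): every rank now gets the full 1280-row QKV slice.
B = torch.zeros(64 * 128 + 2 * 2 * 128, 32)
print(slice_qkv_lora_b(B, 64, 2, 128, tp_size=8, tp_rank=5).shape)  # [1280, 32]
print(get_qkv_lora_kv_total(2, tp_size=8, attn_tp_size=8))          # 8
```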

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses issues with LoRA weight slicing and buffer sizing when tensor parallelism size exceeds the number of KV heads. It introduces a utility function get_qkv_lora_kv_total to correctly calculate the replicated KV head count and updates the get_hidden_dim logic across several model implementations and utility layers to ensure consistent memory allocation and weight slicing. I have no feedback to provide.

@yushengsu-thu self-assigned this May 1, 2026
@jybsuper
Collaborator

jybsuper commented May 6, 2026

Fixed in #24420

@jybsuper closed this May 6, 2026