
[Disagg][NIXL] Fix heterogeneous TP KV transfer for non-MLA models (same logic with mooncake, Step 1/2 for Qwen3.5 support)#22145

Merged
ShangmingCai merged 3 commits into sgl-project:main from YAMY1234:fix/nixl-heterogeneous-tp-non-mla
Apr 7, 2026

Conversation


@YAMY1234 YAMY1234 commented Apr 5, 2026

Motivation

NIXL disaggregated serving with heterogeneous TP (prefill TP ≠ decode TP) on non-MLA models hangs indefinitely due to two bugs in nixl/conn.py:

  1. Notification key collision: _process_kvcache_transfer uses pp_rank in RDMA notification tags. With PP=1, all prefill ranks share pp_rank=0, so TransferStatus.received_kvs_per_pp only records one key while num_pp_ranks_expected > 1; is_done() never returns True → decode hangs.

  2. Wrong head distribution: send_kvcache_slice uses the per-rank kv_head_num instead of total_kv_head_num, which miscounts heads under GQA (total_kv_heads < tp_size). It also lacks GQA replication handling, so multiple prefill ranks sharing the same KV heads compute an incorrect dst_head_start_offset.
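The notification-key collision can be illustrated with a toy model. This is a minimal sketch, not the sglang code: TransferStatus, received_kvs_per_pp, and is_done() mirror names from nixl/conn.py, but record() and the set-based bookkeeping are simplifications for illustration.

```python
class TransferStatus:
    """Toy stand-in for the decode-side transfer tracker."""
    def __init__(self, num_pp_ranks_expected):
        self.num_pp_ranks_expected = num_pp_ranks_expected
        self.received_kvs_per_pp = set()  # keys seen in KV notifications

    def record(self, key):
        self.received_kvs_per_pp.add(key)

    def is_done(self):
        return len(self.received_kvs_per_pp) >= self.num_pp_ranks_expected

# Buggy tagging: 4 prefill ranks, PP=1, so every rank sends pp_rank=0
# and all notifications collapse into a single key.
status_buggy = TransferStatus(num_pp_ranks_expected=4)
for engine_rank in range(4):
    status_buggy.record(0)            # pp_rank is 0 on every rank
print(status_buggy.is_done())         # False -> decode waits forever

# Fixed tagging: engine_rank is unique per prefill rank.
status_fixed = TransferStatus(num_pp_ranks_expected=4)
for engine_rank in range(4):
    status_fixed.record(engine_rank)
print(status_fixed.is_done())         # True
```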

Modifications

  • send_kvcache_slice(): derive head counts from total_kv_head_num with max(1, ...) guards; add src_replication / unique_head_idx for GQA replication, aligned with Mooncake's implementation.
  • _process_kvcache_transfer(): replace pp_rank with engine_rank in KV/state notification tags.
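The corrected head bookkeeping can be sketched roughly as below. The function name head_slice_params and its exact formulas are illustrative assumptions based on the PR description; the real conn.py code differs in detail.

```python
def head_slice_params(total_kv_head_num, prefill_tp, decode_tp, prefill_rank):
    # Heads owned per rank, clamped with max(1, ...) so GQA replication
    # (total heads < tp size) never yields a zero count.
    src_heads = max(1, total_kv_head_num // prefill_tp)
    dst_heads = max(1, total_kv_head_num // decode_tp)

    # When total_kv_head_num < prefill_tp, several prefill ranks hold
    # copies of the same KV head; they form replication groups of this size.
    src_replication = max(1, prefill_tp // total_kv_head_num)
    unique_head_idx = prefill_rank // src_replication

    # Start offset of this rank's heads inside the destination rank's
    # head range.
    dst_head_start_offset = (unique_head_idx * src_heads) % dst_heads
    return src_heads, dst_heads, unique_head_idx, dst_head_start_offset

# Example shape (8 KV heads assumed for illustration): prefill TP4 -> decode TP1.
print(head_slice_params(8, 4, 1, 2))   # (2, 8, 2, 4)
```

Deriving everything from total_kv_head_num keeps the source and destination head ranges consistent even when the clamped per-rank count no longer divides evenly into the total.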

Accuracy Tests

Setup

  • Model: Qwen3-32B
  • Platform: GB200
  • Topology: 1P4D (prefill TP4 → decode TP1×4)
  • Backend: NIXL
  • Frontend: Dynamo
  • Eval: GSM8K 8-shot
  • Examples: 1311

With fix

100%|██████████| 1311/1311 [08:52<00:00,  2.46it/s]
Total latency: 532.396 s
Score: 0.961
Output throughput: 568960.187 token/s
[METRIC] gsm8k_score=0.9610983981693364 labels={"model": "Qwen/Qwen3-32B", "eval": "gsm8k"}
[METRIC] gsm8k_latency=532.3955084759946 labels={"model": "Qwen/Qwen3-32B", "eval": "gsm8k"}
{'score:std': np.float64(0.19336045926112227), 'score': np.float64(0.9610983981693364), 'latency': 532.3955084759946, 'output_throughput': 568960.1868864341}

Without fix

Same config as above

Result

  • Decode hangs indefinitely

  • 0 completions

  • Manually cancelled after ~50 min

    0%| | 0/1311 [00:00<?, ?it/s]

Prefill drained all requests but they stayed permanently in-flight (#inflight-req never drops to 0):

Prefill batch, #new-seq: 1, #new-token: 1536, #cached-token: 0, token usage: 0.00,
  #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 2

Decode workers had zero decode activity after startup; they only exited due to manual cancellation:

WARN  Performance is NOT guaranteed when using different TP sizes for non-MLA models.
* STEP 1383563.6 ON lyris0105 CANCELLED AT 2026-04-05T00:43:19 DUE to SIGNAL Terminated *

Config details:

  name: "qwen3-32b-hetero-tp4-tp1-nixl-verify"
  model:
    path: "Qwen/Qwen3-32B"
    precision: "bf16"
  resources:
    gpus_per_node: 4
    prefill_nodes: 1      # 1 prefill worker @ TP4 = 4 GPUs
    decode_nodes: 1       # 4 decode workers @ TP1 = 4 GPUs
    prefill_workers: 1
    decode_workers: 4
  backend:
    type: sglang
    sglang_config:
      prefill:
        served-model-name: "Qwen/Qwen3-32B"
        trust-remote-code: true
        tensor-parallel-size: 4
        disaggregation-mode: "prefill"
        disaggregation-transfer-backend: "nixl"
        mem-fraction-static: 0.85
        context-length: 32768
        page-size: 64
        disable-radix-cache: true
        watchdog-timeout: 1000000
      decode:
        served-model-name: "Qwen/Qwen3-32B"
        trust-remote-code: true
        tensor-parallel-size: 1
        disaggregation-mode: "decode"
        disaggregation-transfer-backend: "nixl"
        mem-fraction-static: 0.85
        context-length: 32768
        page-size: 64
        disable-radix-cache: true
        watchdog-timeout: 1000000
  benchmark:
    type: "gsm8k"
    num_examples: 1319
    max_tokens: 16000
    num_threads: 512
    num_shots: 8

Checklist

…geneous TP

send_kvcache_slice used the per-rank kv_head_num instead of total_kv_head_num
for head distribution, which loses information under GQA (total_heads < tp_size).
It also lacked GQA replication handling, causing multiple prefill ranks
sharing the same KV heads to write to the wrong dst offsets.

This resulted in corrupted KV cache data on the decode side, producing 0%
accuracy on GPQA, while the staging path (which uses compute_head_slice_params)
remained correct.

Fix: use total_kv_head_num with max(1,...) guards and add src_replication /
unique_head_idx logic, aligned with Mooncake's send_kvcache_slice.
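A small worked example of why the per-rank count is lossy and how src_replication / unique_head_idx recover the mapping. The numbers are assumed for illustration, not taken from the PR.

```python
# Assumed GQA shape: 2 total KV heads at prefill TP=4, so each head is
# replicated across 2 prefill ranks.
total_kv_head_num = 2
prefill_tp = 4

# Per-rank head count is clamped to 1 for replicated heads, so scaling it
# back up by tp_size over-counts the heads: this is the lossy derivation.
kv_head_num = max(1, total_kv_head_num // prefill_tp)   # 1
wrong_total = kv_head_num * prefill_tp                  # 4, but the model has 2

# Replication-aware mapping: ranks in the same group hold the same head.
src_replication = max(1, prefill_tp // total_kv_head_num)   # 2
rank_to_head = {r: r // src_replication for r in range(prefill_tp)}
print(rank_to_head)   # {0: 0, 1: 0, 2: 1, 3: 1}
```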

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refines the KV cache slicing logic by utilizing the total KV head count to ensure accurate head distribution and handle GQA replication. It also updates transfer notification strings to use engine_rank instead of pp_rank to prevent collisions. A potential ZeroDivisionError was found in the head distribution logic, and a suggestion was made to ensure the total head count is positive before use.

Under heterogeneous TP (prefill TP > decode TP) with PP=1, all prefill
ranks share pp_rank=0, causing RDMA notifications to collapse into a
single key in TransferStatus.received_kvs_per_pp. Since
num_pp_ranks_expected equals the number of prefill ranks (e.g. 4)
but only one unique pp_rank key is ever recorded, is_done() never
returns True and decode hangs indefinitely.

Fix by using engine_rank (which is unique per prefill rank) instead
of pp_rank in kv and state notification tags. This is a pre-existing
bug that affects any NIXL + heterogeneous TP (prefill TP > decode TP)
+ PP=1 configuration with non-MLA models.

Made-with: Cursor
@YAMY1234 YAMY1234 force-pushed the fix/nixl-heterogeneous-tp-non-mla branch from 410e7e6 to 2c7b29e on April 5, 2026 at 09:02

@ShangmingCai ShangmingCai left a comment


LGTM

@ShangmingCai

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Apr 5, 2026
@ShangmingCai

/rerun-failed-ci

@ShangmingCai

/rerun-failed-ci

@ShangmingCai

PD CI has passed.

@ShangmingCai ShangmingCai merged commit 3148742 into sgl-project:main Apr 7, 2026
249 of 299 checks passed