[Disagg][NIXL] Fix heterogeneous TP KV transfer for non-MLA models (same logic as Mooncake, Step 1/2 for Qwen3.5 support) #22145
Merged
ShangmingCai merged 3 commits into sgl-project:main on Apr 7, 2026
Conversation
…geneous TP, send_kvcache_slice used the per-rank kv_head_num instead of total_kv_head_num for head distribution, which loses precision under GQA (total_heads < tp_size). It also lacked GQA replication handling, causing multiple prefill ranks that share the same KV heads to write to wrong dst offsets. This corrupted KV cache data on the decode side, producing 0% accuracy on GPQA, while the staging path (which uses compute_head_slice_params) was correct. Fix: use total_kv_head_num with max(1, ...) guards and add src_replication / unique_head_idx logic, aligned with Mooncake's send_kvcache_slice.
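The head-distribution arithmetic described above can be sketched as follows. This is an illustrative sketch, not the actual sglang code: the variable names total_kv_head_num, src_replication, and unique_head_idx come from the PR description, but the function signature and surrounding wiring are hypothetical.

```python
# Hypothetical sketch of the corrected head distribution under
# heterogeneous TP with GQA (not the real send_kvcache_slice).
def slice_params(total_kv_head_num: int, prefill_tp: int, decode_tp: int,
                 prefill_rank: int):
    """Map a prefill rank's KV heads onto the decode side's head layout."""
    # Heads owned per rank; max(1, ...) guards the GQA case where
    # total_kv_head_num < tp_size and several ranks replicate one head.
    src_heads_per_rank = max(1, total_kv_head_num // prefill_tp)
    dst_heads_per_rank = max(1, total_kv_head_num // decode_tp)

    # How many prefill ranks share one KV head (GQA replication factor).
    src_replication = max(1, prefill_tp // total_kv_head_num)

    # Index of the unique KV head this rank holds; collapsing replicas
    # makes replicated ranks compute the same destination offset.
    unique_head_idx = (prefill_rank // src_replication) * src_heads_per_rank

    # Where this rank's heads land inside the destination rank's slice.
    dst_rank = unique_head_idx // dst_heads_per_rank
    dst_head_start_offset = unique_head_idx % dst_heads_per_rank
    return dst_rank, dst_head_start_offset, src_heads_per_rank


# No replication: 8 KV heads, prefill TP8 -> decode TP2, rank 3.
print(slice_params(8, 8, 2, 3))   # (0, 3, 1)

# GQA replication: 4 KV heads, prefill TP8 -> decode TP1. Ranks 6 and 7
# replicate the same head and must map to the same destination offset.
print(slice_params(4, 8, 1, 6))   # (0, 3, 1)
print(slice_params(4, 8, 1, 7))   # (0, 3, 1)
```

With the buggy per-rank kv_head_num, the GQA case above would compute distinct offsets for ranks 6 and 7, which is the dst-offset corruption the PR describes.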
Contributor
Code Review
This pull request refines the KV cache slicing logic by utilizing the total KV head count to ensure accurate head distribution and handle GQA replication. It also updates transfer notification strings to use engine_rank instead of pp_rank to prevent collisions. A potential ZeroDivisionError was found in the head distribution logic, and a suggestion was made to ensure the total head count is positive before use.
Under heterogeneous TP (prefill TP > decode TP) with PP=1, all prefill ranks share pp_rank=0, causing RDMA notifications to collapse into a single key in TransferStatus.received_kvs_per_pp. Since num_pp_ranks_expected equals the number of prefill ranks (e.g. 4), but only one unique pp_rank key is ever recorded, is_done() never returns True and decode hangs indefinitely. The fix is to use engine_rank (which is unique per prefill rank) instead of pp_rank in KV and state notification tags. This is a pre-existing bug that affects any NIXL + heterogeneous TP (prefill TP > decode TP) + PP=1 configuration with non-MLA models. Made-with: Cursor
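The deadlock mechanism above can be reproduced with a minimal sketch. The class and field names below are hypothetical simplifications of the TransferStatus bookkeeping the PR describes; only the keying behavior matters.

```python
# Minimal sketch (hypothetical names) of why keying RDMA notifications by
# pp_rank deadlocks decode under heterogeneous TP with PP=1, and why
# keying by engine_rank fixes it.
class TransferStatus:
    def __init__(self, num_senders_expected: int):
        self.num_senders_expected = num_senders_expected
        self.received_per_key = {}  # notification tag -> count received

    def record(self, tag: int):
        self.received_per_key[tag] = self.received_per_key.get(tag, 0) + 1

    def is_done(self) -> bool:
        # Done only once every expected sender has reported at least once.
        return len(self.received_per_key) >= self.num_senders_expected


# 4 prefill ranks, PP=1: keying by pp_rank collapses all 4 notifications
# into the single key 0, so the transfer never looks complete.
buggy = TransferStatus(num_senders_expected=4)
for pp_rank in [0, 0, 0, 0]:      # every prefill rank has pp_rank == 0
    buggy.record(pp_rank)
print(buggy.is_done())            # False -> decode waits forever

# Keying by engine_rank yields one key per prefill rank.
fixed = TransferStatus(num_senders_expected=4)
for engine_rank in [0, 1, 2, 3]:  # unique per prefill rank
    fixed.record(engine_rank)
print(fixed.is_done())            # True
```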
Force-pushed from 410e7e6 to 2c7b29e
Collaborator
/tag-and-rerun-ci
Collaborator
/rerun-failed-ci
Collaborator
/rerun-failed-ci

Motivation
NIXL disaggregated serving with heterogeneous TP (prefill TP ≠ decode TP) on non-MLA models hangs indefinitely due to two bugs in nixl/conn.py:

1. Notification key collision: _process_kvcache_transfer uses pp_rank in RDMA notification tags. With PP=1, all prefill ranks share pp_rank=0, so TransferStatus.received_kvs_per_pp only records one key while num_pp_ranks_expected > 1, so is_done() never returns True and decode hangs.
2. Wrong head distribution: send_kvcache_slice uses the per-rank kv_head_num instead of total_kv_head_num, losing precision under GQA (total_kv_heads < tp_size). It also misses GQA replication handling, causing an incorrect dst_head_start_offset when multiple prefill ranks share the same KV heads.

Modifications

- send_kvcache_slice(): derive head counts from total_kv_head_num with max(1, ...) guards; add src_replication / unique_head_idx for GQA replication, aligned with Mooncake's implementation.
- _process_kvcache_transfer(): replace pp_rank with engine_rank in KV/state notification tags.

Accuracy Tests
Setup

With fix: Qwen3-32B, GB200, 1P4D (prefill TP4 → decode TP1×4), NIXL, Dynamo, GSM8K 8-shot (1311 samples).

Without fix

Same config as above.

Result

Decode hangs indefinitely: 0 completions, manually cancelled after ~50 min.
0%| | 0/1311 [00:00<?, ?it/s]
Prefill drained all requests but they stayed permanently in-flight (#inflight-req never drops to 0):

Decode workers had zero decode activity after startup; they only exited due to manual cancellation:
Config details:
Checklist