
[PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks#15027

Merged
Fridge003 merged 2 commits into sgl-project:main from YAMY1234:nixl_recov_clean
Dec 13, 2025

Conversation


YAMY1234 (Contributor) commented Dec 13, 2025

Collaborated with @hlu1 on root-cause analysis and the fix.

Motivation

Fix a NIXL PP mode correctness bug: the decode server prematurely considered a KV transfer complete after receiving chunks from only one PP rank instead of all ranks, causing a large accuracy drop.

Root cause: TransferStatus tracked chunk IDs in a single Set[int] without distinguishing PP ranks, so overlapping chunk IDs (0, 1, 2, ...) arriving from different PP ranks were deduplicated and the completion check fired early.
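The dedup failure can be seen with a minimal, self-contained sketch (variable names are illustrative, not the actual sglang code):

```python
# Old-style tracking: one Set[int] for all chunk IDs, regardless of PP rank.
received_kvs = set()
expected_chunks_per_rank = 3  # assume each of 2 PP ranks sends chunks 0, 1, 2

# PP rank 0 delivers chunks 0, 1, 2.
for chunk_id in range(expected_chunks_per_rank):
    received_kvs.add(chunk_id)

# PP rank 1 delivers the same chunk IDs; the set silently deduplicates them.
for chunk_id in range(expected_chunks_per_rank):
    received_kvs.add(chunk_id)

# Only 3 distinct IDs survive, so a count-based completion check fires
# after one rank's worth of chunks even though the other rank's KV data
# may still be in flight.
print(len(received_kvs))                              # 3, not 6
print(len(received_kvs) >= expected_chunks_per_rank)  # True: premature "done"
```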

Modifications

  • Track chunks per PP rank: received_kvs_per_pp: Dict[int, Set[int]]
  • Record expected count per PP rank: expected_kvs_per_pp: Dict[int, int]
  • Update is_done(): wait for all PP ranks to complete all chunks
  • Include pp_rank in notification format (backward compatible)
  • Replace -1 sentinel with is_failure: bool
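Taken together, the per-rank bookkeeping could look like the following simplified sketch. This is an illustrative stand-in, not the actual sglang TransferStatus; in particular, the num_pp_ranks field is an assumption about how the expected number of ranks is known.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TransferStatus:
    """Simplified per-PP-rank transfer tracking (illustrative sketch only)."""

    num_pp_ranks: int  # assumed known up front; how sglang learns this may differ
    received_kvs_per_pp: Dict[int, Set[int]] = field(default_factory=dict)
    expected_kvs_per_pp: Dict[int, int] = field(default_factory=dict)
    is_failure: bool = False  # replaces the old -1 sentinel

    def record_chunk(self, pp_rank: int, chunk_id: int, expected: int) -> None:
        # Chunks with the same ID from different PP ranks stay distinct
        # because each rank owns its own set.
        self.received_kvs_per_pp.setdefault(pp_rank, set()).add(chunk_id)
        self.expected_kvs_per_pp[pp_rank] = expected

    def is_done(self) -> bool:
        # Done only when every PP rank has reported and delivered all chunks.
        if self.is_failure:
            return False
        if len(self.expected_kvs_per_pp) < self.num_pp_ranks:
            return False
        return all(
            len(self.received_kvs_per_pp.get(rank, set())) >= expected
            for rank, expected in self.expected_kvs_per_pp.items()
        )


status = TransferStatus(num_pp_ranks=2)
for cid in range(3):
    status.record_chunk(pp_rank=0, chunk_id=cid, expected=3)
print(status.is_done())  # False: rank 1 has not delivered yet
for cid in range(3):
    status.record_chunk(pp_rank=1, chunk_id=cid, expected=3)
print(status.is_done())  # True: all ranks complete
```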

Accuracy Tests

GSM8K, prefill PP=4 TP=1, decode TP=4 PP=1.

Prefill server launch:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL=nvidia/DeepSeek-R1-0528-NVFP4-v2
export SERVED_NAME=dsr1
export HOST_IP=127.0.0.1

python3 -m sglang.launch_server \
  --model ${MODEL} \
  --served-model-name ${SERVED_NAME} \
  --host ${HOST_IP} \
  --port 12347 \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --context-length 131072 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --page-size 64 \
  --decode-log-interval 1 \
  --disaggregation-transfer-backend nixl \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --expert-parallel-size 1 \
  --chunked-prefill-size 1024 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 36 \
  --disable-radix-cache

Decode server launch:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
export MODEL=nvidia/DeepSeek-R1-0528-NVFP4-v2
export SERVED_NAME=dsr1
export HOST_IP=127.0.0.1


python3 -m sglang.launch_server \
  --model ${MODEL} \
  --served-model-name ${SERVED_NAME} \
  --host ${HOST_IP} \
  --port 12346 \
  --trust-remote-code \
  --disaggregation-mode decode \
  --context-length 131072 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --page-size 64 \
  --decode-log-interval 1 \
  --disaggregation-transfer-backend nixl \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --expert-parallel-size 1 \
  --chunked-prefill-size 1024 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 36 \
  --disable-radix-cache

Metric      Before    After
Accuracy    0.358     0.959

Checklist

  • Format code with pre-commit
  • Accuracy benchmark provided
  • Follow SGLang code style


YAMY1234 marked this pull request as draft December 13, 2025 02:32
YAMY1234 marked this pull request as ready for review December 13, 2025 04:03
ShangmingCai (Collaborator) left a comment:

Logic looks reasonable.
CC: @ishandhanani @shaharmor98

ShangmingCai (Collaborator) commented:

/tag-and-rerun-ci

Fridge003 merged commit 0e7d796 into sgl-project:main Dec 13, 2025
325 of 352 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (25 commits)
  [NPU] perf update with kvcache nz & w4a8 quant (sgl-project#14423)
  [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks (sgl-project#15027)
  Fix GLM-4.6 tool calls don't support streaming output for arguments i… (sgl-project#13989)
  feature: adding nightly wheel workflow and indexer (sgl-project#14924)
  [diffusion] feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization (sgl-project#14659)
  [Fix] Disable trtllm moe backend for draft model for a qucik fix (sgl-project#15002)
  [diffusion] fix: use NDRotaryEmbedding in flux_2   (sgl-project#15034)
  Mistral Large 3 NVFP4 support (sgl-project#14485)
  call check_quantized_moe_compatibility after initialize (sgl-project#13876)
  Add sgl_router_attempt_http_responses_total for single attempt information (sgl-project#15037)
  Add error code in prometheus metrics and add X-SMG-Error-Code header (sgl-project#15036)
  Provide more fine grained error reason for reqwest error (sgl-project#15032)
  Tiny change http router response format to unify (sgl-project#15031)
  Tiny unify grpc existing error responses into new format (sgl-project#15030)
  Add `code` field and unify error responses for router (sgl-project#15028)
  Super tiny remove unused log_request (sgl-project#15035)
  Fix decode OOM caused by retraction (sgl-project#14939)
  [CI]Add gb200 runner back (sgl-project#15024)
  Add a special label for b200 CI runner that can run kernel tests (sgl-project#15033)
  Fix regression caused by fa3 block_table (sgl-project#15009)
  ...

# Conflicts:
#	python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026