
[PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks#15027

Merged
Fridge003 merged 2 commits into sgl-project:main from YAMY1234:nixl_recov_clean
Dec 13, 2025

Conversation


YAMY1234 (Contributor) commented Dec 13, 2025

Collaborated with @hlu1 on root-cause analysis and the fix.

Motivation

Fix a NIXL PP mode correctness bug: the decode server prematurely considered a KV transfer complete after receiving chunks from only one PP rank instead of all ranks, causing a large accuracy drop.

Root cause: TransferStatus tracked chunk IDs in a single Set[int] without distinguishing PP ranks, so overlapping chunk IDs (0, 1, 2, ...) arriving from different PP ranks were deduplicated and the completion check fired early.
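The dedup failure can be seen with a minimal, self-contained sketch (variable names are illustrative, not the actual sglang code):

```python
# Old-style tracking: one Set[int] for all chunk IDs, regardless of PP rank.
received_kvs = set()
expected_chunks_per_rank = 3  # assume each of 2 PP ranks sends chunks 0, 1, 2

# PP rank 0 delivers chunks 0, 1, 2.
for chunk_id in range(expected_chunks_per_rank):
    received_kvs.add(chunk_id)

# PP rank 1 delivers the same chunk IDs; the set silently deduplicates them.
for chunk_id in range(expected_chunks_per_rank):
    received_kvs.add(chunk_id)

# Only 3 distinct IDs survive, so a count-based completion check fires
# after one rank's worth of chunks even though the other rank's KV data
# may still be in flight.
print(len(received_kvs))                              # 3, not 6
print(len(received_kvs) >= expected_chunks_per_rank)  # True: premature "done"
```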

Modifications

  • Track chunks per PP rank: received_kvs_per_pp: Dict[int, Set[int]]
  • Record expected count per PP rank: expected_kvs_per_pp: Dict[int, int]
  • Update is_done(): wait for all PP ranks to complete all chunks
  • Include pp_rank in notification format (backward compatible)
  • Replace -1 sentinel with is_failure: bool
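Taken together, the per-rank bookkeeping could look like the following simplified sketch. This is an illustrative stand-in, not the actual sglang TransferStatus; in particular, the num_pp_ranks field is an assumption about how the expected number of ranks is known.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TransferStatus:
    """Simplified per-PP-rank transfer tracking (illustrative sketch only)."""

    num_pp_ranks: int  # assumed known up front; how sglang learns this may differ
    received_kvs_per_pp: Dict[int, Set[int]] = field(default_factory=dict)
    expected_kvs_per_pp: Dict[int, int] = field(default_factory=dict)
    is_failure: bool = False  # replaces the old -1 sentinel

    def record_chunk(self, pp_rank: int, chunk_id: int, expected: int) -> None:
        # Chunks with the same ID from different PP ranks stay distinct
        # because each rank owns its own set.
        self.received_kvs_per_pp.setdefault(pp_rank, set()).add(chunk_id)
        self.expected_kvs_per_pp[pp_rank] = expected

    def is_done(self) -> bool:
        # Done only when every PP rank has reported and delivered all chunks.
        if self.is_failure:
            return False
        if len(self.expected_kvs_per_pp) < self.num_pp_ranks:
            return False
        return all(
            len(self.received_kvs_per_pp.get(rank, set())) >= expected
            for rank, expected in self.expected_kvs_per_pp.items()
        )


status = TransferStatus(num_pp_ranks=2)
for cid in range(3):
    status.record_chunk(pp_rank=0, chunk_id=cid, expected=3)
print(status.is_done())  # False: rank 1 has not delivered yet
for cid in range(3):
    status.record_chunk(pp_rank=1, chunk_id=cid, expected=3)
print(status.is_done())  # True: all ranks complete
```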

Accuracy Tests

GSM8K, prefill PP=4 TP=1, decode TP=4 PP=1.

Prefill server launch:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL=nvidia/DeepSeek-R1-0528-NVFP4-v2
export SERVED_NAME=dsr1
export HOST_IP=127.0.0.1

python3 -m sglang.launch_server \
  --model ${MODEL} \
  --served-model-name ${SERVED_NAME} \
  --host ${HOST_IP} \
  --port 12347 \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --context-length 131072 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --page-size 64 \
  --decode-log-interval 1 \
  --disaggregation-transfer-backend nixl \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --expert-parallel-size 1 \
  --chunked-prefill-size 1024 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 36 \
  --disable-radix-cache

Decode server launch:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
export MODEL=nvidia/DeepSeek-R1-0528-NVFP4-v2
export SERVED_NAME=dsr1
export HOST_IP=127.0.0.1


python3 -m sglang.launch_server \
  --model ${MODEL} \
  --served-model-name ${SERVED_NAME} \
  --host ${HOST_IP} \
  --port 12346 \
  --trust-remote-code \
  --disaggregation-mode decode \
  --context-length 131072 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --page-size 64 \
  --decode-log-interval 1 \
  --disaggregation-transfer-backend nixl \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --expert-parallel-size 1 \
  --chunked-prefill-size 1024 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 36 \
  --disable-radix-cache

Metric      Before    After
Accuracy    0.358     0.959

Checklist

  • Format code with pre-commit
  • Accuracy benchmark provided
  • Follow SGLang code style


YAMY1234 marked this pull request as draft December 13, 2025 02:32
YAMY1234 marked this pull request as ready for review December 13, 2025 04:03
ShangmingCai (Collaborator) left a comment:

Logic looks reasonable.
CC: @ishandhanani @shaharmor98

ShangmingCai (Collaborator) commented:

/tag-and-rerun-ci

Fridge003 merged commit 0e7d796 into sgl-project:main Dec 13, 2025
325 of 352 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (25 commits)
  [NPU] perf update with kvcache nz & w4a8 quant (sgl-project#14423)
  [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks (sgl-project#15027)
  Fix GLM-4.6 tool calls don't support streaming output for arguments i… (sgl-project#13989)
  feature: adding nightly wheel workflow and indexer (sgl-project#14924)
  [diffusion] feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization (sgl-project#14659)
  [Fix] Disable trtllm moe backend for draft model for a qucik fix (sgl-project#15002)
  [diffusion] fix: use NDRotaryEmbedding in flux_2   (sgl-project#15034)
  Mistral Large 3 NVFP4 support (sgl-project#14485)
  call check_quantized_moe_compatibility after initialize (sgl-project#13876)
  Add sgl_router_attempt_http_responses_total for single attempt information (sgl-project#15037)
  Add error code in prometheus metrics and add X-SMG-Error-Code header (sgl-project#15036)
  Provide more fine grained error reason for reqwest error (sgl-project#15032)
  Tiny change http router response format to unify (sgl-project#15031)
  Tiny unify grpc existing error responses into new format (sgl-project#15030)
  Add `code` field and unify error responses for router (sgl-project#15028)
  Super tiny remove unused log_request (sgl-project#15035)
  Fix decode OOM caused by retraction (sgl-project#14939)
  [CI]Add gb200 runner back (sgl-project#15024)
  Add a special label for b200 CI runner that can run kernel tests (sgl-project#15033)
  Fix regression caused by fa3 block_table (sgl-project#15009)
  ...

# Conflicts:
#	python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026