[PD] Rate limit prefill inflight polling warnings #24967

Merged
ShangmingCai merged 3 commits into sgl-project:main from tangcy98:prefill-inflight-warning on May 12, 2026

@tangcy98 tangcy98 commented May 11, 2026

Motivation

In disaggregated prefill, process_disagg_prefill_inflight_queue polls inflight KV senders on every scheduler loop. If a request remains in a transient non-terminal polling state, the same warning can be emitted repeatedly for the same rid within a short time.

The request is still treated as undone and can complete normally later, so this warning is useful as a signal but too noisy when printed every loop.

This is the output of the current code in this case:
[image]
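The noise described above can be reproduced with a small simulation. This is only a sketch: `poll_inflight_once`, the string state `"unexpected"`, and the in-memory `ListHandler` are hypothetical stand-ins and not sglang code. It shows that a request stuck in a non-terminal state produces one warning per scheduler loop.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages in memory so they can be counted."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("prefill_inflight_demo")
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def poll_inflight_once(inflight):
    """One simulated scheduler-loop pass (hypothetical, not the sglang method)."""
    still_undone = []
    for rid, state in inflight:
        if state == "unexpected":
            # Plain logger.warning: emitted again on every pass.
            logger.warning("Unexpected polling state for rid=%s: %s", rid, state)
        # The request is still treated as undone and polled again next loop.
        still_undone.append((rid, state))
    return still_undone


inflight = [("req-1", "unexpected")]
for _ in range(1000):
    inflight = poll_inflight_once(inflight)

print(len(handler.messages))  # 1000
```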

Modifications

  • Use logger.warning_once for prefill inflight unexpected polling state warnings.
  • Keep the existing warning semantics, but suppress duplicate identical warning messages.
  • Apply the same warning_once behavior to both the PP consensus path and the generic prefill inflight unexpected-state path.
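The de-duplication described in the list above can be sketched as follows. This is a minimal illustration that assumes duplicates are identified by the fully formatted message string; the `warning_once` function here is a hypothetical helper built on `functools.lru_cache` and is not claimed to be how sglang implements `logger.warning_once`.

```python
import functools
import logging


class ListHandler(logging.Handler):
    """Keeps emitted messages in a list so the demo can count them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


demo_logger = logging.getLogger("warn_once_demo")
demo_logger.setLevel(logging.WARNING)
demo_logger.propagate = False
once_handler = ListHandler()
demo_logger.addHandler(once_handler)


@functools.lru_cache(maxsize=None)
def warning_once(message: str) -> None:
    """Hypothetical helper: log each distinct message string only once."""
    demo_logger.warning(message)


# The same rid polled 1000 times in the same state produces one log line.
for _ in range(1000):
    warning_once("Unexpected polling state for rid=req-1: Bootstrapping")

# A different rid yields a different string, so it is logged again.
warning_once("Unexpected polling state for rid=req-2: Bootstrapping")

print(len(once_handler.messages))  # 2
```

Because the rid is part of the message in this sketch, each request still gets its own first warning. The trade-off is that an unbounded cache grows with the number of distinct messages, so a bounded `maxsize` may be preferable in a long-running process.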

Accuracy Tests

N/A. This PR only changes warning/logging behavior.

Speed Tests and Profiling

N/A. No inference hot-path behavior is changed beyond de-duplicating repeated warning logs.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Signed-off-by: zhangzhang <tangchenyu@xiaohongshu.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to suppress redundant warning logs during prefill inflight polling by adding a _log_prefill_inflight_poll_warning helper. This helper tracks warning states per request ID on the scheduler to throttle repeated messages. The review feedback suggests improving the readability of the cleanup logic by replacing a getattr call with an explicit hasattr check when removing completed requests from the warning state.
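For context, a per-request throttling helper of the kind the review summary describes might look roughly like the sketch below. The real `_log_prefill_inflight_poll_warning` helper is not shown on this page, so every name here (`SchedulerSketch`, `log_poll_warning`, `clear_poll_warning`, `_poll_warning_state`, `interval_s`) is a hypothetical illustration. The cleanup function uses the explicit `hasattr` check that the review feedback suggested.

```python
import logging
import time

sketch_logger = logging.getLogger("poll_warning_sketch")


class SchedulerSketch:
    """Hypothetical stand-in for the scheduler object; not the sglang class."""


def log_poll_warning(scheduler, rid, state, interval_s=10.0, now=None):
    """Warn about (rid, state) at most once per interval_s seconds.

    Returns True if a warning was emitted, False if it was suppressed.
    """
    now = time.monotonic() if now is None else now
    if not hasattr(scheduler, "_poll_warning_state"):
        scheduler._poll_warning_state = {}
    last = scheduler._poll_warning_state.get(rid)
    if last is not None:
        last_state, last_time = last
        # Suppress only if the state is unchanged and the interval has not elapsed.
        if last_state == state and now - last_time < interval_s:
            return False
    scheduler._poll_warning_state[rid] = (state, now)
    sketch_logger.warning("Unexpected polling state for rid=%s: %s", rid, state)
    return True


def clear_poll_warning(scheduler, rid):
    """Drop the warning state of a completed request (explicit hasattr check)."""
    if hasattr(scheduler, "_poll_warning_state"):
        scheduler._poll_warning_state.pop(rid, None)


demo = SchedulerSketch()
print(log_poll_warning(demo, "req-1", "Bootstrapping", now=0.0))   # True
print(log_poll_warning(demo, "req-1", "Bootstrapping", now=1.0))   # False
print(log_poll_warning(demo, "req-1", "Bootstrapping", now=11.0))  # True
clear_poll_warning(demo, "req-1")
```

Compared with a warn-once approach, this kind of helper keeps reporting a stuck request periodically, at the cost of extra state on the scheduler that must be cleaned up when the request finishes.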

Comment thread python/sglang/srt/disaggregation/prefill.py Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
        self.metrics_collector.increment_transfer_failed_reqs()
    else:
-       logger.warning(
+       _log_prefill_inflight_poll_warning(
Collaborator

Why not just use logger.warning_once?

Contributor Author

I agree that logger.warning_once would be more concise, but I'm not sure whether we need to keep informing the user about the real-time status of this request here. With warning_once, the log would stop tracking the request's status after the first message. What do you think? I can update it to a more suitable implementation.

Collaborator

In fact, in process_disagg_prefill_inflight_queue, the KVPoll state should never be KVPoll.Bootstrapping, and I have never seen this warning log before. I assume you are using the nixl/mori backend?

Contributor Author

I am using Dynamo+SGLang+Qwen3.5-122B-A10B+Mooncake in a 4p1d scenario. Here is my start command:

python3 -m dynamo.sglang \
--model-path "$PREFILL_MODEL" \
--served-model-name "$MODEL_NAME" \
--trust-remote-code \
--enable-metrics \
--collect-tokens-histogram \
--tp-size 1 \
--quantization w4afp8 \
--kv-cache-dtype fp8_e4m3 \
--mem-fraction-static 0.875 \
--context-length 32768 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--speculative-algorithm NEXTN \
--speculative-num-steps 2 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 3 \
--chunked-prefill-size 16384 \
--max-running-requests 4 \
--max-mamba-cache-size 20 \
--mamba-scheduler-strategy extra_buffer \
--tokenizer-backend fastokens \
--linear-attn-backend flashinfer \
--page-size 64 \
--disable-cuda-graph \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-bootstrap-port "$bootstrap_port" \
"${ib_args[@]}" \
--host 0.0.0.0 \
--port "$sglang_port"

Collaborator

I think warning_once should be fine. Now we need to figure out why a req can still be in the KVPoll.Bootstrapping state inside process_disagg_prefill_inflight_queue; it should no longer be in that state once it leaves pop_bootstrapped.
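The invariant discussed in this thread (no inflight request should still be bootstrapping) could be checked with something like the sketch below when investigating the follow-up bug. It is only an illustration: `KVPollSketch` is a hypothetical stand-in enum whose member names other than `Bootstrapping` are assumptions, and `rids_still_bootstrapping` is not an sglang function.

```python
from enum import Enum, auto


class KVPollSketch(Enum):
    """Hypothetical stand-in for the real KVPoll states (names assumed)."""

    Bootstrapping = auto()
    WaitingForInput = auto()
    Transferring = auto()
    Success = auto()
    Failed = auto()


def rids_still_bootstrapping(inflight_polls):
    """Return the rids whose poll state violates the expected invariant.

    inflight_polls maps rid -> KVPollSketch for requests that have already
    left the bootstrap stage and are in the inflight queue.
    """
    return [
        rid
        for rid, state in inflight_polls.items()
        if state is KVPollSketch.Bootstrapping
    ]


polls = {
    "req-1": KVPollSketch.Transferring,
    "req-2": KVPollSketch.Bootstrapping,  # unexpected for an inflight request
    "req-3": KVPollSketch.Success,
}
print(rids_still_bootstrapping(polls))  # ['req-2']
```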

Signed-off-by: zhangzhang <tangchenyu@xiaohongshu.com>
@ShangmingCai ShangmingCai merged commit 6bfb365 into sgl-project:main May 12, 2026
63 of 73 checks passed
@ShangmingCai
Collaborator

Please open an issue to explore the bug with the unexpected poll state in process_disagg_prefill_inflight_queue

@tangcy98
Contributor Author

> Please open an issue to explore the bug with the unexpected poll state in process_disagg_prefill_inflight_queue

Done. #25063

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
Signed-off-by: zhangzhang <tangchenyu@xiaohongshu.com>
Co-authored-by: zhangzhang <tangchenyu@xiaohongshu.com>
SpencerGarnets added a commit to ai-blaise/optimization-playground that referenced this pull request May 12, 2026
…ack)

Brings in upstream sgl-project/sglang main commits since
096ad02 (merge base, Laguna-XS.2 model support).
Total: 28 upstream commits composed.

Custom-stack files preserved intact (entirely-ours, byte-identical to
origin/main):
  - Blackwell CuTe kernel suite (warp_decode_cute, g1_attention_cute,
    gated_norm_cute, layersplit_cute, fused_store_index_cache)
  - TurboQuant 2.5-bit dense KV cache path
  - HIGGS 2-bit dense KV cache path (with split-K decode)
  - NVFP4 IndexCache dispatcher (active gate)
  - quantization_config_dispatch (HF-config-driven runtime routing)
  - All custom server-args flags and runtime methods preserved

Verification:
  - 200+ merged Python files compile cleanly
  - Dispatcher symbol presence verified
  - HIGGS pool / TurboQuant pool classes present at expected lines
  - compressed_tensors_w4a4_nvfp4_moe imports clean
  - All custom server-args flags present (enable_higgs_dense_2bit_kv_cache,
    enable_turboquant_dense_kv_cache, turboquant_dense_kv_preset,
    indexer_quantization_declared, higgs_mla_decode_num_splits, etc.)

Manual-merged shared files (auto-merge gave broken/mixed output; cleaned
up post-merge):
  - python/sglang/srt/disaggregation/mooncake/conn.py: upstream's PR#24932
    refactored maybe_send_extra into a state-types-loop. Replayed our
    LayerSplit NSA state-index-length-mismatch check inside the SWA/NSA
    branch of the new loop body.
  - sgl-kernel/python/sgl_kernel/__init__.py: upstream's PR#23449 (Apple
    Silicon Metal kernel) wrapped the entire module body in
    `if darwin/arm64: from sgl_kernel.metal import * else: ...`. The
    auto-merge duplicated the file body; rewrote cleanly with upstream's
    structure and re-injected our `g1_gate_forward`,
    `warp_decode_cute_moe_forward`, and
    `warp_decode_cute_moe_packed_forward` imports plus `g1_gate_forward`
    in _DEBUG_EXPORT_NAMES.
  - python/sglang/srt/managers/scheduler_output_processor_mixin.py: line
    628 still referenced `result.num_accepted_drafts` (renamed by PR
    sgl-project#25038 to `num_correct_drafts`). Renamed in place.
  - python/sglang/srt/observability/scheduler_metrics_mixin.py: a block
    around the spec-decode logging path had mixed old/new names from
    auto-merge (lines 553/557/560). Renamed `spec_num_accepted_tokens`
    -> `spec_num_accept_tokens` and local `num_accepted_drafts` ->
    `num_correct_drafts` to match the rest of the file.
  - test/test_smc_info.py: stub Req mock used the old field names
    `spec_accepted_drafts` and `update_spec_acceptance_histogram`.
    Renamed to `spec_num_correct_drafts` and
    `update_spec_correct_drafts_histogram` per PR sgl-project#24081.

Auto-merge cleanly integrated upstream changes to:
  - server_args.py (new fields: prefill_only_disable_kv_cache,
    weight_loader_drop_cache_after_load, prefill_delayer_queue_min_ratio,
    prefill_delayer_max_delay_ms, speculative_draft_window_size, etc.)
  - mem_cache/memory_pool.py (new NoOpMHATokenToKVPool)
  - model_executor/model_runner_kv_cache_mixin.py (NoOpMHATokenToKVPool
    pool factory + _validate_prefill_only_disable_kv_cache_pool_family)
  - layers/attention/nsa_backend.py (spec rename
    num_accepted_drafts -> num_correct_drafts;
    num_accepted_tokens -> num_accept_tokens)
  - layers/attention/nsa/nsa_indexer.py (new _apply_q_scale_and_softmax_scale
    compile method; torch.mm replaces deep_gemm wrapper)
  - 28+ disaggregation/spec/runner files with mostly clean
    upstream-side-only integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

----- upstream commit subjects (28) -----
fd3eb77 [Cookbook]: add Laguna-XS.2 (Poolside) (sgl-project#24730)
6be1a45 Fix swa component host hit (sgl-project#25085)
693f497 [NPU] use causal_conv1d_update_v2 for performance (sgl-project#24595)
1efe9e2 [Bug Fix] Reject incompatible combination of --disable-cuda-graph-padding and --enable-torch-compile (sgl-project#23903)
8d27ce7 Optimize uvicorn startup command (sgl-project#25041)
b35fd5f [fix] skip legacy minicpmv conv template for MiniCPM-V 4.6 (sgl-project#24998)
7582237 [Tiny Fix] Disable BCG when inner layer_model unresolved (sgl-project#25021)
ca3bc05 Deepseek-v4-Pro share expert tp1 (sgl-project#24949)
a72d3ae [Spec] Multi-layer mamba scatter cleanup; fix positional call bug (sgl-project#25030)
7128533 Revert "Migrate Intel CPU cases to the test/registered." (sgl-project#25044)
1f985c5 [Spec] Rename `accepted_indices` -> `accept_indices`; drop `_token_id` suffix per Rule 5 (sgl-project#25038)
ecf5d84 Migrate Intel CPU cases to the test/registered. (sgl-project#22670)
d7f4761 [PD] Refactor hybrid state transfer (sgl-project#24932)
91907b7 [UnifiedTree]: Fix Unified HiCache tombstone lock release replay (sgl-project#24972)
4ad63ad [Spec] Rename `accepted_drafts` -> `correct_drafts` for unambiguous naming (sgl-project#24081)
6bfb365 [PD] Rate limit prefill inflight polling warnings (sgl-project#24967)
6bb79c1 [Linear Attn] Add CUSTOM enum and plugin extensibility for kernel backends (sgl-project#24937)
cfc41d5 Fix kimi k2.5 mla eagle + dp attention (sgl-project#25033)
0f3932c [Fix] Qwen3-ASR config: set thinker_config before super().__init__ (sgl-project#24187)
f526e3f [Spec] Mamba scatter cleanup; fix multi-layer positional bug; dflash naming (sgl-project#25029)
10375a1 [NIXL][XPU] Fix uint64 overflow for mismatched P/D TP sizes (e.g. prefill_tp=1, decode_tp=2) (sgl-project#24648)
0a37d24 [diffusion] hardware: support sage attention backend on MUSA (attn backend, 21/N) (sgl-project#24752)
5495026 [HiCache] feat: default storage prefetch timeout (sgl-project#23309)
186eb42 Feat: Support SWA (Sliding Window Attention) for EAGLE-3 drafter (sgl-project#24664)
a75b79e Feat: Support newer EAGLE-3 drafters (sgl-project#24663)
f3a8189 [Spec] Internal rename per N2 v2 naming rule (sgl-project#25014)
bfc2eda [MUSA] Use MUSA-optimized operators in piecewise CUDA graph (sgl-project#23633)
74d70af [Apple Silicon] Add Metal kernel support in sgl-kernel (sgl-project#23449)
xjpang pushed a commit to xjpang/sglang that referenced this pull request May 13, 2026
Signed-off-by: zhangzhang <tangchenyu@xiaohongshu.com>
Co-authored-by: zhangzhang <tangchenyu@xiaohongshu.com>