
[SPEC V2] fix: skip stale state updates in spec-v2 overlap #23456

Merged
Qiaolin-Yu merged 5 commits into sgl-project:main from alphabetc1:fix/spec_v2_fix
May 10, 2026

Conversation

@alphabetc1 Collaborator

Motivation

In spec-v2 overlap scheduling, decode results can arrive after a request has already finished or been retracted. The old post-processing still applied accept_lens and speculative acceptance accounting (spec_verify_ct, accepted draft tokens, spec_accepted_tokens) to those stale requests, corrupting KV bookkeeping and per-request speculative metrics. This change skips stale state updates and only rolls back the pre-claimed bonus slot for finished requests.
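A minimal sketch of the guard described above, assuming simplified stand-ins for the scheduler types (the `Req` class, the loop shape, and the accounting fields here are illustrative; the real logic lives in `scheduler_output_processor_mixin.py` and differs in detail):

```python
# Hypothetical sketch: skip speculative accounting for stale requests.
# Only the finished/retracted branch structure mirrors the PR; the rest
# is a simplified stand-in for the scheduler's request bookkeeping.

class Req:
    def __init__(self):
        self.is_retracted = False
        self._finished = False
        self.kv_committed_len = 0
        self.spec_verify_ct = 0
        self.spec_accepted_tokens = 0

    def finished(self):
        return self._finished


def resolve_spec_tokens(reqs, accept_lens):
    for req, accept_len in zip(reqs, accept_lens):
        if req.is_retracted:
            # reset_for_retract() already zeroed committed/allocated KV;
            # applying accept_len here would corrupt KV bookkeeping.
            continue
        if req.finished():
            # Only roll back the bonus slot pre-claimed by
            # prepare_for_decode; skip all other accounting.
            req.kv_committed_len -= 1
            continue
        # Active request: apply speculative acceptance accounting.
        req.kv_committed_len += accept_len
        req.spec_verify_ct += 1
        req.spec_accepted_tokens += accept_len - 1
```

The key property is that stale requests contribute nothing to per-request speculative metrics, and finished requests only release the pre-claimed slot.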

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request modifies the speculative token resolution logic in scheduler_output_processor_mixin.py to properly handle retracted and finished requests, ensuring kv_committed_len is correctly managed. The review feedback suggests updating the global num_accepted_tokens metric when requests are retracted or finished to maintain accurate batch-level speculative metrics and prevent inflated efficiency statistics.

Comment on lines +369 to +371:

```python
if req.is_retracted:
    # reset_for_retract() already zeroes committed/allocated KV.
    continue
```

Severity: medium

When skipping a retracted request, its contribution to the global result.num_accepted_tokens (calculated at line 358) should also be removed. This ensures that the batch-level speculative metrics (used in update_spec_metrics and report_decode_stats) accurately reflect only the tokens that were actually committed to active requests, maintaining consistency between global and per-request statistics.

Suggested change:

```diff
 if req.is_retracted:
     # reset_for_retract() already zeroes committed/allocated KV.
+    result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
     continue
```

Comment on lines +373 to +376:

```python
if req.finished():
    # -1 because prepare_for_decode pre-claimed the bonus slot.
    req.kv_committed_len -= 1
    continue
```

Severity: medium

Similarly to retracted requests, when a request is already finished, its accepted tokens should be excluded from the global result.num_accepted_tokens to ensure that speculative decoding efficiency metrics are not inflated by stale results.

Suggested change:

```diff
 if req.finished():
     # -1 because prepare_for_decode pre-claimed the bonus slot.
     req.kv_committed_len -= 1
+    result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
     continue
```
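Both review suggestions amount to keeping the batch-level counter in sync with the per-request skips. A toy illustration of that bookkeeping (the `Result` class and `exclude_stale` helper are hypothetical; only the field names `num_accepted_tokens` and `accept_length_per_req_cpu` come from the diff):

```python
# Toy illustration of the reviewer's point: when a request is skipped as
# stale, subtract its per-request acceptance from the batch-level total
# so aggregate speculative metrics are not inflated.

class Result:
    def __init__(self, accept_length_per_req_cpu):
        self.accept_length_per_req_cpu = accept_length_per_req_cpu
        # Batch-level total, computed up front over all requests.
        self.num_accepted_tokens = sum(accept_length_per_req_cpu)


def exclude_stale(result, stale_indices):
    # Remove stale requests' contributions from the aggregate metric.
    for i in stale_indices:
        result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
    return result.num_accepted_tokens
```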

@Qiaolin-Yu Qiaolin-Yu self-assigned this Apr 22, 2026
@Qiaolin-Yu Collaborator

/tag-and-rerun-ci

@alphabetc1 Collaborator Author

/rerun-group spec

github-actions Bot commented May 6, 2026

1-gpu-5090 (4 tests): View workflow run

```shell
cd test/ && python3 registered/spec/dflash/test_dflash.py
cd test/ && python3 registered/spec/eagle/test_eagle3_basic.py
cd test/ && python3 registered/spec/eagle/test_eagle_infer_beta.py
cd test/ && python3 registered/spec/utils/test_build_eagle_tree.py
```

Dispatch failed (422) for: registered/spec/eagle/test_adaptive_speculative.py, registered/spec/eagle/test_eagle_constrained_decoding.py, registered/spec/eagle/test_eagle_infer_a.py, registered/spec/eagle/test_eagle_infer_b.py, registered/spec/test_ngram_speculative_decoding.py, registered/spec/test_standalone_speculative_decoding.py

4-gpu-b200 (2 tests): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_deepseek_v3_fp4_mtp_small.py
cd test/ && python3 registered/spec/eagle/test_eagle_infer_beta_dp_attention.py
```

4-gpu-h100 (1 test): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_eagle_dp_attention.py
```

8-gpu-b200 (1 test): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_eagle_infer_beta_dp_attention_large.py
```

2-gpu-h100 (1 test): View workflow run

```shell
cd test/ && python3 registered/spec/test_constrained_decoding_spec_reasoning.py
```

@alphabetc1 Collaborator Author

/rerun-group spec

github-actions Bot commented May 6, 2026

1-gpu-5090 (4 tests): View workflow run

```shell
cd test/ && python3 registered/spec/dflash/test_dflash.py
cd test/ && python3 registered/spec/eagle/test_eagle3_basic.py
cd test/ && python3 registered/spec/eagle/test_eagle_infer_beta.py
cd test/ && python3 registered/spec/utils/test_build_eagle_tree.py
```

1-gpu-h100 (6 tests): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_adaptive_speculative.py
cd test/ && python3 registered/spec/eagle/test_eagle_constrained_decoding.py
cd test/ && python3 registered/spec/eagle/test_eagle_infer_a.py
cd test/ && python3 registered/spec/eagle/test_eagle_infer_b.py
cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py
cd test/ && python3 registered/spec/test_standalone_speculative_decoding.py
```

4-gpu-b200 (2 tests): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_deepseek_v3_fp4_mtp_small.py
cd test/ && python3 registered/spec/eagle/test_eagle_infer_beta_dp_attention.py
```

4-gpu-h100 (1 test): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_eagle_dp_attention.py
```

8-gpu-b200 (1 test): View workflow run

```shell
cd test/ && python3 registered/spec/eagle/test_eagle_infer_beta_dp_attention_large.py
```

2-gpu-h100 (1 test): View workflow run

```shell
cd test/ && python3 registered/spec/test_constrained_decoding_spec_reasoning.py
```

@alphabetc1 Collaborator Author

/rerun-test test_mimo_models.py test_step3p5_flash_chain_mtp.py

github-actions Bot commented May 7, 2026

8-gpu-h200 (2 tests): View workflow run

```shell
cd test/ && python3 registered/8-gpu-models/test_mimo_models.py
cd test/ && python3 registered/8-gpu-models/test_step3p5_flash_chain_mtp.py
```

@Qiaolin-Yu merged commit b4d347e into sgl-project:main on May 10, 2026
209 of 224 checks passed
@alphabetc1 alphabetc1 deleted the fix/spec_v2_fix branch May 10, 2026 13:34
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py