[SPEC V2] fix: skip stale state updates in spec-v2 overlap #23456
Qiaolin-Yu merged 5 commits into sgl-project:main
Code Review
This pull request modifies the speculative token resolution logic in scheduler_output_processor_mixin.py to properly handle retracted and finished requests, ensuring kv_committed_len is correctly managed. The review feedback suggests updating the global num_accepted_tokens metric when requests are retracted or finished to maintain accurate batch-level speculative metrics and prevent inflated efficiency statistics.
    if req.is_retracted:
        # reset_for_retract() already zeroes committed/allocated KV.
        continue
When skipping a retracted request, its contribution to the global result.num_accepted_tokens (calculated at line 358) should also be removed. This ensures that the batch-level speculative metrics (used in update_spec_metrics and report_decode_stats) accurately reflect only the tokens that were actually committed to active requests, maintaining consistency between global and per-request statistics.
Suggested change:

      if req.is_retracted:
          # reset_for_retract() already zeroes committed/allocated KV.
    +     result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
          continue
    if req.finished():
        # -1 because prepare_for_decode pre-claimed the bonus slot.
        req.kv_committed_len -= 1
        continue
Similarly to retracted requests, when a request is already finished, its accepted tokens should be excluded from the global result.num_accepted_tokens to ensure that speculative decoding efficiency metrics are not inflated by stale results.
Suggested change:

      if req.finished():
          # -1 because prepare_for_decode pre-claimed the bonus slot.
          req.kv_committed_len -= 1
    +     result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
          continue
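The reviewer's point about batch-level metrics can be illustrated with a small standalone sketch. The function name `adjusted_num_accepted_tokens` and the `is_stale` flags are hypothetical and chosen only for illustration; the suggestion itself subtracts from `result.num_accepted_tokens` in place, and this page does not show whether it was adopted in the merged code.

```python
from typing import Sequence


def adjusted_num_accepted_tokens(
    accept_length_per_req: Sequence[int],
    is_stale: Sequence[bool],
) -> int:
    """Batch total of accepted tokens, excluding stale requests.

    Hypothetical helper: `is_stale[i]` stands in for
    `req.is_retracted or req.finished()` on the i-th request.
    """
    # Start from the unadjusted batch total.
    total = sum(accept_length_per_req)
    for accepted, stale in zip(accept_length_per_req, is_stale):
        if stale:
            # Remove the contribution of a request whose result arrived too late.
            total -= accepted
    return total
```

With per-request accept lengths `[2, 3, 1]` and the second request stale, the unadjusted total is 6 while the adjusted total is 3, which is the kind of inflation the review comments describe.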
/tag-and-rerun-ci
/rerun-group spec
✅ ❌ ✅ ✅ ✅ ✅
/rerun-group spec
✅ ✅ ✅ ✅ ✅ ✅
/rerun-test test_mimo_models.py test_step3p5_flash_chain_mtp.py
✅
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...
# Conflicts:
#   python/sglang/srt/utils/common.py
Motivation
In spec-v2 overlap scheduling, decode results can arrive after a request has already
finished or been retracted. The old post-processing still applied
accept_lens and speculative acceptance accounting
(spec_verify_ct, accepted draft tokens, spec_accepted_tokens) to those stale requests,
corrupting KV bookkeeping and per-request speculative metrics.
This change skips stale state updates and only rolls back the pre-claimed
bonus slot for finished requests.
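A minimal sketch of the control flow described above, under stated assumptions: `Req` is a hypothetical stand-in for the scheduler's request object, `resolve_spec_results` is an illustrative name, and the accounting on the active path (adding `accept_len - 1` because the bonus slot is already counted) is assumed for the example rather than taken from the merged code. Only the two early-exit branches mirror the snippets quoted in the review.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request; not the real sglang class."""

    kv_committed_len: int = 0
    spec_verify_ct: int = 0
    spec_accepted_tokens: int = 0
    is_retracted: bool = False
    is_finished: bool = False

    def finished(self) -> bool:
        return self.is_finished


def resolve_spec_results(reqs: Sequence[Req], accept_lens: Sequence[int]) -> None:
    """Apply per-request speculative results, skipping stale requests."""
    for req, accept_len in zip(reqs, accept_lens):
        if req.is_retracted:
            # Retraction already reset the KV counters, so there is nothing to undo.
            continue
        if req.finished():
            # Give back the bonus slot that was claimed before this result arrived.
            req.kv_committed_len -= 1
            continue
        # Assumed active-path accounting: the bonus slot is already counted,
        # so only the remaining accepted tokens are added.
        req.kv_committed_len += accept_len - 1
        req.spec_verify_ct += 1
        req.spec_accepted_tokens += accept_len - 1
```

In this sketch a retracted request keeps its zeroed counters, a finished request only loses the one pre-claimed slot, and neither has its per-request speculative metrics touched by the late result.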
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci