Merged
14 changes: 12 additions & 2 deletions python/sglang/srt/managers/scheduler_output_processor_mixin.py
@@ -389,11 +389,21 @@ def _resolve_spec_overlap_token_ids(
 stride = self.draft_worker.speculative_num_draft_tokens

 for i, req in enumerate(batch.reqs):
-    # -1 because prepare_for_decode pre-claimed the bonus slot.
-    req.kv_committed_len += accept_lens[i] - 1
     predict_tokens.append(
         next_token_ids[i * stride : i * stride + accept_lens[i]]
     )
+
+    if req.is_retracted:
+        # reset_for_retract() already zeroes committed/allocated KV.
+        continue
Comment on lines +396 to +398
Contributor


medium

When skipping a retracted request, its contribution to the global result.num_accepted_tokens (calculated at line 358) should also be removed. This ensures that the batch-level speculative metrics (used in update_spec_metrics and report_decode_stats) accurately reflect only the tokens that were actually committed to active requests, maintaining consistency between global and per-request statistics.

Suggested change
-if req.is_retracted:
-    # reset_for_retract() already zeroes committed/allocated KV.
-    continue
+if req.is_retracted:
+    # reset_for_retract() already zeroes committed/allocated KV.
+    result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
+    continue
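
A minimal sketch of the adjustment suggested above, using a hypothetical `ToyResult` stand-in rather than the real result object. It assumes the batch total starts out as the plain sum of the per-request accept lengths; the total is computed at line 358, outside this hunk, so that assumption is not confirmed by the diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyResult:
    # Hypothetical stand-in for the batch result; only the two field
    # names are taken from the review comment.
    num_accepted_tokens: int
    accept_length_per_req_cpu: List[int]


def drop_skipped_from_total(result: ToyResult, skipped: List[bool]) -> int:
    """Subtract each skipped request's accept length from the batch total."""
    for i, is_skipped in enumerate(skipped):
        if is_skipped:
            result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
    return result.num_accepted_tokens


accept_lengths = [3, 2, 4]
result = ToyResult(
    # Assumption: the total is the sum of the per-request values.
    num_accepted_tokens=sum(accept_lengths),
    accept_length_per_req_cpu=accept_lengths,
)
# Request 1 is treated as skipped (for example, retracted).
remaining = drop_skipped_from_total(result, skipped=[False, True, False])
```

If the assumption holds, the batch total afterwards equals the sum over the requests that were actually processed, which is the consistency the comment asks for.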


+    if req.finished():
+        # -1 because prepare_for_decode pre-claimed the bonus slot.
+        req.kv_committed_len -= 1
+        continue
Comment on lines +400 to +403
Contributor


medium

As with retracted requests, when a request has already finished, its accepted tokens should be excluded from the global result.num_accepted_tokens so that speculative decoding efficiency metrics are not inflated by stale results.

Suggested change
-if req.finished():
-    # -1 because prepare_for_decode pre-claimed the bonus slot.
-    req.kv_committed_len -= 1
-    continue
+if req.finished():
+    # -1 because prepare_for_decode pre-claimed the bonus slot.
+    req.kv_committed_len -= 1
+    result.num_accepted_tokens -= result.accept_length_per_req_cpu[i]
+    continue
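
For readers following the KV bookkeeping in this hunk, here is a toy model of the three branches in the loop. `ToyReq`, `pre_claim_bonus_slot`, and `commit_after_verify` are hypothetical names, and modelling the pre-claim as a `+1` is an assumption drawn only from the code comment about `prepare_for_decode`; the real scheduler may track more state.

```python
from dataclasses import dataclass


@dataclass
class ToyReq:
    # Hypothetical stand-in for a request; field names mirror the diff.
    kv_committed_len: int
    is_retracted: bool = False
    is_finished: bool = False
    spec_verify_ct: int = 0


def pre_claim_bonus_slot(req: ToyReq) -> None:
    # Assumption: the pre-claim bumps the committed length by one slot.
    req.kv_committed_len += 1


def commit_after_verify(req: ToyReq, accept_len: int) -> None:
    if req.is_retracted:
        # Already reset elsewhere; leave the counters alone.
        return
    if req.is_finished:
        # Undo the pre-claimed bonus slot; nothing new is committed.
        req.kv_committed_len -= 1
        return
    # One of the accepted tokens already occupies the pre-claimed slot.
    req.kv_committed_len += accept_len - 1
    req.spec_verify_ct += 1


active = ToyReq(kv_committed_len=10)
pre_claim_bonus_slot(active)
commit_after_verify(active, accept_len=3)

finished = ToyReq(kv_committed_len=10, is_finished=True)
pre_claim_bonus_slot(finished)
commit_after_verify(finished, accept_len=3)

# Modelled as already reset to zero before the results are processed.
retracted = ToyReq(kv_committed_len=0, is_retracted=True)
commit_after_verify(retracted, accept_len=3)
```

In this model the active request ends three tokens past its starting length, the finished request returns to its starting length, and the retracted request stays at zero.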


+    # -1 because prepare_for_decode pre-claimed the bonus slot.
+    req.kv_committed_len += accept_lens[i] - 1
     req.spec_verify_ct += 1

     accepted_draft_tokens = result.num_accepted_drafts_per_req_cpu[i]