[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 #26235

Merged: Qiaolin-Yu merged 1 commit into main from qiaolin/draft-topk1-argmax-fastpath on May 25, 2026
@Qiaolin-Yu commented May 24, 2026

The greedy spec-decoding path (--speculative-eagle-topk 1) currently runs a full-vocab torch.softmax + torch.max for every draft step and again for _draft_extend_for_decode (both eager and inside the captured draft-extend CUDA graph). With topk == 1 the draft tree is a single path, so topk_p does not feed back into any ranking decision (see spec_utils._select_top_k_tokens_later: the scores are multiplied along a degenerate single branch). Because softmax is strictly monotonic within a row, topk_index = argmax(logits) is identical to argmax(softmax(logits)), so the softmax is purely wasted work.
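
The argmax equivalence can be checked outside the codebase. Below is a minimal pure-Python sketch using only the standard library; it is not the sglang code, and `softmax`, `argmax`, and `argmax_matches_softmax` are illustrative helpers introduced here.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the first maximum element."""
    return max(range(len(row)), key=row.__getitem__)


def argmax_matches_softmax(num_rows=100, vocab=1000, seed=0):
    """Return True if argmax(logits) == argmax(softmax(logits)) on every random row."""
    rng = random.Random(seed)
    for _ in range(num_rows):
        logits = [rng.uniform(-10.0, 10.0) for _ in range(vocab)]
        if argmax(logits) != argmax(softmax(logits)):
            return False
    return True
```

Calling `argmax_matches_softmax()` compares the two selections on random rows. The only way they could differ is a floating-point tie created by rounding inside the softmax, which is why the sketch uses well-separated random logits.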

Profile (Kimi-K2.5-NVFP4 / TP=4 / 80K ctx / EAGLE3 3-step / bs=1): cunn_SoftMaxForward was ~43 µs/call. It fired 2× per DRAFT_DECODE (steps 0 and 1 of the loop), 1× per _draft_extend_for_decode, and 1× inside the captured DRAFT_EXTEND graph, i.e. 4 launches and ~175 µs/cycle total. After this change all three call sites use argmax with a constant topk_p = ones.
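
The per-cycle figure is just the launch count times the per-call cost. A small sketch of that arithmetic, using only the numbers quoted in the profile (the constant and function names are made up for illustration):

```python
SOFTMAX_US_PER_CALL = 43  # approximate per-call cost quoted in the profile

# Softmax launches per draft cycle, as listed in the description.
CALLS_PER_CYCLE = {
    "DRAFT_DECODE loop (steps 0 and 1)": 2,
    "_draft_extend_for_decode (eager)": 1,
    "captured DRAFT_EXTEND graph": 1,
}


def softmax_us_per_cycle(us_per_call=SOFTMAX_US_PER_CALL, calls=CALLS_PER_CYCLE):
    """Total softmax time per cycle in microseconds."""
    return us_per_call * sum(calls.values())
```

With these inputs the function returns 172, in line with the roughly 175 µs/cycle reported, since the per-call cost is itself approximate.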

Patched sites:

  • eagle_worker_v2.py:draft_forward — inner draft step loop, runs inside the captured DRAFT_DECODE cuda graph.
  • eagle_worker_v2.py:_draft_extend_for_decode — post-graph reorganization after the draft-extend replay.
  • eagle_draft_extend_cuda_graph_runner.py:capture_one_batch_size — the softmax+topk burned into the DRAFT_EXTEND cuda graph itself.

All three are gated by if self.topk == 1; multi-path tree behavior (topk > 1) is unchanged.
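
The gating pattern can be sketched for a single row of logits as follows. This is a pure-Python illustration under stated assumptions, not the patched sglang code: `select_topk` is a hypothetical helper, and the real sites operate on batched tensors with torch.argmax and a constant ones tensor.

```python
import math


def select_topk(logits, topk):
    """Return (topk_p, topk_index) for one row of draft logits.

    Hypothetical helper for illustration only.
    """
    if topk == 1:
        # Greedy fast path: skip softmax; the probability is a placeholder 1.0
        # because nothing downstream ranks on it when the tree is a single path.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0], [best]

    # General path (topk > 1): full softmax, then the k largest probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:topk]
    return [probs[i] for i in order], order
```

For `topk == 1` the returned index matches the first index the general path would pick, while the returned probability is a constant rather than the true softmax value.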

Measured on the canonical workload (10 prompts, max-concurrency=1, no SGLANG_SIMULATE_ACC_LEN):

  metric         baseline   patched   delta
  Mean TPOT      2.41 ms    2.36 ms   -0.05 ms
  Med  TPOT      2.37 ms    2.34 ms   -0.03 ms
  1000/Mean      414.9      423.7     +8.8 tok/s (+2.1%)
  1000/Med       421.9      427.4     +5.5 tok/s (+1.3%)
  accept_length  3.92       3.94      unchanged (within noise)
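
The 1000/Mean and 1000/Med rows convert TPOT in milliseconds to tokens per second. A short sketch of that conversion and of the relative gain (the function names are illustrative, not part of any benchmark tool):

```python
def tokens_per_second(tpot_ms):
    """Convert time per output token (ms) to throughput (tok/s)."""
    return 1000.0 / tpot_ms


def relative_gain_percent(baseline_tpot_ms, patched_tpot_ms):
    """Throughput gain of the patched run over the baseline, in percent."""
    base = tokens_per_second(baseline_tpot_ms)
    return (tokens_per_second(patched_tpot_ms) - base) / base * 100.0
```

For example, `tokens_per_second(2.41)` rounds to 414.9 and `relative_gain_percent(2.41, 2.36)` rounds to 2.1, matching the Mean row.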

GPU-side cunn_SoftMaxForward count drops to 0 in both DRAFT_DECODE and DRAFT_EXTEND kernel breakdowns.


CI States

Latest PR Test (Base): ❌ Run #26374296016
Latest PR Test (Extra): ✅ Run #26374315215

@Qiaolin-Yu
Collaborator Author

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the EAGLE speculative decoding process by skipping the full-vocab softmax operation when topk is set to 1. The changes in eagle_draft_extend_cuda_graph_runner.py and eagle_worker_v2.py replace the softmax and fast_topk calls with a more efficient torch.argmax and a constant probability tensor, as the exact probability values are not required for greedy decoding. I have no feedback to provide.

@Qiaolin-Yu Qiaolin-Yu changed the title from "[Spec Decoding] Skip full-vocab softmax in EAGLE draft when topk == 1" to "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1" May 25, 2026
@Qiaolin-Yu Qiaolin-Yu requested a review from kpham-sgl May 25, 2026 04:37
@Qiaolin-Yu Qiaolin-Yu merged commit a77449f into main May 25, 2026
253 of 294 checks passed
@Qiaolin-Yu Qiaolin-Yu deleted the qiaolin/draft-topk1-argmax-fastpath branch May 25, 2026 09:06
Qiaolin-Yu pushed a commit that referenced this pull request May 26, 2026
Qiaolin-Yu added a commit that referenced this pull request May 26, 2026
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026