[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 #26235
Merged
The greedy spec-decoding path (`--speculative-eagle-topk 1`) currently runs a full-vocab `torch.softmax` + `torch.max` for every draft step and again for `_draft_extend_for_decode` (both eager and inside the captured draft-extend cuda graph). With topk == 1 the draft tree is a single path, so `topk_p` does not feed back into any ranking decision (see `spec_utils._select_top_k_tokens_later`: the scores are multiplied along a degenerate single branch). `topk_index = argmax(logits)` is identical to `argmax(softmax(logits))`, so the softmax is purely wasted work.

Profile (Kimi-K2.5-NVFP4 / TP=4 / 80K ctx / EAGLE3 3-step / bs=1): `cunn_SoftMaxForward` was ~43 µs/call. It fired 2× per DRAFT_DECODE (steps 0 and 1 of the loop), 1× per `_draft_extend_for_decode`, and 1× inside the captured DRAFT_EXTEND graph, for ~175 µs/cycle total. After this change all three call sites use argmax with a constant `topk_p = ones`.

Patched sites:

- `eagle_worker_v2.py:draft_forward`: inner draft step loop, runs inside the captured DRAFT_DECODE cuda graph.
- `eagle_worker_v2.py:_draft_extend_for_decode`: post-graph reorganization after the draft-extend replay.
- `eagle_draft_extend_cuda_graph_runner.py:capture_one_batch_size`: the softmax+topk burned into the DRAFT_EXTEND cuda graph itself.

All three are gated by `if self.topk == 1`; multi-path tree behavior (topk > 1) is unchanged.

Measured on the canonical workload (10 prompts, max-concurrency=1, no `SGLANG_SIMULATE_ACC_LEN`):

| metric | baseline | patched | delta |
|---|---|---|---|
| Mean TPOT | 2.41 ms | 2.36 ms | -0.05 ms |
| Med TPOT | 2.37 ms | 2.34 ms | -0.03 ms |
| 1000/Mean | 414.9 | 423.7 | +8.8 tok/s (+2.1%) |
| 1000/Med | 421.9 | 427.4 | +5.5 tok/s (+1.3%) |
| accept_length | 3.92 | 3.94 | unchanged (within noise) |

GPU-side `cunn_SoftMaxForward` count drops to 0 in both DRAFT_DECODE and DRAFT_EXTEND kernel breakdowns.
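As a sanity check on the numbers above, the per-cycle softmax cost and the derived throughput rows can be recomputed from the reported figures. This is a small arithmetic sketch only: the constants are copied from the description, the variable and function names are made up for illustration, and nothing here is a new measurement.

```python
# Figures copied from the PR description; nothing here is newly measured.
SOFTMAX_US_PER_CALL = 43      # reported cunn_SoftMaxForward cost (approximate)
CALLS_PER_CYCLE = 2 + 1 + 1   # 2x DRAFT_DECODE, 1x eager extend, 1x captured DRAFT_EXTEND


def tokens_per_second(tpot_ms: float) -> float:
    """Convert time-per-output-token in milliseconds to tokens per second."""
    return 1000.0 / tpot_ms


# 4 calls at ~43 us each, consistent with the "~175 us/cycle" estimate.
softmax_us_per_cycle = SOFTMAX_US_PER_CALL * CALLS_PER_CYCLE

baseline_tps = tokens_per_second(2.41)   # Mean TPOT, baseline
patched_tps = tokens_per_second(2.36)    # Mean TPOT, patched
gain_pct = (patched_tps - baseline_tps) / baseline_tps * 100

print(f"softmax per cycle: {softmax_us_per_cycle} us")
print(f"mean throughput: {baseline_tps:.1f} -> {patched_tps:.1f} tok/s ({gain_pct:+.1f}%)")
```

The 1000/Mean and 1000/Med rows in the table are this same conversion applied to the TPOT rows, so they carry no information beyond the TPOT measurements themselves.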
71e0814 to 0b9af3b
Collaborator
Author
/tag-and-rerun-ci
Contributor
Code Review
This pull request optimizes the EAGLE speculative decoding process by skipping the full-vocab softmax operation when topk is set to 1. The changes in eagle_draft_extend_cuda_graph_runner.py and eagle_worker_v2.py replace the softmax and fast_topk calls with a more efficient torch.argmax and a constant probability tensor, as the exact probability values are not required for greedy decoding. I have no feedback to provide.
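The substitution summarized above relies on softmax being strictly increasing in each logit, so in exact arithmetic it never changes which index is largest. The following is a minimal pure-Python sketch of that idea, not the sglang code: `pick_draft_token` is a hypothetical helper, it handles a single row of logits as a plain list, and the real patch operates on batched tensors inside the files named in the description.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the first maximal element."""
    return max(range(len(values)), key=values.__getitem__)


def pick_draft_token(logits, topk):
    """Hypothetical helper returning (topk_p, topk_index) for one row of logits."""
    if topk == 1:
        # Greedy path: no softmax; the probability is a constant placeholder.
        return [1.0], [argmax(logits)]
    # Multi-path tree: real probabilities are needed to rank branches.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:topk]
    return [probs[i] for i in order], order


logits = [0.3, -1.2, 4.5, 2.0]
print(argmax(logits) == argmax(softmax(logits)))  # same winner either way
print(pick_draft_token(logits, topk=1))
```

The constant `1.0` is only safe because, as the description argues, a single-path tree never compares scores across branches; with topk > 1 the sketch falls back to the full softmax, mirroring the `topk == 1` gate.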
kpham-sgl
approved these changes
May 25, 2026
Shunkangz
pushed a commit
to Shunkangz/sglang
that referenced
this pull request
May 27, 2026
CI States
Latest PR Test (Base): ❌ Run #26374296016
Latest PR Test (Extra): ✅ Run #26374315215