[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 by Qiaolin-Yu · Pull Request #26235 · sgl-project/sglang

Qiaolin-Yu · 2026-05-24T22:15:48Z

The greedy spec-decoding path (`--speculative-eagle-topk 1`) currently runs a full-vocab `torch.softmax` + `torch.max` on every draft step and again in `_draft_extend_for_decode` (both eager and inside the captured draft-extend CUDA graph). With `topk == 1` the draft tree is a single path, so `topk_p` does not feed back into any ranking decision (see `spec_utils._select_top_k_tokens_later`: the scores are multiplied along a degenerate single branch). Because softmax is strictly monotonic, `topk_index = argmax(logits)` is identical to `argmax(softmax(logits))`, so the softmax is purely wasted work.
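
The equivalence claimed above can be checked numerically. The following is a minimal pure-Python sketch, not the SGLang code: `softmax` and `argmax` here are hypothetical stand-ins for `torch.softmax` and `torch.argmax`, and the vocabulary size of 64 is an arbitrary assumption.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the first maximum, i.e. a greedy top-1 pick."""
    return max(range(len(values)), key=values.__getitem__)


random.seed(0)
mismatches = 0
for _ in range(1000):
    # One random "row of logits" per trial (illustrative data only).
    logits = [random.uniform(-10.0, 10.0) for _ in range(64)]
    if argmax(logits) != argmax(softmax(logits)):
        mismatches += 1

print("mismatches:", mismatches)
```

Softmax preserves the ordering of its inputs, so the two picks can only differ when floating-point rounding collapses two nearly equal top logits into a tie.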

Profile (Kimi-K2.5-NVFP4 / TP=4 / 80K ctx / EAGLE3 3-step / bs=1): `cunn_SoftMaxForward` took ~43 µs/call. It fired 2× per DRAFT_DECODE (steps 0 and 1 of the loop), 1× per `_draft_extend_for_decode`, and 1× inside the captured DRAFT_EXTEND graph, for ~175 µs/cycle in total. After this change all three call sites use argmax with a constant `topk_p = ones`.
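
As a rough sanity check on these numbers, the per-cycle cost can be rebuilt from the call counts and then spread over the tokens produced per cycle. This is a back-of-the-envelope sketch: the dictionary keys are made-up labels, and dividing by the reported `accept_length` (3.92) to get a per-token figure is an assumption about how a cycle maps to output tokens.

```python
SOFTMAX_US_PER_CALL = 43.0  # approximate, from the profile above

# Softmax launches per speculative cycle, as listed in the profile.
CALLS_PER_CYCLE = {
    "draft_decode_loop": 2,        # steps 0 and 1 of the draft loop
    "draft_extend_for_decode": 1,  # eager call after the replay
    "draft_extend_graph": 1,       # call captured inside the graph
}

ACCEPT_LENGTH = 3.92  # baseline accept_length reported in this PR

cycle_us = SOFTMAX_US_PER_CALL * sum(CALLS_PER_CYCLE.values())
per_token_ms = cycle_us / ACCEPT_LENGTH / 1000.0

print(f"softmax per cycle: {cycle_us:.0f} us")
print(f"softmax per output token: {per_token_ms:.3f} ms")
```

Four calls at about 43 µs each is in line with the quoted ~175 µs/cycle, and the per-token estimate lands in the same range as the measured TPOT deltas (0.03 to 0.05 ms).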

Patched sites:

- `eagle_worker_v2.py:draft_forward`: inner draft step loop, runs inside the captured DRAFT_DECODE CUDA graph.
- `eagle_worker_v2.py:_draft_extend_for_decode`: post-graph reorganization after the draft-extend replay.
- `eagle_draft_extend_cuda_graph_runner.py:capture_one_batch_size`: the softmax+topk burned into the DRAFT_EXTEND CUDA graph itself.

All three are gated by `if self.topk == 1`; multi-path tree behavior (`topk > 1`) is unchanged.
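
The shape of that gate can be sketched in plain Python. `select_draft_tokens` is a hypothetical helper written for illustration, not the function in `eagle_worker_v2.py`; it works on one row of logits as a list, where the real code operates on batched tensors.

```python
import math


def select_draft_tokens(logits, topk):
    """Return (topk_p, topk_index) for one row of draft logits.

    Hypothetical illustration of the topk == 1 gate, not SGLang code.
    """
    if topk == 1:
        # Fast path: the only draft token is the argmax of the raw logits.
        # Its score is a constant 1.0 because nothing ranks against it.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0], [best]

    # General path: full softmax, then keep the topk most probable tokens.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:topk]
    return [probs[i] for i in order], order


row = [0.1, 2.0, -1.0]
fast_p, fast_idx = select_draft_tokens(row, topk=1)
full_p, full_idx = select_draft_tokens(row, topk=2)
print(fast_p, fast_idx)
print(full_idx)
```

Both paths agree on the top token; only the fast path skips the exponentials and the normalization over the whole vocabulary.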

Measured on the canonical workload (10 prompts, max-concurrency=1, no `SGLANG_SIMULATE_ACC_LEN`):

| metric | baseline | patched | delta |
| --- | --- | --- | --- |
| Mean TPOT | 2.41 ms | 2.36 ms | -0.05 ms |
| Med TPOT | 2.37 ms | 2.34 ms | -0.03 ms |
| 1000/Mean | 414.9 | 423.7 | +8.8 tok/s (+2.1%) |
| 1000/Med | 421.9 | 427.4 | +5.5 tok/s (+1.3%) |
| accept_length | 3.92 | 3.94 | unchanged (within noise) |
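
The `1000/Mean` and `1000/Med` rows are the TPOT rows converted from ms per token to tokens per second. A small sketch of that conversion follows; `throughput_row` is a hypothetical helper, and it assumes the deltas were taken between the one-decimal rounded throughputs.

```python
def throughput_row(base_tpot_ms, patched_tpot_ms):
    """Convert two TPOT values (ms/token) into tok/s figures.

    Returns (baseline, patched, delta, percent gain), rounded to one decimal.
    """
    base = round(1000.0 / base_tpot_ms, 1)
    patched = round(1000.0 / patched_tpot_ms, 1)
    delta = round(patched - base, 1)
    percent = round(100.0 * delta / base, 1)
    return base, patched, delta, percent


mean_row = throughput_row(2.41, 2.36)
median_row = throughput_row(2.37, 2.34)
print("mean:", mean_row)
print("median:", median_row)
```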

GPU-side `cunn_SoftMaxForward` count drops to 0 in both DRAFT_DECODE and DRAFT_EXTEND kernel breakdowns.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #26374296016
Latest PR Test (Extra): ✅ Run #26374315215

Qiaolin-Yu · 2026-05-24T22:19:44Z

/tag-and-rerun-ci

gemini-code-assist

Code Review

This pull request optimizes the EAGLE speculative decoding process by skipping the full-vocab softmax operation when topk is set to 1. The changes in eagle_draft_extend_cuda_graph_runner.py and eagle_worker_v2.py replace the softmax and fast_topk calls with a more efficient torch.argmax and a constant probability tensor, as the exact probability values are not required for greedy decoding. I have no feedback to provide.

Qiaolin-Yu requested review from Ying1123, hnyls2002 and merrymercy as code owners May 24, 2026 22:15

Qiaolin-Yu force-pushed the qiaolin/draft-topk1-argmax-fastpath branch from 71e0814 to 0b9af3b Compare May 24, 2026 22:18

Qiaolin-Yu added high priority bypass-fastfail labels May 24, 2026

Qiaolin-Yu assigned ispobock and kpham-sgl May 24, 2026

github-actions Bot added the run-ci label May 24, 2026

gemini-code-assist Bot reviewed May 24, 2026

Qiaolin-Yu changed the title ~~[Spec Decoding] Skip full-vocab softmax in EAGLE draft when topk == 1~~ [perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 May 25, 2026

Qiaolin-Yu requested a review from kpham-sgl May 25, 2026 04:37

kpham-sgl approved these changes May 25, 2026

Qiaolin-Yu merged commit a77449f into main May 25, 2026
253 of 294 checks passed

Qiaolin-Yu deleted the qiaolin/draft-topk1-argmax-fastpath branch May 25, 2026 09:06

michaelzhang-ai mentioned this pull request May 26, 2026

Revert "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 (#26235)" #26358

Merged

5 tasks

Qiaolin-Yu pushed a commit that referenced this pull request May 26, 2026

Revert "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft …

9409969

…when topk == 1 (#26235)" (#26358)

michaelzhang-ai mentioned this pull request May 26, 2026

[perf][spec decoding] Re-land #26235 (skip EAGLE topk==1 softmax) for CUDA only #26407

Closed

5 tasks

Qiaolin-Yu added a commit that referenced this pull request May 26, 2026

Reland "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft …

dd6f073

…when topk == 1 (#26235)" (#26397)

Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when top…

35aa4aa

…k == 1 (sgl-project#26235)

Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

Revert "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft …

dbe04e7

…when topk == 1 (sgl-project#26235)" (sgl-project#26358)

michaelzhang-ai mentioned this pull request May 29, 2026

[spec decoding] Re-enable EAGLE topk==1 argmax fastpath on ROCm #26633

Open

Qiaolin-Yu merged 1 commit into main from qiaolin/draft-topk1-argmax-fastpath