Reland "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 (#26235)" by Qiaolin-Yu · Pull Request #26397 · sgl-project/sglang

Qiaolin-Yu · 2026-05-26T10:11:18Z

CI States

Latest PR Test (Base): 🚫 Run #26472585858
Latest PR Test (Extra): ✅ Run #26472613859

…E draft …" This reverts commit 9409969.

Qiaolin-Yu · 2026-05-26T10:12:00Z

/rerun-test test/registered/quant/test_deepseek_v32_fp4_mtp_4gpu.py

github-actions · 2026-05-26T10:12:26Z

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/quant/test_deepseek_v32_fp4_mtp_4gpu.py

gemini-code-assist

Code Review

This pull request optimizes speculative decoding when topk is set to 1. By checking if self.topk == 1, the code bypasses the expensive softmax and fast_topk operations, instead directly computing the argmax over the next token logits and setting the top-k probabilities to 1. This optimization is applied across eagle_draft_extend_cuda_graph_runner.py and eagle_worker_v2.py. No review comments were provided for this pull request.
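The selection logic summarised in this review can be sketched in plain Python. This is only an illustration under stated assumptions: `select_topk` and `softmax` are hypothetical names, Python lists stand in for the tensors the real code operates on, and a sorted slice stands in for `fast_topk`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_topk(logits, topk=1):
    """Return (topk_p, topk_index) for one row of logits (hypothetical helper)."""
    if topk == 1:
        # Fast path: softmax is monotonic, so the argmax of the raw logits
        # is also the argmax of the probabilities. The probability is
        # reported as 1.0 instead of being computed.
        index = max(range(len(logits)), key=logits.__getitem__)
        return [1.0], [index]
    # General path: full softmax, then keep the topk largest entries.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:topk]
    return [probs[i] for i in order], order
```

Note that on the fast path `topk_p` is a placeholder rather than a true probability, so any consumer that compares it against a real softmax output will see a difference.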

Qiaolin-Yu · 2026-05-26T10:13:51Z

/rerun-test test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py

github-actions · 2026-05-26T10:14:11Z

⛔ test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py: No register_cuda_ci(runner_config=...) or register_cpu_ci() found in test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py. /rerun-test only supports tests registered via the new-style yml-driven API; nightly/weekly tests aren't dispatchable through this command.

michaelzhang-ai · 2026-05-26T13:41:59Z

Hi @Qiaolin-Yu — thanks for re-landing the optimization. One concern with this exact form: it's a straight revert of the revert, so it brings back the same code that caused R108 (DSv3.2-MTP gsm8k → 0.035 with 96% invalid output on ROCm 7.2 mi35x). The R108 verify on the revert branch (run) just confirmed the revert recovered MTP from 0.035 → 0.975 on the exact same hardware + aiter pin, so the cause is the topk==1 path itself, not anything else in the regression window.

I just dispatched nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720 against this PR's branch to confirm — ETA ~50 min. If the AMD-side bug is robust (which the prior evidence suggests), it'll fail again at ~0.035.

Two ways forward, your call:

1. Apply a tiny AMD gate to this PR. I opened #26407 with the same optimization but wrapped in `if self.topk == 1 and not _is_hip:`. The CUDA fast path is byte-identical to this PR; AMD falls through to the original softmax+fast_topk. 39 ins / 6 del, lint-clean. Happy to close #26407 and push the gate as a commit on this branch instead, whichever you prefer.
2. Close this PR and merge #26407 instead: same outcome, simpler attribution.

The /rerun-test failure you hit (nightly/weekly tests aren't dispatchable through this command) is because the AMD MTP test is registered as nightly-only. The workflow_dispatch I just kicked off is the only way to run it on-demand against an arbitrary branch — happy to do that for any iteration on this PR.

Long-term, gemini-code-assist suggested an alternative on #26358 — compute top1_prob via softmax-with-known-max identity, which keeps topk_p numerically correct on both backends without paying the full-vocab bandwidth cost. That'd let us drop the AMD gate in a follow-up.
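The identity mentioned here follows from the usual stable softmax: with m = max(logits), the top-1 probability is exp(m - m) / sum(exp(x_i - m)) = 1 / sum(exp(x_i - m)). A minimal sketch in plain Python, assuming lists in place of tensors; `top1_prob_known_max` and `full_softmax` are hypothetical names, and this is not the code proposed on #26358.

```python
import math


def full_softmax(logits):
    """Reference: materialise the whole probability vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_prob_known_max(logits):
    """Top-1 probability and index without building the full softmax output.

    Once the maximum logit m is known, the numerator of the top-1
    probability is exp(m - m) = 1, so only the denominator is needed.
    """
    m = max(logits)
    index = logits.index(m)
    denom = sum(math.exp(x - m) for x in logits)
    return 1.0 / denom, index
```

The sketch still reads every logit once to form the denominator; it only avoids storing the per-token probabilities. Whether that is cheap enough on a given GPU backend is not something this example shows.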

gemini-code-assist · 2026-05-26T13:42:04Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


Qiaolin-Yu · 2026-05-26T20:17:21Z

/tag-and-rerun-ci

Qiaolin-Yu · 2026-05-26T20:19:08Z

https://github.com/sgl-project/sglang/actions/runs/26472804994 track here

PR #26397 (`Reland softmax-skip for EAGLE draft when topk==1`) added a topk==1 fast path in both `EagleDraftWorker.draft_forward` and `EAGLEDraftExtendCudaGraphRunner.replay`: when `topk == 1` (and not ROCm) production skips the full-vocab softmax and emits `topk_p = ones_like(topk_index, dtype=float32)`. The CG-runner replay output therefore returns `topk_p = 1.0` for every row.

The test's eager reference helper `_run_eagle_draft_extend_eager` was still doing the old `softmax → fast_topk` path, producing real probabilities (~0.02 on these batches). `assert_outputs_close` then flagged a ~0.98 absolute drift across all 11 cuda-graph runner cases:

    AssertionError: Tensor-likes are not close!
    Mismatched elements: 2 / 2 (100.0%)
    Greatest absolute difference: 0.9804949760437012 (up to 0.03 allowed)

Mirror the production fast path in the eager reference. `topk_index` already matches (argmax of logits == argmax of softmax(logits)); only `topk_p` had to be aligned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
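The mismatch described in this commit message can be reproduced in miniature. The sketch below is plain Python, not the actual test code: `eager_reference` and its `mirror_fast_path` flag are hypothetical, lists stand in for tensors, and the logits are made-up values chosen so that the real top-1 probability is small.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def eager_reference(logits, topk, mirror_fast_path):
    """Hypothetical eager reference for one row of draft logits."""
    if topk == 1 and mirror_fast_path:
        # Mirrors the production fast path: argmax of raw logits, p = 1.0.
        return [1.0], [argmax(logits)]
    # Old reference behaviour: real probabilities from a full softmax.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:topk]
    return [probs[i] for i in order], order


# Made-up logits: 49 equal entries plus one slightly larger entry.
logits = [0.0] * 49 + [0.5]
production_p = [1.0]  # what the fast path emits for topk == 1

old_p, old_idx = eager_reference(logits, topk=1, mirror_fast_path=False)
new_p, new_idx = eager_reference(logits, topk=1, mirror_fast_path=True)

tolerance = 0.03  # the allowed absolute difference quoted in the assertion error
old_drift = abs(old_p[0] - production_p[0])
new_drift = abs(new_p[0] - production_p[0])
```

In this toy case the indices agree in both modes, and only the un-mirrored reference drifts beyond the tolerance, which matches the reasoning that only `topk_p` needed to be aligned.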

Revert "Revert "[perf][spec decoding] Skip full-vocab softmax in EAGL…

62ff4af

…E draft …" This reverts commit 9409969.

Qiaolin-Yu requested review from Ying1123, hnyls2002 and merrymercy as code owners May 26, 2026 10:11

gemini-code-assist Bot reviewed May 26, 2026


Qiaolin-Yu added 2 commits May 26, 2026 20:08

fix

86702aa

Merge remote-tracking branch 'origin/main' into revert-26358-revert-2…

c596204

…6235-eagle-topk1-softmax-skip

Qiaolin-Yu added high priority bypass-fastfail labels May 26, 2026

github-actions Bot added the run-ci label May 26, 2026

Qiaolin-Yu merged commit dd6f073 into main May 26, 2026
203 of 243 checks passed

Qiaolin-Yu deleted the revert-26358-revert-26235-eagle-topk1-softmax-skip branch May 26, 2026 21:14

This was referenced May 27, 2026

[perf][spec decoding] Re-land #26235 (skip EAGLE topk==1 softmax) for CUDA only #26407

Closed

Regression: DeepSeek-V3.2 GSM8K accuracy 0.955 → 0.760 on MI35x rocm700 after #3148 (Optimised dynamic per group scaled quant v2) ROCm/aiter#3366

Closed

michaelzhang-ai mentioned this pull request May 29, 2026

[spec decoding] Re-enable EAGLE topk==1 argmax fastpath on ROCm #26633

Open

Qiaolin-Yu merged 3 commits into main from revert-26358-revert-26235-eagle-topk1-softmax-skip

2 participants
