
Reland "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 (#26235)" #26397

Merged
Qiaolin-Yu merged 3 commits into main from revert-26358-revert-26235-eagle-topk1-softmax-skip
May 26, 2026

Conversation

@Qiaolin-Yu
Collaborator

@Qiaolin-Yu Qiaolin-Yu commented May 26, 2026


CI States

Latest PR Test (Base): 🚫 Run #26472585858
Latest PR Test (Extra): ✅ Run #26472613859

@Qiaolin-Yu
Collaborator Author

/rerun-test test/registered/quant/test_deepseek_v32_fp4_mtp_4gpu.py

@github-actions
Contributor

github-actions Bot commented May 26, 2026

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/quant/test_deepseek_v32_fp4_mtp_4gpu.py

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes speculative decoding when topk is set to 1. By checking if self.topk == 1, the code bypasses the expensive softmax and fast_topk operations, instead directly computing the argmax over the next token logits and setting the top-k probabilities to 1. This optimization is applied across eagle_draft_extend_cuda_graph_runner.py and eagle_worker_v2.py. No review comments were provided for this pull request.
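
The equivalence this summary relies on can be sketched in plain Python. This is a minimal illustration, not the SGLang code: `reference_top1` and `fast_top1` are hypothetical names, and a Python list stands in for one row of the logits tensor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reference_top1(logits):
    """Original path: full softmax, then pick the most probable token."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, probs[idx]


def fast_top1(logits):
    """Fast path: argmax over raw logits, probability reported as 1.0."""
    idx = max(range(len(logits)), key=logits.__getitem__)
    return idx, 1.0


logits = [0.5, 2.0, -1.0, 1.5]  # made-up values for illustration
ref_idx, ref_p = reference_top1(logits)
fast_idx, fast_p = fast_top1(logits)
```

Because softmax is monotonic, both paths select the same index. Only the reported probability differs: the reference returns the true softmax value, while the fast path returns a constant 1.0.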

@Qiaolin-Yu
Collaborator Author

/rerun-test test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py

@github-actions
Contributor

github-actions Bot commented May 26, 2026

test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py: No register_cuda_ci(runner_config=...) or register_cpu_ci() found in test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py. /rerun-test only supports tests registered via the new-style yml-driven API; nightly/weekly tests aren't dispatchable through this command.

@michaelzhang-ai
Collaborator

Hi @Qiaolin-Yu — thanks for re-landing the optimization. One concern with this exact form: it's a straight revert of the revert, so it brings back the same code that caused R108 (DSv3.2-MTP gsm8k → 0.035 with 96% invalid output on ROCm 7.2 mi35x). The R108 verify on the revert branch (run) just confirmed the revert recovered MTP from 0.035 → 0.975 on the exact same hardware + aiter pin, so the cause is the topk==1 path itself, not anything else in the regression window.

I just dispatched nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720 against this PR's branch to confirm — ETA ~50 min. If the AMD-side bug is robust (which the prior evidence suggests), it'll fail again at ~0.035.

Two ways forward, your call:

  1. Apply a tiny AMD gate to this PR. I opened #26407 with the same optimization but wrapped in `if self.topk == 1 and not _is_hip:`. The CUDA fast path is byte-identical to this PR; AMD falls through to the original softmax+fast_topk. 39 ins / 6 del, lint-clean. Happy to close #26407 and push the gate as a commit on this branch instead, whichever you prefer.

  2. Close this PR and merge #26407 instead — same outcome, simpler attribution.
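
For reference, the gate described in option 1 might look roughly like the sketch below. It is written in plain Python under stated assumptions: `select_topk` is a hypothetical helper, the `is_hip` parameter stands in for the module-level `_is_hip` flag, and the sorted-based top-k is a stand-in for `fast_topk`, not its actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_topk(logits, topk, is_hip):
    """Return (indices, probabilities) for the top-k draft tokens.

    The fast path is taken only when topk == 1 and the backend is not
    ROCm/HIP; every other case falls through to softmax + top-k.
    """
    if topk == 1 and not is_hip:
        idx = max(range(len(logits)), key=logits.__getitem__)
        return [idx], [1.0]
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    chosen = order[:topk]
    return chosen, [probs[i] for i in chosen]


logits = [0.5, 2.0, -1.0, 1.5]  # made-up values for illustration
cuda_idx, cuda_p = select_topk(logits, topk=1, is_hip=False)
hip_idx, hip_p = select_topk(logits, topk=1, is_hip=True)
```

Both backends agree on the selected token; the gated ROCm path simply keeps the real softmax probability instead of the constant 1.0.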

The /rerun-test failure you hit (nightly/weekly tests aren't dispatchable through this command) is because the AMD MTP test is registered as nightly-only. The workflow_dispatch I just kicked off is the only way to run it on-demand against an arbitrary branch — happy to do that for any iteration on this PR.

Long-term, gemini-code-assist suggested an alternative on #26358 — compute top1_prob via softmax-with-known-max identity, which keeps topk_p numerically correct on both backends without paying the full-vocab bandwidth cost. That'd let us drop the AMD gate in a follow-up.
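
One reading of the softmax-with-known-max identity: if m is the largest logit, the top-1 probability is exp(m - m) / Σ exp(x_j - m) = 1 / Σ exp(x_j - m), so no full probability vector has to be materialized. The sketch below only checks that identity numerically in plain Python; `top1_prob_known_max` is a hypothetical name, and the exact form proposed on #26358 is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax, used here only as the reference."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_prob_known_max(logits):
    """Top-1 index and probability without building the full softmax.

    With m = max(logits), softmax(logits)[argmax] equals
    1 / sum(exp(x - m) for x in logits).
    """
    m = max(logits)
    idx = logits.index(m)
    denom = sum(math.exp(x - m) for x in logits)
    return idx, 1.0 / denom


logits = [0.5, 2.0, -1.0, 1.5]  # made-up values for illustration
idx, p = top1_prob_known_max(logits)
full = softmax(logits)
```

Whether this saves enough memory traffic in a real kernel depends on how the reduction is fused, which this sketch does not model.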

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@Qiaolin-Yu
Collaborator Author

/tag-and-rerun-ci

@Qiaolin-Yu
Collaborator Author

Tracking here: https://github.com/sgl-project/sglang/actions/runs/26472804994

@Qiaolin-Yu Qiaolin-Yu merged commit dd6f073 into main May 26, 2026
203 of 243 checks passed
@Qiaolin-Yu Qiaolin-Yu deleted the revert-26358-revert-26235-eagle-topk1-softmax-skip branch May 26, 2026 21:14
ch-wan added a commit that referenced this pull request May 28, 2026
PR #26397 (`Reland softmax-skip for EAGLE draft when topk==1`) added a
topk==1 fast path in both `EagleDraftWorker.draft_forward` and
`EAGLEDraftExtendCudaGraphRunner.replay`: when `topk == 1` (and not
ROCm) production skips the full-vocab softmax and emits
`topk_p = ones_like(topk_index, dtype=float32)`. The CG-runner replay
output therefore returns `topk_p = 1.0` for every row.

The test's eager reference helper `_run_eagle_draft_extend_eager` was
still doing the old `softmax → fast_topk` path, producing real
probabilities (~0.02 on these batches). `assert_outputs_close` then
flagged a ~0.98 absolute drift across all 11 cuda-graph runner cases:

  AssertionError: Tensor-likes are not close!
  Mismatched elements: 2 / 2 (100.0%)
  Greatest absolute difference: 0.9804949760437012 (up to 0.03 allowed)

Mirror the production fast path in the eager reference. `topk_index`
already matches (argmax of logits == argmax of softmax(logits)); only
`topk_p` had to be aligned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
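
The mismatch described in this commit message can be reproduced in miniature. The sketch below is a plain-Python illustration with made-up inputs: `is_close` is a simplified absolute-tolerance check, not the test's `assert_outputs_close`, and the 50 equal logits are chosen only so the top-1 probability comes out at 1/50 = 0.02, in line with the ~0.02 quoted above.

```python
import math


def top1_prob(logits):
    """True softmax probability of the argmax token."""
    m = max(logits)
    return 1.0 / sum(math.exp(x - m) for x in logits)


def is_close(actual, expected, atol):
    """Simplified absolute-tolerance comparison (assumption, see lead-in)."""
    return abs(actual - expected) <= atol


ATOL = 0.03  # tolerance quoted in the assertion message

logits = [0.0] * 50              # made-up row: 50 equal logits
eager_old = top1_prob(logits)    # old eager reference: real probability
graph_fast = 1.0                 # CUDA-graph replay with the fast path
eager_mirrored = 1.0             # eager reference after mirroring the fast path

old_matches = is_close(graph_fast, eager_old, ATOL)
new_matches = is_close(graph_fast, eager_mirrored, ATOL)
```

With the old reference the difference is about 0.98, far above the 0.03 tolerance; once the reference mirrors the fast path, both sides report 1.0 and the comparison passes.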