[perf][spec decoding] Re-land #26235 (skip EAGLE topk==1 softmax) for CUDA only#26407
Closed
michaelzhang-ai wants to merge 1 commit into
Closed
Conversation
…oftmax) for CUDA only Re-lands the perf optimization from sgl-project#26235 — which skips the full-vocab softmax + fast_topk when `self.topk == 1` and uses `argmax(logits)` with a placeholder `topk_p = ones` — but gates it OFF on ROCm/HIP, where it was the cause of R108 (DSv3.2 + MTP gsm8k accuracy collapse to 0.035 with ~96% invalid output). ## Why gate, not redo Per verification on the revert PR (sgl-project#26358), AMD MTP draft paths consume `topk_p` somewhere downstream in a way that depends on it being the actual softmax probability, not a placeholder. The exact downstream read site has not been identified; gating off on HIP is the zero-correctness-risk way to preserve the CUDA perf win while keeping AMD safe. Evidence the gate is sufficient: - Reverting all 3 sites recovered DSv3.2-MTP gsm8k on ROCm 7.2 from 0.035 → 0.975 ([revert verify](https://github.com/sgl-project/sglang/actions/runs/26438872740/job/77828088922)). - Pre-sgl-project#26235 (with full-vocab softmax) was the historical green state on AMD for weeks; restoring that branch on HIP returns to known-good. ## What the gate looks like 3 sites get the same shape (verbatim across files): if self.topk == 1 and not _is_hip: # topk=1 → degenerate single-path tree; skip full-vocab softmax # and use argmax(logits) directly. Gated off on ROCm/HIP because # the MTP draft path is sensitive to whether topk_p is the true # probability or a placeholder; see sgl-project#26358 (revert) / R108. ret.topk_index = torch.argmax( ret.next_token_logits, dim=-1, keepdim=True ) ret.topk_p = torch.ones_like(ret.topk_index, dtype=torch.float32) else: probs = torch.softmax(ret.next_token_logits, dim=-1) ret.topk_p, ret.topk_index = fast_topk(probs, self.topk, dim=-1) `_is_hip` is already module-scope in eagle_worker_v2.py; added a parallel module-scope binding in eagle_draft_extend_cuda_graph_runner.py. ## Files - python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py (+ is_hip import, + _is_hip module binding, + 1 site gate) - python/sglang/srt/speculative/eagle_worker_v2.py (+ 2 site gates) ## Tested - Lint clean (no new ruff/flake8 issues). - AMD CI on the parent commit (origin/main = `a26913158`) is the baseline to compare against — this PR is a no-op on HIP at runtime, so AMD-side CI behavior should match origin/main exactly. ## References - Original PR (reverted): sgl-project#26235 by @Qiaolin-Yu - Revert PR (merged 2026-05-26): sgl-project#26358 - CI tracker: bingxche/sglang-ci-bot#84 (R108) cc @Qiaolin-Yu (original author) — this is a re-land of your perf optimization with the AMD safety gate we discussed in sgl-project#26358.
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Collaborator
Author
|
Closing — the AMD-gate this PR was proposing ( cc @Qiaolin-Yu — thanks for incorporating the AMD gate. Will follow up with verification on the actual rocm720 dsv32-mtp nightly to confirm R108 stays clear under main + gate; will report results separately. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Re-lands the perf optimization from #26235 (by @Qiaolin-Yu) — which skips the full-vocab softmax +
fast_topkwhenself.topk == 1and usesargmax(logits)with a placeholdertopk_p = ones— but gates it off on ROCm/HIP, where it was the cause of R108 (DSv3.2 + MTP gsm8k accuracy collapse to 0.035 with ~96% invalid output).#26235 was reverted in #26358 (merged 2026-05-26 09:47 UTC) to recover AMD nightly. This PR brings back the CUDA perf benefit without re-breaking AMD.
Why gate (rather than fix the AMD path)
Per verification on the revert PR (#26358), the AMD MTP draft path consumes
topk_psomewhere downstream in a way that depends on it being the actual softmax probability, not a placeholder. The exact downstream read site has not been identified.Gating off on HIP is the zero-correctness-risk way to preserve the CUDA perf win while keeping AMD safe. A future PR can identify the read site and re-enable the optimization on HIP once a more surgical fix is available — likely along the lines suggested by the gemini-code-assist reviewer on #26358 (compute the top-1 probability directly via the softmax-with-known-max identity, which keeps
topk_pnumerically correct on both backends without the bandwidth cost).Modifications
Three call sites (the same three #26235 touched) now look like:
python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.pyis_hipto the existingfrom sglang.srt.utils import (...)_is_hip = is_hip()run_oncesoftmax+fast_topk sitepython/sglang/srt/speculative/eagle_worker_v2.py_is_hipwas already module-scope (line 89), no new importsdraft_forwardstep site (~line 486)_draft_extend_for_decodesite (~line 654)Net: +39/-6 across 2 files. No behavior change on AMD (runtime path becomes identical to current main).
Verification plan
pr-testshould re-validate the perf path. EAGLE/MTP tests on CUDA should be ≥ pre-revert baseline.nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp{,-rocm720}against this branch to prove the gate keeps R108 clear.References
cc @Qiaolin-Yu — this preserves your CUDA optimization while keeping AMD safe. Open to suggestions for a follow-up that re-enables the optimization on HIP once the downstream read site is identified.
Checklist
CI States
Latest PR Test (Base): ❌ Run #26451630790
Latest PR Test (Extra): ❌ Run #26451630335