[perf][spec decoding] Re-land #26235 (skip EAGLE topk==1 softmax) for CUDA only by michaelzhang-ai · Pull Request #26407 · sgl-project/sglang

michaelzhang-ai · 2026-05-26T13:39:39Z

Motivation

Re-lands the perf optimization from #26235 (by @Qiaolin-Yu) — which skips the full-vocab softmax + fast_topk when self.topk == 1 and uses argmax(logits) with a placeholder topk_p = ones — but gates it off on ROCm/HIP, where it was the cause of R108 (DSv3.2 + MTP gsm8k accuracy collapse to 0.035 with ~96% invalid output).

#26235 was reverted in #26358 (merged 2026-05-26 09:47 UTC) to recover AMD nightly. This PR brings back the CUDA perf benefit without re-breaking AMD.

Why gate (rather than fix the AMD path)

Per verification on the revert PR (#26358), the AMD MTP draft path consumes topk_p somewhere downstream in a way that depends on it being the actual softmax probability, not a placeholder. The exact downstream read site has not been identified.

Gating off on HIP is the zero-correctness-risk way to preserve the CUDA perf win while keeping AMD safe. A future PR can identify the read site and re-enable the optimization on HIP once a more surgical fix is available — likely along the lines suggested by the gemini-code-assist reviewer on #26358 (compute the top-1 probability directly via the softmax-with-known-max identity, which keeps topk_p numerically correct on both backends without the bandwidth cost).
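
The softmax-with-known-max identity mentioned above says that the probability of the argmax token is `1 / sum(exp(logits - max))`, since its own numerator is `exp(0) = 1`. The following is a minimal plain-Python sketch of that identity, not the code proposed in #26358; the function name `top1_prob_known_max` is hypothetical and the list stands in for one row of logits.

```python
import math


def top1_prob_known_max(logits):
    """Return (top-1 probability, top-1 index) without building the full softmax vector.

    Hypothetical helper for illustration only.
    """
    m = max(logits)                # the known max logit
    idx = logits.index(m)          # position of the top-1 token (first max on ties)
    # The numerator for the max entry is exp(m - m) = 1, so only the
    # denominator (a single reduction over the vocab) is needed.
    denom = sum(math.exp(x - m) for x in logits)
    return 1.0 / denom, idx


logits = [0.5, 2.0, -1.0, 1.5]
p_top1, i_top1 = top1_prob_known_max(logits)
```

In tensor form this would presumably be a max reduction plus a log-sum-exp reduction instead of a materialized probability tensor. Whether that recovers the bandwidth saving on real hardware is the reviewer's suggestion and has not been measured in this PR.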

Modifications

Three call sites (the same three #26235 touched) now look like:

if self.topk == 1 and not _is_hip:
    # topk=1 → degenerate single-path tree; skip full-vocab softmax
    # and use argmax(logits) directly. Gated off on ROCm/HIP because
    # the MTP draft path is sensitive to whether topk_p is the true
    # probability or a placeholder; see #26358 (revert) / R108.
    ret.topk_index = torch.argmax(
        ret.next_token_logits, dim=-1, keepdim=True
    )
    ret.topk_p = torch.ones_like(ret.topk_index, dtype=torch.float32)
else:
    probs = torch.softmax(ret.next_token_logits, dim=-1)
    ret.topk_p, ret.topk_index = fast_topk(probs, self.topk, dim=-1)
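
To make the difference between the two branches concrete, here is a small plain-Python sketch (an illustration under stated assumptions, not the sglang code path). Because softmax is monotonic, both branches pick the same token index; only `topk_p` differs, which is the value the AMD MTP path appears to be sensitive to. The helper names are hypothetical.

```python
import math


def top1_full_softmax(logits):
    """Reference branch: softmax over the vocab, then take the top-1 entry."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return probs[idx], idx


def top1_argmax_placeholder(logits):
    """Fast branch: argmax on raw logits, placeholder probability of 1.0."""
    idx = max(range(len(logits)), key=logits.__getitem__)
    return 1.0, idx


logits = [0.5, 2.0, -1.0, 1.5]
p_ref, i_ref = top1_full_softmax(logits)
p_fast, i_fast = top1_argmax_placeholder(logits)
# Same index from both branches; p_ref is the true probability, p_fast is 1.0.
```

Any downstream consumer that only reads the index is unaffected by the fast branch, while one that reads `topk_p` as a real probability sees a different value.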

python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
- Added is_hip to the existing from sglang.srt.utils import (...) statement
- Added module-scope _is_hip = is_hip()
- Gated the run_once softmax+fast_topk site
python/sglang/srt/speculative/eagle_worker_v2.py
- _is_hip was already module-scope (line 89), no new imports
- Gated the draft_forward step site (~line 486)
- Gated the _draft_extend_for_decode site (~line 654)

Net: +39/-6 across 2 files. No behavior change on AMD (runtime path becomes identical to current main).

Verification plan

CUDA: PR's auto-triggered pr-test should re-validate the perf path. EAGLE/MTP tests on CUDA should be ≥ pre-revert baseline.
AMD ROCm: this PR is a runtime no-op on HIP, so the AMD CI behavior should match origin/main exactly. Will additionally request a targeted dispatch of nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp{,-rocm720} against this branch to prove the gate keeps R108 clear.

References

Original PR (reverted): [perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 #26235 (author)
Revert PR (merged 2026-05-26): Revert "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 (#26235)" #26358
R108 evidence on revert verify: https://github.com/sgl-project/sglang/actions/runs/26438872740/job/77828088922 (0.975 PASS on revert branch, was 0.035)
CI tracker: https://github.com/bingxche/sglang-ci-bot/issues/84 (cluster R108)

cc @Qiaolin-Yu — this preserves your CUDA optimization while keeping AMD safe. Open to suggestions for a follow-up that re-enables the optimization on HIP once the downstream read site is identified.

Checklist

Format your code — lint clean, no new diagnostics
Add unit tests — N/A (re-lands existing optimization with backend gate; existing EAGLE tests cover both paths)
Update documentation — N/A
Provide accuracy and speed benchmark results — see Evidence in #26358 PR body (0.035 → 0.975 on AMD MTP-rocm720 from removing the optimization)
Follow the SGLang code style guidance

CI States

Latest PR Test (Base): ❌ Run #26451630790
Latest PR Test (Extra): ❌ Run #26451630335

@Qiaolin-Yu

…oftmax) for CUDA only

Re-lands the perf optimization from sgl-project#26235, which skips the full-vocab softmax + fast_topk when `self.topk == 1` and uses `argmax(logits)` with a placeholder `topk_p = ones`, but gates it OFF on ROCm/HIP, where it was the cause of R108 (DSv3.2 + MTP gsm8k accuracy collapse to 0.035 with ~96% invalid output).

## Why gate, not redo

Per verification on the revert PR (sgl-project#26358), AMD MTP draft paths consume `topk_p` somewhere downstream in a way that depends on it being the actual softmax probability, not a placeholder. The exact downstream read site has not been identified; gating off on HIP is the zero-correctness-risk way to preserve the CUDA perf win while keeping AMD safe.

Evidence the gate is sufficient:
- Reverting all 3 sites recovered DSv3.2-MTP gsm8k on ROCm 7.2 from 0.035 → 0.975 ([revert verify](https://github.com/sgl-project/sglang/actions/runs/26438872740/job/77828088922)).
- Pre-sgl-project#26235 (with full-vocab softmax) was the historical green state on AMD for weeks; restoring that branch on HIP returns to known-good.

## What the gate looks like

3 sites get the same shape (verbatim across files):

if self.topk == 1 and not _is_hip:
    # topk=1 → degenerate single-path tree; skip full-vocab softmax
    # and use argmax(logits) directly. Gated off on ROCm/HIP because
    # the MTP draft path is sensitive to whether topk_p is the true
    # probability or a placeholder; see sgl-project#26358 (revert) / R108.
    ret.topk_index = torch.argmax(
        ret.next_token_logits, dim=-1, keepdim=True
    )
    ret.topk_p = torch.ones_like(ret.topk_index, dtype=torch.float32)
else:
    probs = torch.softmax(ret.next_token_logits, dim=-1)
    ret.topk_p, ret.topk_index = fast_topk(probs, self.topk, dim=-1)

`_is_hip` is already module-scope in eagle_worker_v2.py; added a parallel module-scope binding in eagle_draft_extend_cuda_graph_runner.py.

## Files

- python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py (+ is_hip import, + _is_hip module binding, + 1 site gate)
- python/sglang/srt/speculative/eagle_worker_v2.py (+ 2 site gates)

## Tested

- Lint clean (no new ruff/flake8 issues).
- AMD CI on the parent commit (origin/main = `a26913158`) is the baseline to compare against. This PR is a no-op on HIP at runtime, so AMD-side CI behavior should match origin/main exactly.

## References

- Original PR (reverted): sgl-project#26235 by @Qiaolin-Yu
- Revert PR (merged 2026-05-26): sgl-project#26358
- CI tracker: bingxche/sglang-ci-bot#84 (R108)

cc @Qiaolin-Yu (original author): this is a re-land of your perf optimization with the AMD safety gate we discussed in sgl-project#26358.

gemini-code-assist · 2026-05-26T13:40:12Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

michaelzhang-ai · 2026-05-27T00:53:29Z

Closing — the AMD-gate this PR was proposing (if self.topk == 1 and not _is_hip:) is already in main as of the merge of #26397 (commit dd6f073). The gate looks identical to what this PR proposed. Either Qiaolin-Yu cherry-picked the gate from this PR onto #26397 before merging, or arrived at the same fix independently — either way, the desired state is now landed.

cc @Qiaolin-Yu — thanks for incorporating the AMD gate. Will follow up with verification on the actual rocm720 dsv32-mtp nightly to confirm R108 stays clear under main + gate; will report results separately.

michaelzhang-ai requested review from Qiaolin-Yu, Ying1123, hnyls2002 and merrymercy as code owners May 26, 2026 13:39

michaelzhang-ai mentioned this pull request May 26, 2026

Reland "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 (#26235)" #26397

Merged

michaelzhang-ai marked this pull request as draft May 26, 2026 13:46

michaelzhang-ai closed this May 27, 2026

michaelzhang-ai deleted the reland-26235-eagle-topk1-skip-nv-only branch May 27, 2026 00:53

michaelzhang-ai restored the reland-26235-eagle-topk1-skip-nv-only branch May 29, 2026 00:53

michaelzhang-ai reopened this May 29, 2026

michaelzhang-ai closed this May 29, 2026

michaelzhang-ai wants to merge 1 commit into sgl-project:main from michaelzhang-ai:reland-26235-eagle-topk1-skip-nv-only
