Skip to content

[perf][spec decoding] Re-land #26235 (skip EAGLE topk==1 softmax) for CUDA only#26407

Closed
michaelzhang-ai wants to merge 1 commit into
sgl-project:mainfrom
michaelzhang-ai:reland-26235-eagle-topk1-skip-nv-only
Closed

[perf][spec decoding] Re-land #26235 (skip EAGLE topk==1 softmax) for CUDA only#26407
michaelzhang-ai wants to merge 1 commit into
sgl-project:mainfrom
michaelzhang-ai:reland-26235-eagle-topk1-skip-nv-only

Conversation

@michaelzhang-ai
Copy link
Copy Markdown
Collaborator

@michaelzhang-ai michaelzhang-ai commented May 26, 2026

Motivation

Re-lands the perf optimization from #26235 (by @Qiaolin-Yu) — which skips the full-vocab softmax + fast_topk when self.topk == 1 and uses argmax(logits) with a placeholder topk_p = ones — but gates it off on ROCm/HIP, where it was the cause of R108 (DSv3.2 + MTP gsm8k accuracy collapse to 0.035 with ~96% invalid output).

#26235 was reverted in #26358 (merged 2026-05-26 09:47 UTC) to recover AMD nightly. This PR brings back the CUDA perf benefit without re-breaking AMD.

Why gate (rather than fix the AMD path)

Per verification on the revert PR (#26358), the AMD MTP draft path consumes topk_p somewhere downstream in a way that depends on it being the actual softmax probability, not a placeholder. The exact downstream read site has not been identified.

Gating off on HIP is the zero-correctness-risk way to preserve the CUDA perf win while keeping AMD safe. A future PR can identify the read site and re-enable the optimization on HIP once a more surgical fix is available — likely along the lines suggested by the gemini-code-assist reviewer on #26358 (compute the top-1 probability directly via the softmax-with-known-max identity, which keeps topk_p numerically correct on both backends without the bandwidth cost).

Modifications

Three call sites (the same three #26235 touched) now look like:

if self.topk == 1 and not _is_hip:
    # topk=1 → degenerate single-path tree; skip full-vocab softmax
    # and use argmax(logits) directly. Gated off on ROCm/HIP because
    # the MTP draft path is sensitive to whether topk_p is the true
    # probability or a placeholder; see #26358 (revert) / R108.
    ret.topk_index = torch.argmax(
        ret.next_token_logits, dim=-1, keepdim=True
    )
    ret.topk_p = torch.ones_like(ret.topk_index, dtype=torch.float32)
else:
    probs = torch.softmax(ret.next_token_logits, dim=-1)
    ret.topk_p, ret.topk_index = fast_topk(probs, self.topk, dim=-1)
  • python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
    • Added is_hip to the existing from sglang.srt.utils import (...)
    • Added module-scope _is_hip = is_hip()
    • Gated the run_once softmax+fast_topk site
  • python/sglang/srt/speculative/eagle_worker_v2.py
    • _is_hip was already module-scope (line 89), no new imports
    • Gated the draft_forward step site (~line 486)
    • Gated the _draft_extend_for_decode site (~line 654)

Net: +39/-6 across 2 files. No behavior change on AMD (runtime path becomes identical to current main).

Verification plan

  • CUDA: PR's auto-triggered pr-test should re-validate the perf path. EAGLE/MTP tests on CUDA should be ≥ pre-revert baseline.
  • AMD ROCm: this PR is a runtime no-op on HIP, so the AMD CI behavior should match origin/main exactly. Will additionally request a targeted dispatch of nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp{,-rocm720} against this branch to prove the gate keeps R108 clear.

References

cc @Qiaolin-Yu — this preserves your CUDA optimization while keeping AMD safe. Open to suggestions for a follow-up that re-enables the optimization on HIP once the downstream read site is identified.

Checklist

  • Format your code — lint clean, no new diagnostics
  • Add unit tests — N/A (re-lands existing optimization with backend gate; existing EAGLE tests cover both paths)
  • Update documentation — N/A
  • Provide accuracy and speed benchmark results — see Evidence in #26358 PR body (0.035 → 0.975 on AMD MTP-rocm720 from removing the optimization)
  • Follow the SGLang code style guidance

CI States

Latest PR Test (Base): ❌ Run #26451630790
Latest PR Test (Extra): ❌ Run #26451630335

…oftmax) for CUDA only

Re-lands the perf optimization from sgl-project#26235 — which skips the full-vocab
softmax + fast_topk when `self.topk == 1` and uses `argmax(logits)` with
a placeholder `topk_p = ones` — but gates it OFF on ROCm/HIP, where it
was the cause of R108 (DSv3.2 + MTP gsm8k accuracy collapse to 0.035
with ~96% invalid output).

## Why gate, not redo

Per verification on the revert PR (sgl-project#26358), AMD MTP draft paths consume
`topk_p` somewhere downstream in a way that depends on it being the
actual softmax probability, not a placeholder. The exact downstream
read site has not been identified; gating off on HIP is the
zero-correctness-risk way to preserve the CUDA perf win while keeping
AMD safe.

Evidence the gate is sufficient:
- Reverting all 3 sites recovered DSv3.2-MTP gsm8k on ROCm 7.2 from
  0.035 → 0.975 ([revert verify](https://github.com/sgl-project/sglang/actions/runs/26438872740/job/77828088922)).
- Pre-sgl-project#26235 (with full-vocab softmax) was the historical green state
  on AMD for weeks; restoring that branch on HIP returns to known-good.

## What the gate looks like

3 sites get the same shape (verbatim across files):

    if self.topk == 1 and not _is_hip:
        # topk=1 → degenerate single-path tree; skip full-vocab softmax
        # and use argmax(logits) directly. Gated off on ROCm/HIP because
        # the MTP draft path is sensitive to whether topk_p is the true
        # probability or a placeholder; see sgl-project#26358 (revert) / R108.
        ret.topk_index = torch.argmax(
            ret.next_token_logits, dim=-1, keepdim=True
        )
        ret.topk_p = torch.ones_like(ret.topk_index, dtype=torch.float32)
    else:
        probs = torch.softmax(ret.next_token_logits, dim=-1)
        ret.topk_p, ret.topk_index = fast_topk(probs, self.topk, dim=-1)

`_is_hip` is already module-scope in eagle_worker_v2.py; added a parallel
module-scope binding in eagle_draft_extend_cuda_graph_runner.py.

## Files

- python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
  (+ is_hip import, + _is_hip module binding, + 1 site gate)
- python/sglang/srt/speculative/eagle_worker_v2.py (+ 2 site gates)

## Tested

- Lint clean (no new ruff/flake8 issues).
- AMD CI on the parent commit (origin/main = `a26913158`) is the baseline
  to compare against — this PR is a no-op on HIP at runtime, so AMD-side
  CI behavior should match origin/main exactly.

## References

- Original PR (reverted): sgl-project#26235 by @Qiaolin-Yu
- Revert PR (merged 2026-05-26): sgl-project#26358
- CI tracker: bingxche/sglang-ci-bot#84 (R108)

cc @Qiaolin-Yu (original author) — this is a re-land of your perf
optimization with the AMD safety gate we discussed in sgl-project#26358.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@michaelzhang-ai
Copy link
Copy Markdown
Collaborator Author

Closing — the AMD-gate this PR was proposing (if self.topk == 1 and not _is_hip:) is already in main as of the merge of #26397 (commit dd6f073). The gate looks identical to what this PR proposed. Either Qiaolin-Yu cherry-picked the gate from this PR onto #26397 before merging, or arrived at the same fix independently — either way, the desired state is now landed.

cc @Qiaolin-Yu — thanks for incorporating the AMD gate. Will follow up with verification on the actual rocm720 dsv32-mtp nightly to confirm R108 stays clear under main + gate; will report results separately.

@michaelzhang-ai michaelzhang-ai deleted the reland-26235-eagle-topk1-skip-nv-only branch May 27, 2026 00:53
@michaelzhang-ai michaelzhang-ai restored the reland-26235-eagle-topk1-skip-nv-only branch May 29, 2026 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant