[spec decoding] Re-enable EAGLE topk==1 argmax fastpath on ROCm #26633

Open: michaelzhang-ai wants to merge 2 commits into main from reenable-eagle-topk1-rocm-fastpath

michaelzhang-ai commented May 29, 2026

Motivation

Removes the `not _is_hip` gate added in #26397, so the `topk == 1` argmax fastpath (which skips the full-vocab softmax and `fast_topk`) runs on ROCm/HIP as well as CUDA.
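
For readers unfamiliar with the fastpath, here is a minimal pure-Python sketch of why it is valid when `topk == 1`: softmax is monotonic, so the index of the largest logit is also the index of the largest probability, and the full-vocab softmax can be skipped when only the top-1 token id is needed. The helper names `top1_via_softmax` and `top1_via_argmax` are hypothetical and are not the sglang implementation, which operates on GPU tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_via_softmax(logits):
    """Slow path (hypothetical helper): full softmax, then pick the best index."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, probs[idx]


def top1_via_argmax(logits):
    """Fast path (hypothetical helper): argmax directly on the raw logits."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy "vocabulary" of five tokens; values are illustrative only.
logits = [1.5, -0.3, 4.2, 0.0, 3.9]
slow_idx, slow_prob = top1_via_softmax(logits)
fast_idx = top1_via_argmax(logits)

# Both paths select the same token id.
assert slow_idx == fast_idx
```

The sketch covers only the selected token id. Whether the real fastpath also needs the top-1 probability downstream is not stated in this PR description.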

The gate was added defensively when R108 (DSv3.2-MTP gsm8k → 0.040 with ~96% invalid output) appeared on the 2026-05-25 rocm720 nightly. Subsequent investigation indicates the gate is not load-bearing and R108 was a single transient blip, not a deterministic regression from this optimization.

Evidence the gate isn't needed

R108 fired exactly once and self-recovered on gated main (nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp-rocm720):

| Date | Result | Score |
| --- | --- | --- |
| 2026-05-25 | ❌ FAIL | 0.040 |
| 2026-05-26 | ✅ PASS | 0.960 |
| 2026-05-27 | ✅ PASS | 0.960 |
| 2026-05-28 | ✅ PASS | 0.965 |

The un-gated code passes in isolation — same code path and aiter pin that scored 0.040 in the full suite scored 0.965–0.970 in isolated single-job dispatch:

| Run | Code | Context | Score |
| --- | --- | --- | --- |
| daily nightly 05-25 | un-gated | full suite | 0.040 FAIL |
| 26484542377 | un-gated b13d3d18c | isolated | 0.965 PASS |
| 26451704480 | un-gated 62ff4af990 | isolated | 0.970 PASS |

The only variable that flips the result is full-suite vs isolated dispatch — not the gate — which points to an environmental cause (cumulative runner state across back-to-back 8-GPU jobs), not the argmax fastpath.

Modifications

Removes `and not _is_hip` at the 3 EAGLE draft sites (restoring #26235's original CUDA+ROCm behavior), and drops the now-unused `is_hip` import and `_is_hip` binding from `eagle_draft_extend_cuda_graph_runner.py`. `_is_hip` is retained in `eagle_worker_v2.py` (still used independently at line 317).

  • python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
  • python/sglang/srt/speculative/eagle_worker_v2.py

Net: +3 / −14.
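
As a rough illustration of what removing the gate changes, the sketch below models the branch condition as a pure function. The function `uses_argmax_fastpath` and its `gated` parameter are hypothetical and exist only for this example; the actual conditions at the three draft sites may include other terms not shown in this description.

```python
def uses_argmax_fastpath(topk: int, is_hip: bool, gated: bool) -> bool:
    """Hypothetical model of the fastpath condition, not the sglang source."""
    if gated:
        # Behavior with the gate from #26397: ROCm/HIP always takes the slow path.
        return topk == 1 and not is_hip
    # Behavior this PR restores: the platform no longer matters.
    return topk == 1


# Enumerate the cases where the two variants disagree.
changed = [
    (topk, is_hip)
    for topk in (1, 4)
    for is_hip in (False, True)
    if uses_argmax_fastpath(topk, is_hip, gated=True)
    != uses_argmax_fastpath(topk, is_hip, gated=False)
]
print(changed)
```

Under this model the only case that changes is `topk == 1` on HIP. CUDA behavior and all `topk > 1` paths are unaffected.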

Validation (this is a draft pending AMD nightly)

Dispatching the full nightly-test-amd-rocm720 suite on this branch to confirm the un-gated path holds at ~0.96 under full-suite load (the only context in which R108 ever appeared). Will mark ready once the AMD MTP nightly is green on this branch. If it reproduces 0.040, I'll close this and keep the gate.

References

cc @Qiaolin-Yu


CI States

Latest PR Test (Base): ❌ Run #26612546618
Latest PR Test (Extra): ❌ Run #26612548549

Experiment: removes the `not _is_hip` gate added in #26397 so the
topk==1 argmax fastpath also runs on ROCm/HIP.

Rationale: R108 (DSv3.2-MTP gsm8k → 0.035) only ever reproduced inside
the full AMD nightly suite. In isolated single-job dispatch the un-gated
code passes consistently (0.965-0.970), same as the gated/softmax path
(0.975) — so the gate is not load-bearing and the failure looks
environment-driven rather than caused by the argmax fastpath. This branch
re-enables ROCm to test that hypothesis under full-suite conditions.

Removes the now-unused is_hip import + _is_hip binding from
eagle_draft_extend_cuda_graph_runner.py; keeps _is_hip in eagle_worker_v2.py
(still used independently at line 317).

@michaelzhang-ai michaelzhang-ai marked this pull request as ready for review May 29, 2026 01:30