Reland "[perf][spec decoding] Skip full-vocab softmax in EAGLE draft when topk == 1 (#26235)" #26397
…E draft …" This reverts commit 9409969.
/rerun-test test/registered/quant/test_deepseek_v32_fp4_mtp_4gpu.py
Code Review
This pull request optimizes speculative decoding when topk is set to 1. By checking if self.topk == 1, the code bypasses the expensive softmax and fast_topk operations, instead directly computing the argmax over the next token logits and setting the top-k probabilities to 1. This optimization is applied across eagle_draft_extend_cuda_graph_runner.py and eagle_worker_v2.py. No review comments were provided for this pull request.
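The fast path described in this review can be sketched in plain Python for a single row of logits. This is only an illustration under stated assumptions: `select_draft_tokens` and `softmax` are hypothetical helpers written for this example, not the functions in `eagle_worker_v2.py`, and the sorted top-k stands in for `fast_topk`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_draft_tokens(logits, topk):
    """Return (topk_p, topk_index) for one row (hypothetical helper)."""
    if topk == 1:
        # Fast path: softmax is monotonic, so the argmax of the raw logits
        # picks the same token; the probability is simply reported as 1.0.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0], [best]
    # General path: full softmax followed by a top-k selection
    # (a sorted stand-in for fast_topk, assumed for illustration).
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:topk]
    return [probs[i] for i in order], order
```

With `topk == 1` the function never exponentiates or normalizes over the vocabulary, which is where the saving comes from; the trade-off is that `topk_p` no longer carries a real probability.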
/rerun-test test/registered/amd/accuracy/mi35x/test_deepseek_v32_mtp_eval_mi35x.py
Hi @Qiaolin-Yu — thanks for re-landing the optimization. One concern with this exact form: it's a straight revert of the revert, so it brings back the same code that caused R108 (DSv3.2-MTP gsm8k → 0.035 with 96% invalid output on ROCm 7.2 mi35x). The R108 verify on the revert branch (run) just confirmed the revert recovered MTP from 0.035 → 0.975 on the exact same hardware + aiter pin, so the cause is the topk==1 path itself, not anything else in the regression window. I just dispatched Two ways forward, your call:
The Long-term, gemini-code-assist suggested an alternative on #26358: compute
…6235-eagle-topk1-softmax-skip
/tag-and-rerun-ci
PR #26397 (`Reland softmax-skip for EAGLE draft when topk==1`) added a topk==1 fast path in both `EagleDraftWorker.draft_forward` and `EAGLEDraftExtendCudaGraphRunner.replay`: when `topk == 1` (and not ROCm) production skips the full-vocab softmax and emits `topk_p = ones_like(topk_index, dtype=float32)`. The CG-runner replay output therefore returns `topk_p = 1.0` for every row.

The test's eager reference helper `_run_eagle_draft_extend_eager` was still doing the old `softmax → fast_topk` path, producing real probabilities (~0.02 on these batches). `assert_outputs_close` then flagged a ~0.98 absolute drift across all 11 cuda-graph runner cases:

    AssertionError: Tensor-likes are not close!
    Mismatched elements: 2 / 2 (100.0%)
    Greatest absolute difference: 0.9804949760437012 (up to 0.03 allowed)

Mirror the production fast path in the eager reference. `topk_index` already matches (argmax of logits == argmax of softmax(logits)); only `topk_p` had to be aligned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
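A small, dependency-free sketch of why only `topk_p` drifted. The names `production_topk1`, `eager_reference_topk1`, and `is_close` are hypothetical stand-ins for the production fast path, the test's eager helper, and `assert_outputs_close`; the ROCm gating is left out, and the 0.03 tolerance is taken from the error message above.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def production_topk1(row):
    """Fast path: argmax over raw logits, probability reported as 1.0."""
    index = max(range(len(row)), key=row.__getitem__)
    return 1.0, index


def eager_reference_topk1(row, mirror_fast_path):
    """Eager reference; optionally mirrors the production fast path."""
    probs = softmax(row)
    index = max(range(len(probs)), key=probs.__getitem__)
    # Old behaviour returns the real probability; the fix returns 1.0.
    topk_p = 1.0 if mirror_fast_path else probs[index]
    return topk_p, index


def is_close(a, b, atol=0.03):
    """Absolute-tolerance comparison (assumed stand-in for the test check)."""
    return abs(a - b) <= atol
```

On a nearly flat row of logits the selected index agrees in all three variants, while the un-mirrored reference returns a small real probability that falls far outside the tolerance when compared with 1.0.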
CI States
Latest PR Test (Base): 🚫 Run #26472585858
Latest PR Test (Extra): ✅ Run #26472613859