Fix DSv4 attention backend for EAGLE per-step draft #24750
Conversation
Force-pushed from 4518746 to a7d3504 (Compare)
The ``DeepseekV4MultiStepDraftBackend`` (added in sgl-project#23882) constructs one ``DeepseekV4AttnBackend`` per spec step and forwards the same ``ForwardBatch`` to each step's ``init_forward_metadata``. For ``forward_mode == decode``, that method passed ``forward_batch.out_cache_loc`` straight through to ``init_forward_metadata_decode``, which then asserted ``out_cache_loc.shape[0] == req_pool_indices.shape[0] == seq_lens.shape[0]``.

But in EAGLE draft, ``forward_batch.out_cache_loc`` has shape ``[bs * speculative_num_steps]`` (the full spec-decode cache-loc tensor — every step writes one new slot per request), while ``req_pool_indices`` and ``seq_lens`` are still at the unrepeated ``[bs]`` shape. The assertion fires whenever a draft batch reaches this path with ``bs >= 2``:

    AssertionError: req_pool_indices.shape=torch.Size([2]) seq_lens.shape=torch.Size([2]) out_cache_loc.shape=torch.Size([6])

In larger configurations (TP=8 / EP=8 / DP=8 multi-node decode) the shapes can coincidentally line up so the assertion does not fire, but the metadata is still mis-aligned: GSM8K accuracy collapses from ~0.93 to ~0.42 and decoded outputs are visibly malformed (stray ``Weapon:`` / ``Weaponry`` / ``Weaponized`` prefixes).

Mirror the FA3 backend's draft-decode handling in ``flashattention_backend.py`` (``cache_seqlens_int32 = seqlens_in_batch + (self.speculative_step_id + 1)``): when the backend was constructed as a per-step draft step (``self.speculative_num_steps > 0``), slice ``out_cache_loc`` to this step's portion and advance ``seq_lens`` / ``max_seq_len`` by ``step_id + 1`` before calling ``init_forward_metadata_decode``.

Closes sgl-project#24747

Signed-off-by: Cheng Wan <wan4ch@gmail.com>
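To make the shape mismatch concrete, a small self-contained illustration with the same numbers as the traceback (plain tensors standing in for the `ForwardBatch` fields, not the backend's actual code):

```python
import torch

bs = 2                      # draft-batch size from the traceback above
speculative_num_steps = 3   # EAGLE draft steps

# What the per-step draft path hands to init_forward_metadata_decode:
req_pool_indices = torch.arange(bs)                       # shape [2]
seq_lens = torch.tensor([128, 256])                       # shape [2]
out_cache_loc = torch.arange(bs * speculative_num_steps)  # shape [6]: slots for all steps

# The backend's precondition; with these shapes it raises exactly as in the
# AssertionError quoted above.
assert (
    out_cache_loc.shape[0] == req_pool_indices.shape[0] == seq_lens.shape[0]
), f"{req_pool_indices.shape=} {seq_lens.shape=} {out_cache_loc.shape=}"
```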
Force-pushed from a7d3504 to ac6b542 (Compare)
Heads-up: posted an update on #24747 — this PR holds for the monolithic case (verified gsm8k 0.975 / 200 questions on a TP=4 single-node setup with the failing 5-shot returning the correct answer end-to-end), but the disaggregated prefill+decode + EAGLE+DSv4 path is still broken with the same patch applied. Same image, same recipe topology, prompt_tokens identical to the working monolithic path (1128), MTP accept-rate healthy, but the model resumes from a previous few-shot assistant turn instead of generating from […]
    if bucket == _GraphBucket.DECODE_OR_IDLE:
        assert out_cache_loc is not None
        assert len(out_cache_loc.shape) == 1, f"{out_cache_loc.shape=}"
        if self.speculative_num_steps > 0:
If the bug only happens when CUDA graph is disabled, are these lines unnecessary?
    if forward_batch.forward_mode.is_decode_or_idle():
        out_cache_loc = forward_batch.out_cache_loc
        if self.speculative_num_steps > 0:
Maybe only do this when CUDA graph is disabled.
Summary
The `DeepseekV4MultiStepDraftBackend` (added in #23882) constructs one `DeepseekV4AttnBackend` per spec step and forwards the same `ForwardBatch` to each step's `init_forward_metadata`. For `forward_mode == decode`, that method passed `forward_batch.out_cache_loc` straight through to `init_forward_metadata_decode`, which then asserted `out_cache_loc.shape[0] == req_pool_indices.shape[0] == seq_lens.shape[0]`.

But in EAGLE draft, `forward_batch.out_cache_loc` has shape `[bs * speculative_num_steps]` (the full spec-decode cache-loc tensor — every step writes one new slot per request), while `req_pool_indices` and `seq_lens` are still at the unrepeated `[bs]` shape. The assertion fires whenever a draft batch reaches this path with `bs >= 2`:

    AssertionError: req_pool_indices.shape=torch.Size([2]) seq_lens.shape=torch.Size([2]) out_cache_loc.shape=torch.Size([6])

In larger configurations (TP=8 / EP=8 / DP=8 multi-node decode) the shapes can coincidentally line up so the assertion does not fire, but the metadata is still mis-aligned: GSM8K accuracy collapses from ~0.93 to ~0.42 and decoded outputs are visibly malformed (stray `Weapon:` / `Weaponry` / `Weaponized` prefixes).

Mirror the FA3 backend's draft-decode handling in `flashattention_backend.py` (`cache_seqlens_int32 = seqlens_in_batch + (self.speculative_step_id + 1)`): when the backend was constructed as a per-step draft step (`self.speculative_num_steps > 0`), slice `out_cache_loc` to this step's portion and advance `seq_lens` / `max_seq_len` by `step_id + 1` before calling `init_forward_metadata_decode`.

Closes #24747.
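A minimal sketch of that per-step adjustment, assuming a step-major layout for `out_cache_loc` (step `i` owning slots `[i * bs, (i + 1) * bs)`) and hypothetical helper/parameter names; the backend's actual slicing order may differ:

```python
import torch

def per_step_draft_decode_inputs(
    out_cache_loc: torch.Tensor,  # [bs * speculative_num_steps]: slots for every draft step
    seq_lens: torch.Tensor,       # [bs]: unrepeated per-request lengths
    max_seq_len: int,
    step_id: int,
    speculative_num_steps: int,
):
    """Slice the spec-decode cache locations down to one draft step and advance
    the sequence lengths by step_id + 1, mirroring FA3's draft-decode handling."""
    bs = seq_lens.shape[0]
    # Assumed step-major layout: step i owns slots [i * bs, (i + 1) * bs).
    step_out_cache_loc = out_cache_loc.view(speculative_num_steps, bs)[step_id]
    step_seq_lens = seq_lens + (step_id + 1)
    step_max_seq_len = max_seq_len + (step_id + 1)
    return step_out_cache_loc, step_seq_lens, step_max_seq_len

# With bs = 2 and 3 draft steps, each step now receives shape-[2] tensors, so the
# shape assertion in init_forward_metadata_decode holds again.
loc, lens, max_len = per_step_draft_decode_inputs(
    torch.arange(6), torch.tensor([128, 256]), 256, step_id=1, speculative_num_steps=3
)
assert loc.shape == lens.shape == (2,)
```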
Reproducer
`lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034` (built from main `9ee83034`): Send any 5-shot gsm8k chat-completion request; the server crashes with the assertion above. With this patch, the server completes normally and returns the correct answer (verified on the same prompt that GB300 disagg-decode CI was getting wrong).
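For reference, one way to send such a request; the `/v1/chat/completions` endpoint is sglang's OpenAI-compatible API, while the port, model field, and the abbreviated few-shot content are assumptions for illustration:

```python
import requests

# Assumes an sglang server launched from the image above and listening on :30000.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",  # placeholder; use the served model name
        "messages": [
            # 5-shot gsm8k: five worked Q/A pairs followed by the question
            # under test (heavily abbreviated here).
            {"role": "user", "content": "Q: Natalia sold clips to 48 of her friends ..."},
            {"role": "assistant", "content": "A: ... The answer is 72."},
            {"role": "user", "content": "Q: <question under test>"},
        ],
        "max_tokens": 512,
        "temperature": 0,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```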
Test plan
Related
Signed-off-by: Cheng Wan <wan4ch@gmail.com>