Fix DSv4 attention backend for EAGLE per-step draft #24750

Open
ch-wan wants to merge 1 commit into sgl-project:main from ch-wan:cwan/fix-dsv4-eagle-draft

Conversation

@ch-wan (Collaborator) commented May 9, 2026

Summary

The DeepseekV4MultiStepDraftBackend (added in #23882) constructs one DeepseekV4AttnBackend per spec step and forwards the same ForwardBatch to each step's init_forward_metadata. For forward_mode == decode, that method passed forward_batch.out_cache_loc straight through to init_forward_metadata_decode, which then asserted out_cache_loc.shape[0] == req_pool_indices.shape[0] == seq_lens.shape[0].
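For context, the per-step dispatch described above has roughly this shape (a sketch; the constructor arguments are illustrative, not the actual sgl-project code):

    # Hypothetical outline of the multi-step draft backend's dispatch.
    class DeepseekV4MultiStepDraftBackend:
        def __init__(self, model_runner, topk, speculative_num_steps):
            # One full attention backend per speculative draft step.
            self.attn_backends = [
                DeepseekV4AttnBackend(model_runner, speculative_step_id=i)
                for i in range(speculative_num_steps)
            ]

        def init_forward_metadata(self, forward_batch):
            for backend in self.attn_backends:
                # Every step receives the same ForwardBatch, including the
                # full [bs * speculative_num_steps] out_cache_loc tensor.
                backend.init_forward_metadata(forward_batch)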

But in the EAGLE draft path, forward_batch.out_cache_loc has shape [bs * speculative_num_steps] (the full spec-decode cache-loc tensor: every step writes one new slot per request), while req_pool_indices and seq_lens keep the unrepeated [bs] shape. With bs = 2 and speculative_num_steps = 3, out_cache_loc therefore holds 6 entries against 2 in the other two tensors, so the assertion fires whenever a draft batch reaches this path with bs >= 2:

AssertionError: req_pool_indices.shape=torch.Size([2])
    seq_lens.shape=torch.Size([2])
    out_cache_loc.shape=torch.Size([6])

In larger configurations (TP=8 / EP=8 / DP=8 multi-node decode) the shapes can coincidentally line up so the assertion does not fire, but the metadata is still misaligned: GSM8K accuracy collapses from ~0.93 to ~0.42 and decoded outputs are visibly malformed (stray Weapon: / Weaponry / Weaponized prefixes).

The fix mirrors the FA3 backend's draft-decode handling in flashattention_backend.py (cache_seqlens_int32 = seqlens_in_batch + (self.speculative_step_id + 1)): when the backend was constructed as a per-step draft backend (self.speculative_num_steps > 0), slice out_cache_loc down to this step's portion and advance seq_lens / max_seq_len by step_id + 1 before calling init_forward_metadata_decode.
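A minimal sketch of that adjustment, written as a standalone helper (the name slice_draft_step is hypothetical and the step-major layout of out_cache_loc is an assumption; the actual patch inlines this logic in DeepseekV4AttnBackend.init_forward_metadata):

    # Hypothetical helper mirroring the FA3 draft-decode handling.
    # Assumes out_cache_loc is laid out step-major as [bs * num_steps].
    import torch

    def slice_draft_step(
        out_cache_loc: torch.Tensor,  # [bs * num_steps], slots for all steps
        seq_lens: torch.Tensor,       # [bs], lengths before any draft step
        step_id: int,                 # which draft step this backend serves
        num_steps: int,               # speculative_num_steps
    ) -> tuple[torch.Tensor, torch.Tensor, int]:
        bs = seq_lens.shape[0]
        assert out_cache_loc.shape[0] == bs * num_steps
        # Keep only this step's slots so the decode-metadata assertion
        # (out_cache_loc vs req_pool_indices vs seq_lens) holds again.
        step_loc = out_cache_loc[step_id * bs : (step_id + 1) * bs]
        # Each completed draft step has appended one token per request.
        stepped_lens = seq_lens + (step_id + 1)
        return step_loc, stepped_lens, int(stepped_lens.max().item())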

Closes #24747.

Reproducer

lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034 (built from main 9ee83034):

python3 -m sglang.launch_server \
  --model-path /model \
  --tp 4 --port 30001 \
  --trust-remote-code \
  --tool-call-parser deepseekv4 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85 \
  --max-running-requests 16 \
  --context-length 8192 \
  --moe-runner-backend flashinfer_mxfp4 \
  --disable-flashinfer-autotune \
  --disable-cuda-graph \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

Send any 5-shot gsm8k chat-completion request and the server crashes with the assertion above. With this patch, the server completes normally and returns the correct answer (verified on the same prompt that the GB300 disagg-decode CI was getting wrong).
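For reference, a request of that shape is enough to trigger the crash (a sketch against the server's OpenAI-compatible endpoint; the model name and the few-shot prompt body are placeholders, not the exact CI prompt):

    # Hypothetical reproduction request; model name and prompt are placeholders.
    import openai

    client = openai.OpenAI(base_url="http://localhost:30001/v1", api_key="EMPTY")
    prompt = (
        "...five gsm8k few-shot examples...\n"
        "Q: Janet's ducks lay 16 eggs per day. ... "
        "How much does she make every day?\nA:"
    )
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    print(resp.choices[0].message.content)  # expected to end with 18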

Test plan

  • Standalone 5-shot gsm8k (Janet ducks, target=18) before the patch: scheduler crash with the assertion above.
  • Standalone 5-shot gsm8k after the patch: returns the correct answer 18 with clean thinking content.
  • Multi-node disagg-decode CI sweep (TP=8 EP=8 DP=8) for em_strict ≥ ~0.93: running.
  • Single-step / single-batch (bs=1) regression: the assertion already passed at bs=1, so behaviour should be unchanged. Worth a unit test for the no-spec path; a sketch follows this list.
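A minimal sketch of that unit test, reusing the hypothetical slice_draft_step helper from the Summary section:

    # Hypothetical regression tests for the per-step slicing invariant.
    import torch

    def test_draft_step_shapes_agree():
        bs, num_steps = 2, 3
        seq_lens = torch.tensor([10, 20])
        out_cache_loc = torch.arange(bs * num_steps)  # shape [6], as in the crash
        for step in range(num_steps):
            loc, lens, _ = slice_draft_step(out_cache_loc, seq_lens, step, num_steps)
            # The assertion that used to fire requires these shapes to agree.
            assert loc.shape[0] == lens.shape[0] == bs

    def test_no_spec_path_bs1_unchanged():
        # bs = 1, single step: shapes already agreed before the patch.
        loc, lens, _ = slice_draft_step(torch.arange(1), torch.tensor([10]), 0, 1)
        assert loc.shape[0] == lens.shape[0] == 1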


Signed-off-by: Cheng Wan wan4ch@gmail.com


@ch-wan force-pushed the cwan/fix-dsv4-eagle-draft branch from 4518746 to a7d3504 on May 9, 2026 03:14
@ch-wan (Collaborator, Author) commented May 9, 2026

Heads-up: I posted an update on #24747. This PR holds for the monolithic case (verified gsm8k 0.975 over 200 questions on a TP=4 single-node setup, with the failing 5-shot returning the correct answer end-to-end), but the disaggregated prefill+decode + EAGLE + DSv4 path is still broken with the same patch applied. Same image, same recipe topology, prompt_tokens identical to the working monolithic path (1128), MTP accept rate healthy, yet the model resumes from a previous few-shot assistant turn instead of generating from <|Assistant|><think>. Worth flagging before merging.

    if bucket == _GraphBucket.DECODE_OR_IDLE:
        assert out_cache_loc is not None
        assert len(out_cache_loc.shape) == 1, f"{out_cache_loc.shape=}"
        if self.speculative_num_steps > 0:
Collaborator:

If the bug only happens when CUDA graph is disabled, are these lines unnecessary?


    if forward_batch.forward_mode.is_decode_or_idle():
        out_cache_loc = forward_batch.out_cache_loc
        if self.speculative_num_steps > 0:
Collaborator:

Maybe only do this when CUDA graph is disabled.



Development

Successfully merging this pull request may close these issues.

DSv4 attention backend assertion fails on EAGLE draft path: req_pool_indices vs out_cache_loc shape mismatch (#24747)
