[Spec V1] Split draft-extend phase from EagleDraftInput into new EagleDraftExtendInput #24859

Merged: hnyls2002 merged 24 commits into main from lsyin/spec-pr1 on May 10, 2026
@hnyls2002 (Collaborator) commented May 9, 2026

Summary

  • Split the draft-extend phase out of EagleDraftInput into a new EagleDraftExtendInput dataclass, eliminating the phase-shifting overload where one instance was mutated across draft / draft-extend phases.
  • Covers the V1 path only (eagle_worker.py, multi_layer_eagle_worker.py, frozen_kv_mtp_worker.py). The V2 overlap worker still reuses one instance across phases; aligning it is a follow-up.

Background

  • Pre-PR EagleDraftInput.hidden_states switched between [bs, hidden] (draft) and [total_accepted, hidden] (draft-extend) on the same instance. Workers maintained the invariant locally; attention backends had to special-case by phase.
  • The next_draft_input returned by EagleVerifyInput.verify was misleadingly named — at construction time all 4 fields it carried (hidden_states[accept_index], num_accepted_drafts, num_accepted_tokens, num_accepted_tokens_cpu) were draft-extend data. It only "became" a draft input after forward_draft_extend_after_decode mutated topk_p / topk_index / hidden_states in place.
  • Four transient verify->extend handoff fields (unfinished_accept_tokens, seq_lens_for_draft_extend, seq_lens_for_draft_extend_cpu, req_pool_indices_for_draft_extend) lived on EagleVerifyOutput purely to thread state from verify to prepare_extend_after_decode.
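
To make the shape overload concrete, here is a minimal sketch that only computes the two shapes `hidden_states` could take on the same instance. The helper `hidden_states_shape` and the sample accept counts are hypothetical and not part of sglang; only the `[bs, hidden]` and `[total_accepted, hidden]` shapes come from the description above.

```python
from typing import List, Tuple


def hidden_states_shape(
    phase: str, num_accepted_tokens: List[int], hidden: int
) -> Tuple[int, int]:
    """Return the shape hidden_states had in each phase before this PR.

    Hypothetical helper for illustration only.
    """
    bs = len(num_accepted_tokens)
    if phase == "draft":
        # One row per request.
        return (bs, hidden)
    if phase == "draft_extend":
        # One row per accepted token, summed over the whole batch.
        return (sum(num_accepted_tokens), hidden)
    raise ValueError(f"unknown phase: {phase}")


# Three requests that accepted 2, 1, and 3 tokens in the last verify step.
accepted = [2, 1, 3]
draft_shape = hidden_states_shape("draft", accepted, hidden=8)
extend_shape = hidden_states_shape("draft_extend", accepted, hidden=8)
print(draft_shape, extend_shape)
```

Any consumer that received the pre-PR object had to know the current phase to interpret the first dimension, which is the special-casing the split removes.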

Schema changes (eagle_info.py)

New EagleDraftExtendInput

  • Owns full extend-phase state: per-accept-token hidden_states, per-req accept counts, the 4 ex-handoff fields (input_ids, seq_lens, seq_lens_cpu, req_pool_indices), and kernel outputs (positions, bonus_tokens).
  • prepare_extend_after_decode drops its verify_output arg and reads everything from self; adds assert batch.spec_info is self invariant.
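
The field list and the new invariant can be sketched as a plain dataclass. This is an illustrative stand-in, not the code in eagle_info.py: the class name carries a `Sketch` suffix, every field is typed `Any` instead of a tensor, the `None` defaults are assumptions, and what `prepare_extend_after_decode` writes onto the batch is a placeholder.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class EagleDraftExtendInputSketch:
    """Illustrative stand-in; types and defaults are assumptions."""

    # Per-accepted-token hidden states, shape [total_accepted, hidden].
    hidden_states: Any
    # Per-request accept counts.
    num_accepted_drafts: Any
    num_accepted_tokens: Any
    num_accepted_tokens_cpu: Any
    # The four former verify->extend handoff fields.
    input_ids: Any
    seq_lens: Any
    seq_lens_cpu: Any
    req_pool_indices: Any
    # Kernel outputs, filled in later.
    positions: Optional[Any] = None
    bonus_tokens: Optional[Any] = None

    def prepare_extend_after_decode(self, batch: Any) -> None:
        # Phase-boundary invariant: the caller must install this object first.
        assert batch.spec_info is self
        # Everything is read from self; no verify_output argument is needed.
        # (Copying onto the batch is a placeholder for the real preparation.)
        batch.input_ids = self.input_ids
        batch.seq_lens = self.seq_lens
        batch.req_pool_indices = self.req_pool_indices


# SimpleNamespace stands in for the real batch object.
batch = SimpleNamespace(spec_info=None)
extend_input = EagleDraftExtendInputSketch(
    hidden_states=[[0.1], [0.2], [0.3]],
    num_accepted_drafts=[1, 0],
    num_accepted_tokens=[2, 1],
    num_accepted_tokens_cpu=[2, 1],
    input_ids=[5, 6, 7],
    seq_lens=[10, 4],
    seq_lens_cpu=[10, 4],
    req_pool_indices=[0, 1],
)
batch.spec_info = extend_input  # install first ...
extend_input.prepare_extend_after_decode(batch)  # ... then prepare
```

Calling the method on a batch whose `spec_info` is some other object trips the assertion, which is how a missed install surfaces early.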

Trimmed EagleDraftInput

  • Keeps only true draft-phase fields (topk_p, topk_index, hidden_states[bs, h], bonus_tokens, kv_indptr, kv_indices).
  • Five V2-only fields (future_indices, new_seq_lens, verify_done, num_accepted_drafts, num_accepted_tokens) kept as Optional carve-outs and commented as "V2 overlap worker only" — to be cleaned up after V2 alignment.
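
A rough sketch of the trimmed layout, assuming every field defaults to `None` (the defaults, the `Any` types, and the `Sketch` suffix are assumptions; the field names are the ones listed above):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EagleDraftInputSketch:
    # True draft-phase fields.
    topk_p: Optional[Any] = None
    topk_index: Optional[Any] = None
    hidden_states: Optional[Any] = None  # [bs, hidden]
    bonus_tokens: Optional[Any] = None
    kv_indptr: Optional[Any] = None
    kv_indices: Optional[Any] = None

    # V2 overlap worker only; to be cleaned up after V2 alignment.
    future_indices: Optional[Any] = None
    new_seq_lens: Optional[Any] = None
    verify_done: Optional[Any] = None
    num_accepted_drafts: Optional[Any] = None
    num_accepted_tokens: Optional[Any] = None


V2_ONLY_FIELDS = (
    "future_indices",
    "new_seq_lens",
    "verify_done",
    "num_accepted_drafts",
    "num_accepted_tokens",
)

# A V1-style instance populates only draft-phase fields.
v1_input = EagleDraftInputSketch(
    topk_p=[[0.9]], topk_index=[[7]], hidden_states=[[0.0]]
)
unset = [name for name in V2_ONLY_FIELDS if getattr(v1_input, name) is None]
print(unset)
```

On the V1 path the carve-out fields simply stay unset, so keeping them as `Optional` costs nothing until V2 is aligned.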

EagleVerifyOutput

  • draft_extend_input: EagleDraftExtendInput replaces next_draft_input and absorbs the 4 transient handoff fields.

Worker control flow

  • verify(self, batch) no longer takes spec_info; it reads batch.spec_info, which the caller must install beforehand (mirrors V1 / multi-layer / Frozen).
  • forward_draft_extend_after_decode is now a pure transform: caller installs EagleDraftExtendInput as batch.spec_info, method returns a freshly-built EagleDraftInput for next iter, caller installs that.
  • All-reqs-finished branch installs an empty EagleDraftInput(capture_hidden_mode=LAST) so next iter's merge_batch short-circuits on hidden_states is None (EagleVerifyInput has no merge_batch).
  • Non-cuda-graph extend path replaces self.capture_for_decode(logits_output, forward_batch.spec_info) with inline softmax + fast_topk — equivalent semantics, no longer mutates spec_info.
  • Backup/restore in forward_draft_extend_after_decode no longer touches num_accepted_drafts / num_accepted_tokens (now on the soon-to-be-discarded EagleDraftExtendInput).
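
The install / transform / install handoff described above might look like the following sketch. The two dataclasses are reduced stand-ins, the draft-model forward is replaced by placeholder values, and `SimpleNamespace` stands in for the batch; none of this is the actual worker code.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, List, Optional


@dataclass
class DraftExtendInputSketch:
    hidden_states: List[List[float]]  # [total_accepted, hidden]
    bonus_tokens: List[int]  # one per request


@dataclass
class DraftInputSketch:
    topk_p: Optional[Any] = None
    topk_index: Optional[Any] = None
    hidden_states: Optional[Any] = None  # [bs, hidden]
    bonus_tokens: Optional[Any] = None


def forward_draft_extend_after_decode(batch: Any) -> DraftInputSketch:
    """Pure transform: reads batch.spec_info, returns a fresh draft input."""
    extend_input = batch.spec_info
    assert isinstance(extend_input, DraftExtendInputSketch)
    bs = len(extend_input.bonus_tokens)
    # Placeholder values standing in for the real draft-model forward.
    return DraftInputSketch(
        topk_p=[[1.0] for _ in range(bs)],
        topk_index=[[tok] for tok in extend_input.bonus_tokens],
        hidden_states=[[0.0] for _ in range(bs)],
        bonus_tokens=list(extend_input.bonus_tokens),
    )


batch = SimpleNamespace(spec_info=None)
extend_input = DraftExtendInputSketch(
    hidden_states=[[0.1], [0.2], [0.3]], bonus_tokens=[11, 42]
)

batch.spec_info = extend_input  # 1. caller installs the extend input
next_draft_input = forward_draft_extend_after_decode(batch)  # 2. pure transform
untouched = batch.spec_info is extend_input  # the method did not reinstall
batch.spec_info = next_draft_input  # 3. caller installs the result
```

Because the method never reassigns `batch.spec_info`, the type installed on the batch changes only at the call site, which keeps the phase boundary visible in the worker.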

Type registration & padding

  • Add SpecInputType.EAGLE_DRAFT_EXTEND and SpecInputType.FROZEN_KV_MTP_DRAFT_EXTEND; both in is_draft_input() so _pad_inputs_to_size covers the new phase.
  • forward_batch_info._pad_inputs_to_size switches to getattr(spec_info, ..., None) since the two draft-input types now carry disjoint subsets of topk_p / topk_index / num_accepted_drafts.
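
A simplified sketch of why `getattr(..., None)` is needed once the two draft-input types carry different fields. The enum here is a stand-in: only `EAGLE_DRAFT_EXTEND` and `FROZEN_KV_MTP_DRAFT_EXTEND` are named in this PR, the other members and the placement of `is_draft_input()` on the enum are assumptions, and the padding helper works on plain lists rather than tensors.

```python
from enum import Enum, auto
from types import SimpleNamespace
from typing import Any, List


class SpecInputTypeSketch(Enum):
    EAGLE_DRAFT = auto()  # assumed member name
    EAGLE_DRAFT_EXTEND = auto()
    EAGLE_VERIFY = auto()  # assumed member name
    FROZEN_KV_MTP_DRAFT = auto()  # assumed member name
    FROZEN_KV_MTP_DRAFT_EXTEND = auto()

    def is_draft_input(self) -> bool:
        # Both extend types count as draft inputs so that padding covers them.
        return self in {
            SpecInputTypeSketch.EAGLE_DRAFT,
            SpecInputTypeSketch.EAGLE_DRAFT_EXTEND,
            SpecInputTypeSketch.FROZEN_KV_MTP_DRAFT,
            SpecInputTypeSketch.FROZEN_KV_MTP_DRAFT_EXTEND,
        }


def pad_spec_info_to_size(
    spec_info: Any, spec_type: SpecInputTypeSketch, size: int
) -> None:
    if not spec_type.is_draft_input():
        return
    for name in ("topk_p", "topk_index", "num_accepted_drafts"):
        # A field may be absent on one type or unset on the other.
        value: List[Any] = getattr(spec_info, name, None)
        if value is None:
            continue
        setattr(spec_info, name, value + [0] * (size - len(value)))


draft_info = SimpleNamespace(topk_p=[0.9, 0.8], topk_index=[3, 5])
extend_info = SimpleNamespace(num_accepted_drafts=[1, 2])
verify_info = SimpleNamespace(topk_p=[0.5])

pad_spec_info_to_size(draft_info, SpecInputTypeSketch.EAGLE_DRAFT, 4)
pad_spec_info_to_size(extend_info, SpecInputTypeSketch.EAGLE_DRAFT_EXTEND, 4)
pad_spec_info_to_size(verify_info, SpecInputTypeSketch.EAGLE_VERIFY, 4)
```

Direct attribute access would raise `AttributeError` on whichever type lacks the field; the `getattr` default turns a missing field into a skipped one.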

Frozen-KV MTP mirror

  • New FrozenKVMTPDraftExtendInput(EagleDraftExtendInput) tag-only subclass; _to_frozen_kv_mtp_draft_input renamed to _to_frozen_kv_mtp_draft_extend_input and reflects over EagleDraftExtendInput.fields.
  • select_last_verified_seed drops the num_accepted_tokens is None early-return (always present on the new dataclass).
  • frozen_kv_mtp_worker.forward_draft_extend_after_decode adds idle early-return after stashing an idle FrozenKVMTPDraftInput.
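
The tag-only subclass and the field-reflecting converter can be illustrated as below. This assumes the reflection goes through the standard-library `dataclasses.fields`; the class and function names are `Sketch`-suffixed stand-ins with a reduced field set, not the code in the Frozen-KV MTP worker.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class EagleDraftExtendInputSketch:
    hidden_states: Any = None
    num_accepted_tokens: Any = None
    input_ids: Any = None


@dataclass
class FrozenKVMTPDraftExtendInputSketch(EagleDraftExtendInputSketch):
    """Tag-only subclass: adds no fields, only a distinct type."""


def to_frozen_kv_mtp_draft_extend_input_sketch(
    src: EagleDraftExtendInputSketch,
) -> FrozenKVMTPDraftExtendInputSketch:
    # Copy every base-class field by name, so new fields are picked up
    # automatically when the base dataclass grows.
    kwargs = {
        f.name: getattr(src, f.name) for f in fields(EagleDraftExtendInputSketch)
    }
    return FrozenKVMTPDraftExtendInputSketch(**kwargs)


src = EagleDraftExtendInputSketch(
    hidden_states=[[0.1], [0.2]], num_accepted_tokens=[2], input_ids=[5, 6]
)
frozen = to_frozen_kv_mtp_draft_extend_input_sketch(src)
print(type(frozen).__name__, frozen.num_accepted_tokens)
```

Reflecting over the base class rather than listing fields by hand means the converter cannot silently drop a field added later.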

Looks confusing but is correct

  • filter_batch / merge_batch appear rewritten in the diff but are byte-identical to pre-PR. They moved up inside EagleDraftInput only because prepare_extend_after_decode / generate_attn_arg_prefill got extracted to EagleDraftExtendInput — verified via sha1 on the function range.
  • EagleDraftInput still has num_accepted_drafts / num_accepted_tokens (and 3 other V2 fields) after a "schema split" PR. Looks like a leftover, but it is intentional: the V2 overlap worker still reuses one instance across phases. A comment explicitly tags them "V2 overlap worker only"; they will be cleaned up after V2 alignment.
  • bonus_tokens exists on both EagleDraftInput and EagleDraftExtendInput. Not a duplicate. The kernel writes it on the extend-input; the worker copies it onto the next-iter draft-input where the next draft forward consumes it. Two roles, two homes.
  • All-reqs-finished branch installs an empty EagleDraftInput(capture_hidden_mode=LAST) instead of leaving batch.spec_info as the now-stale EagleVerifyInput. Looks like a no-op assignment, but it is required so the scheduler's next-iter merge_batch finds an EagleDraftInput (which has merge_batch) instead of an EagleVerifyInput (which doesn't), and short-circuits on hidden_states is None.
  • Non-cuda-graph extend path drops self.capture_for_decode(...) and inlines softmax + fast_topk. Looks like a behavior change — it's not. capture_for_decode body is exactly those two lines plus an in-place assignment to spec_info; inlining is equivalent and avoids mutating the soon-to-be-discarded EagleDraftExtendInput.
  • forward_draft_extend_after_decode returns EagleDraftInput in EAGLE / multi-layer-EAGLE workers but returns None in frozen_kv_mtp_worker. Asymmetric on purpose: Frozen's _run_assistant_seed_step already installs a fresh FrozenKVMTPDraftInput onto batch.spec_info internally, so there is nothing for the caller to reinstall.
  • EagleDraftExtendInput.is_draft_input() returns True despite the name "draft-extend". Reused on purpose — _pad_inputs_to_size keys off is_draft_input() to decide whether to pad spec-info tensors, and the new extend phase needs the same padding treatment.
  • prepare_extend_after_decode adds assert batch.spec_info is self. Looks like a defensive paranoia check; it's actually a phase-boundary invariant — the method now reads input_ids / seq_lens / req_pool_indices off self rather than via a verify_output arg, so the caller must have installed self as batch.spec_info first.
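
The equivalence claimed for the inlined softmax + top-k can be checked with a pure-Python sketch. The `softmax` and `topk` helpers below are simple stand-ins for the real tensor ops (including `fast_topk`), `capture_for_decode_sketch` only mimics the "compute, then assign in place" shape described above, and the `k` parameter is an assumption.

```python
import math
from types import SimpleNamespace
from typing import Any, List, Tuple


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def topk(probs: List[float], k: int) -> Tuple[List[float], List[int]]:
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] for i in order], order


def capture_for_decode_sketch(
    logits: List[List[float]], spec_info: Any, k: int
) -> None:
    """Pre-PR shape: compute, then mutate spec_info in place."""
    pairs = [topk(softmax(row), k) for row in logits]
    spec_info.topk_p = [p for p, _ in pairs]
    spec_info.topk_index = [i for _, i in pairs]


def inline_softmax_topk(logits: List[List[float]], k: int):
    """Post-PR shape: same two steps, results returned instead of stored."""
    pairs = [topk(softmax(row), k) for row in logits]
    return [p for p, _ in pairs], [i for _, i in pairs]


logits = [[1.0, 2.0, 3.0], [5.0, 1.0, 2.0]]

mutated = SimpleNamespace()
capture_for_decode_sketch(logits, mutated, k=2)

extend_info = SimpleNamespace()  # stays untouched on the inline path
topk_p, topk_index = inline_softmax_topk(logits, k=2)
```

Both paths produce the same values; the only difference is that the inline version leaves the extend-phase object alone, which matters because that object is about to be discarded.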

Test plan

  • CI runs all V1 EAGLE / Multi-layer EAGLE / Frozen KV MTP suites
  • Retraction tests under EAGLE3 (TestStreamingSessionEagleRetractLargePage) — covered by stage-b
  • DP-attention forced-extend path (kept under enable_dp_attention or input_ids.shape[0] > 0)

@hnyls2002 hnyls2002 changed the base branch from main to lsyin/spec-drop-dead-code May 9, 2026 22:43
Base automatically changed from lsyin/spec-drop-dead-code to main May 9, 2026 22:53
@hnyls2002 hnyls2002 changed the title from "[Spec V1] Introduce EagleDraftExtendInput; split draft-extend phase from EagleDraftInput" to "[Spec V1] Split draft-extend phase from EagleDraftInput into new EagleDraftExtendInput" May 10, 2026
@hnyls2002 hnyls2002 merged commit d087442 into main May 10, 2026
124 of 134 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/spec-pr1 branch May 10, 2026 08:07
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py