[Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` #24859
Merged
ltcs11 added a commit to ltcs11/sglang that referenced this pull request on May 11, 2026:
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...
# Conflicts:
#   python/sglang/srt/utils/common.py
Summary
- Split the draft-extend phase out of `EagleDraftInput` into a new `EagleDraftExtendInput` dataclass, eliminating the phase-shifting overload where one instance was mutated across the draft and draft-extend phases.
- Workers covered: `eagle_worker.py`, `multi_layer_eagle_worker.py`, `frozen_kv_mtp_worker.py`. The V2 overlap worker still reuses one instance across phases; aligning it is a follow-up.

Background
- `EagleDraftInput.hidden_states` switched between `[bs, hidden]` (draft) and `[total_accepted, hidden]` (draft-extend) on the same instance. Workers maintained the invariant locally; attention backends had to special-case by phase.
- `next_draft_input`, returned by `EagleVerifyInput.verify`, was misleadingly named: at construction time all 4 fields it carried (`hidden_states[accept_index]`, `num_accepted_drafts`, `num_accepted_tokens`, `num_accepted_tokens_cpu`) were draft-extend data. It only "became" a draft input after `forward_draft_extend_after_decode` mutated `topk_p` / `topk_index` / `hidden_states` in place.
- Four transient handoff fields (`unfinished_accept_tokens`, `seq_lens_for_draft_extend`, `seq_lens_for_draft_extend_cpu`, `req_pool_indices_for_draft_extend`) lived on `EagleVerifyOutput` purely to thread state from `verify` to `prepare_extend_after_decode`.

Schema changes (`eagle_info.py`)

New `EagleDraftExtendInput`
- Carries `hidden_states`, per-request accept counts, the 4 ex-handoff fields (`input_ids`, `seq_lens`, `seq_lens_cpu`, `req_pool_indices`), and kernel outputs (`positions`, `bonus_tokens`).
- `prepare_extend_after_decode` drops its `verify_output` arg and reads everything from `self`; it adds an `assert batch.spec_info is self` invariant.

Trimmed `EagleDraftInput`
- Retains the draft-phase fields (`topk_p`, `topk_index`, `hidden_states[bs, h]`, `bonus_tokens`, `kv_indptr`, `kv_indices`).
- Five V2 fields (`future_indices`, `new_seq_lens`, `verify_done`, `num_accepted_drafts`, `num_accepted_tokens`) are kept as `Optional` carve-outs and commented as "V2 overlap worker only"; they are to be cleaned up after V2 alignment.

`EagleVerifyOutput`
- `draft_extend_input: EagleDraftExtendInput` replaces `next_draft_input` and absorbs the 4 transient handoff fields.

Worker control flow
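The worker changes build on the schema split in `eagle_info.py`, so a rough sketch of that split may help. This is not the PR's code: the `*Sketch` classes and the `check_shapes` helper are hypothetical stand-ins, plain Python lists replace tensors, only a subset of the fields is shown, and `total_accepted` is assumed to equal the sum of the per-request accept counts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraftInputSketch:
    """Draft phase: one row per request (hypothetical stand-in)."""
    topk_p: Optional[List[List[float]]] = None
    topk_index: Optional[List[List[int]]] = None
    hidden_states: Optional[List[List[float]]] = None  # [bs, hidden]
    bonus_tokens: Optional[List[int]] = None


@dataclass
class DraftExtendInputSketch:
    """Draft-extend phase: one row per accepted token (hypothetical stand-in)."""
    hidden_states: List[List[float]]  # [total_accepted, hidden]
    num_accepted_tokens: List[int]    # per-request accept counts
    input_ids: List[int]
    seq_lens: List[int]
    req_pool_indices: List[int]


@dataclass
class VerifyOutputSketch:
    # One typed field stands in for next_draft_input plus the handoff fields.
    draft_extend_input: DraftExtendInputSketch


def check_shapes(x: DraftExtendInputSketch) -> bool:
    # Assumption: total_accepted == sum of per-request accept counts.
    return len(x.hidden_states) == sum(x.num_accepted_tokens)
```

With two separate types, each phase's `hidden_states` shape is fixed by the class rather than by which phase a shared instance happens to be in.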
- `verify(self, batch)` no longer takes `spec_info`; it reads from `batch.spec_info` after the caller installs it (mirrors V1 / multi-layer / Frozen).
- `forward_draft_extend_after_decode` is now a pure transform: the caller installs `EagleDraftExtendInput` as `batch.spec_info`, the method returns a freshly built `EagleDraftInput` for the next iteration, and the caller installs that.
- `batch.spec_info` is set to `EagleDraftInput(capture_hidden_mode=LAST)` so that the next iteration's `merge_batch` short-circuits on `hidden_states is None` (`EagleVerifyInput` has no `merge_batch`).
- Replaces `self.capture_for_decode(logits_output, forward_batch.spec_info)` with inline `softmax` + `fast_topk`: equivalent semantics, and `spec_info` is no longer mutated.
- `forward_draft_extend_after_decode` no longer touches `num_accepted_drafts` / `num_accepted_tokens` (now on the soon-to-be-discarded `EagleDraftExtendInput`).

Type registration & padding
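For the `forward_draft_extend_after_decode` change under Worker control flow, the pure-transform shape can be sketched roughly as follows. This is an illustration only: `DraftInputSketch`, `softmax`, `top_k`, and `build_next_draft_input` are hypothetical names, and pure-Python list operations stand in for the tensor `softmax` and `fast_topk` calls the description mentions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DraftInputSketch:
    topk_p: List[List[float]]
    topk_index: List[List[int]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: List[float], k: int) -> Tuple[List[float], List[int]]:
    # Indices of the k largest probabilities, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] for i in order], order


def build_next_draft_input(logits: List[List[float]], k: int) -> DraftInputSketch:
    """Pure transform: builds a new object and mutates none of its inputs."""
    topk_p, topk_index = [], []
    for row in logits:
        p, idx = top_k(softmax(row), k)
        topk_p.append(p)
        topk_index.append(idx)
    return DraftInputSketch(topk_p=topk_p, topk_index=topk_index)
```

Returning a fresh object, rather than writing `topk_p` / `topk_index` back onto the incoming spec-info, is what lets the caller discard the extend-input once the next draft-input is installed.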
- Adds `SpecInputType.EAGLE_DRAFT_EXTEND` and `SpecInputType.FROZEN_KV_MTP_DRAFT_EXTEND`; both are included in `is_draft_input()` so that `_pad_inputs_to_size` covers the new phase.
- `forward_batch_info._pad_inputs_to_size` switches to `getattr(spec_info, ..., None)` since the two draft-input types now carry disjoint subsets of `topk_p` / `topk_index` / `num_accepted_drafts`.

Frozen-KV MTP mirror
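For the padding change in `_pad_inputs_to_size`, the `getattr(..., None)` pattern can be illustrated with a small sketch. The classes, the `PADDED_FIELDS` tuple, and `pad_inputs_to_size` below are hypothetical simplifications with lists in place of tensors; the real function's padding logic is not shown in the description and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DraftInputSketch:
    topk_p: List[float]
    topk_index: List[int]

    def is_draft_input(self) -> bool:
        return True


@dataclass
class DraftExtendInputSketch:
    num_accepted_drafts: List[int]

    def is_draft_input(self) -> bool:
        # The extend phase opts into the same padding treatment.
        return True


PADDED_FIELDS = ("topk_p", "topk_index", "num_accepted_drafts")


def pad_inputs_to_size(spec_info, size: int, fill=0) -> None:
    if not spec_info.is_draft_input():
        return
    for name in PADDED_FIELDS:
        # getattr with a default tolerates fields this type does not carry.
        value = getattr(spec_info, name, None)
        if value is None:
            continue
        value.extend([fill] * (size - len(value)))
```

Because each type carries only some of the padded fields, direct attribute access would raise `AttributeError` on whichever fields are absent; the defaulted `getattr` skips them instead.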
- `FrozenKVMTPDraftExtendInput(EagleDraftExtendInput)` tag-only subclass; `_to_frozen_kv_mtp_draft_input` renamed to `_to_frozen_kv_mtp_draft_extend_input` and reflects over `EagleDraftExtendInput.fields`.
- `select_last_verified_seed` drops the `num_accepted_tokens is None` early return (always present on the new dataclass).
- `frozen_kv_mtp_worker.forward_draft_extend_after_decode` adds an idle early return after stashing an idle `FrozenKVMTPDraftInput`.

Looks confusing but is correct
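For the Frozen-KV MTP mirror, a tag-only subclass plus a field-reflecting converter might look like the sketch below. The class and function names are hypothetical, and using `dataclasses.fields` for the reflection is an assumption about how the real `_to_frozen_kv_mtp_draft_extend_input` works.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DraftExtendInputSketch:
    hidden_states: List[List[float]]
    num_accepted_tokens: List[int]


@dataclass
class FrozenDraftExtendInputSketch(DraftExtendInputSketch):
    """Tag-only subclass: no new fields, only a distinct type."""


def to_frozen(src: DraftExtendInputSketch) -> FrozenDraftExtendInputSketch:
    # Copy every declared field by name, so fields added to the base
    # class later are picked up without editing this function.
    kwargs = {f.name: getattr(src, f.name) for f in fields(DraftExtendInputSketch)}
    return FrozenDraftExtendInputSketch(**kwargs)
```

Reflecting over the base class's fields keeps the converter in sync with the schema, at the cost of copying references rather than cloning the underlying data.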
- `filter_batch` / `merge_batch` appear rewritten in the diff but are byte-identical to pre-PR. They moved up inside `EagleDraftInput` only because `prepare_extend_after_decode` / `generate_attn_arg_prefill` got extracted to `EagleDraftExtendInput`; verified via `sha1` on the function range.
- `EagleDraftInput` still has `num_accepted_drafts` / `num_accepted_tokens` (and 3 other V2 fields) after a "schema split" PR. Looks like a leftover, but it is intentional: the V2 overlap worker still reuses one instance across phases. The comment explicitly tags them "V2 overlap worker only"; they will be cleaned up after V2 alignment.
- `bonus_tokens` exists on both `EagleDraftInput` and `EagleDraftExtendInput`. Not a duplicate. The kernel writes it on the extend-input; the worker copies it onto the next-iter draft-input, where the next draft forward consumes it. Two roles, two homes.
- `batch.spec_info` is set to `EagleDraftInput(capture_hidden_mode=LAST)` instead of being left as the now-stale `EagleVerifyInput`. Looks like a no-op assignment, but it is required so the scheduler's next-iter `merge_batch` finds an `EagleDraftInput` (which has `merge_batch`) instead of an `EagleVerifyInput` (which doesn't), and short-circuits on `hidden_states is None`.
- `self.capture_for_decode(...)` is replaced by inlined `softmax` + `fast_topk`. Looks like a behavior change; it's not. The `capture_for_decode` body is exactly those two lines plus an in-place assignment to `spec_info`; inlining is equivalent and avoids mutating the soon-to-be-discarded `EagleDraftExtendInput`.
- `forward_draft_extend_after_decode` returns `EagleDraftInput` in the EAGLE / multi-layer-EAGLE workers but returns `None` in `frozen_kv_mtp_worker`. Asymmetric on purpose: Frozen's `_run_assistant_seed_step` already installs a fresh `FrozenKVMTPDraftInput` onto `batch.spec_info` internally, so there is nothing for the caller to reinstall.
- `EagleDraftExtendInput.is_draft_input()` returns `True` despite the name "draft-extend". Reused on purpose: `_pad_inputs_to_size` keys off `is_draft_input()` to decide whether to pad spec-info tensors, and the new extend phase needs the same padding treatment.
- `prepare_extend_after_decode` adds `assert batch.spec_info is self`. Looks like a defensive paranoia check; it's actually a phase-boundary invariant: the method now reads `input_ids` / `seq_lens` / `req_pool_indices` off `self` rather than via a `verify_output` arg, so the caller must have installed `self` as `batch.spec_info` first.

Test plan
- … `TestStreamingSessionEagleRetractLargePage`), covered by stage-b
- … `enable_dp_attention or input_ids.shape[0] > 0`)
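The phase-boundary invariant on `prepare_extend_after_decode` (`assert batch.spec_info is self`) can be illustrated with a minimal sketch. `BatchSketch`, `DraftExtendInputSketch`, and `run_extend` are hypothetical, and what the method does with the batch here (copying two fields onto it) is an assumption made only to give the assertion something to protect.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchSketch:
    spec_info: Optional[object] = None
    input_ids: List[int] = field(default_factory=list)
    seq_lens: List[int] = field(default_factory=list)


@dataclass
class DraftExtendInputSketch:
    input_ids: List[int]
    seq_lens: List[int]

    def prepare_extend_after_decode(self, batch: BatchSketch) -> None:
        # Data is read from self, so self must already be the batch's
        # installed spec_info when this runs.
        assert batch.spec_info is self, "install self as batch.spec_info first"
        batch.input_ids = list(self.input_ids)
        batch.seq_lens = list(self.seq_lens)


def run_extend(batch: BatchSketch, extend_input: DraftExtendInputSketch) -> BatchSketch:
    # Caller-side ordering: install first, then prepare.
    batch.spec_info = extend_input
    extend_input.prepare_extend_after_decode(batch)
    return batch
```

An identity check (`is`) rather than an equality check catches the case where a stale object of a compatible shape is still installed on the batch.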