feat(spec-v2): Frozen-KV MTP overlap-scheduling worker (experimental opt-in) + V1 None-safety fixes #23
Draft
pyc96 wants to merge 2 commits into
… + V1 None-safety fixes
Wires the spec-V2 (overlap-scheduling) infrastructure for FROZEN_KV_MTP
behind an experimental env-var opt-in (SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1).
Default behavior unchanged: V1 worker + disable_overlap_schedule=True.
Why -- The MTP gap audit traced the SGLang-vs-vLLM MTP gap (-50% tok/s
on both 26B and 31B) to the fact that FROZEN_KV_MTP forces
disable_overlap_schedule=True because FrozenKVMTPWorkerV2 was a
NotImplementedError stub. Each scheduler step blocks on draft + verify
+ seed (= num_steps + 2 = 5 GPU forwards) with zero overlap with the
next iter's CPU prep. vLLM's async scheduler overlaps these and gets
~2x tok/s at the same nominal max-running-requests.
What this PR ships:
1. FrozenKVMTPWorkerV2 (frozen_kv_mtp_worker_v2.py, +250 lines)
* Inherits FrozenKVMTPWorker (reuses every correctness-tested
method: draft/verify/seed/forward_target_extend/init_cuda_graphs).
* Implements BaseSpecWorker (target_worker / draft_worker /
clear_cache_pool properties + setter for the V1 ctor's
self.target_worker = ... assignment).
* Adds draft_runner alias for kv_cache_builder.get_draft_kv_pool's
EAGLE-V2-style accessor.
* Adds forward_batch_generation(batch, on_publish=None) that fires
the on_publish callback at the verify-end fence point (so the
scheduler's future_map publishes new seq_lens BEFORE the seed
blocks the forward stream).
* Ferries next_draft_input across iterations (synthesizes an idle
FrozenKVMTPDraftInput when all reqs finish so future_map.stash
doesn't crash on uninitialized topk_p/topk_index).
2. None-safety fixes in V1 (frozen_kv_mtp_worker.py + eagle_info.py)
* V1 draft() and verify() now handle batch.sampling_info=None and
sampling_info.penalizer_orchestrator=None correctly. Required
for V2 overlap path where sampling is deferred to schedule stream.
Backwards-compatible: V1 keeps working identically when
orchestrator is present.
3. spec_info.py dispatcher
* supports_spec_v2() now returns True for FROZEN_KV_MTP (was False).
* create_worker() dispatches to V2 when enable_overlap=True.
4. arg_groups/speculative_hook.py
* disable_overlap_schedule=True is now conditional on the env var
SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2 NOT being set. Default
behavior preserved (V1 + overlap disabled); experimental V2
opt-in available for further development.
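Items 3 and 4 above amount to a small piece of gating logic. A minimal sketch, assuming the opt-in is detected by comparing the variable to "1" (the commit message only distinguishes set from not set); `v2_opted_in`, `resolve_overlap`, and `pick_worker` are hypothetical names used for illustration, not SGLang functions:

```python
import os

ENV_FLAG = "SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2"


def v2_opted_in(environ=None):
    """True only when the experimental flag equals "1" (assumed check)."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_FLAG) == "1"


def resolve_overlap(disable_overlap_schedule, environ=None):
    """Return the effective disable_overlap_schedule for FROZEN_KV_MTP."""
    if v2_opted_in(environ):
        # Experimental path: keep whatever the user asked for.
        return disable_overlap_schedule
    # Default path: V1 worker, overlap scheduling forced off.
    return True


def pick_worker(enable_overlap):
    """Hypothetical stand-in for create_worker(); returns a class name."""
    return "FrozenKVMTPWorkerV2" if enable_overlap else "FrozenKVMTPWorker"


# Default environment: overlap stays disabled and V1 is selected.
default_disable = resolve_overlap(False, environ={})
print(default_disable, pick_worker(not default_disable))

# Opted in: the user's setting survives and V2 is selected.
opted_disable = resolve_overlap(False, environ={ENV_FLAG: "1"})
print(opted_disable, pick_worker(not opted_disable))
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment.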
Known limitations of the V2 worker (why it's experimental):
* The synthesized idle FrozenKVMTPDraftInput at end-of-decode hits a
scheduler-side stash assertion (output_tokens_buf shape mismatch
for empty bonus_tokens). Need to also synthesize empty bonus_tokens
with the right shape (B == future_indices.indices.shape[0]).
* The per-step draft loop still runs as one block; full overlap
potential requires composition pattern (a la EAGLEWorkerV2's
BaseDraftWorker split) rather than inheritance.
* CUDA graph capture on the V2 path is not verified.
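The shape requirement in the first limitation can be illustrated with plain lists. This is only a sketch: `IdleDraftPayload`, `make_idle_payload`, and `check_stash_contract` are hypothetical names, nested lists stand in for tensors, and the field shapes ([B] bonus_tokens, [B, topk] topk_p/topk_index, [B, H] hidden_states) are taken from the PR description rather than from the code.

```python
from dataclasses import dataclass


@dataclass
class IdleDraftPayload:
    """Hypothetical stand-in for an idle FrozenKVMTPDraftInput."""
    bonus_tokens: list   # shape [B]
    topk_p: list         # shape [B, topk]
    topk_index: list     # shape [B, topk]
    hidden_states: list  # shape [B, H]


def make_idle_payload(batch_size, topk, hidden_size):
    """Build zero-filled fields whose leading dimension is batch_size."""
    return IdleDraftPayload(
        bonus_tokens=[0] * batch_size,
        topk_p=[[0.0] * topk for _ in range(batch_size)],
        topk_index=[[0] * topk for _ in range(batch_size)],
        hidden_states=[[0.0] * hidden_size for _ in range(batch_size)],
    )


def check_stash_contract(payload, num_future_indices):
    """Every field needs one row per future index (B == num_future_indices)."""
    fields = (payload.bonus_tokens, payload.topk_p,
              payload.topk_index, payload.hidden_states)
    return all(len(field) == num_future_indices for field in fields)


payload = make_idle_payload(batch_size=2, topk=1, hidden_size=4)
print(check_stash_contract(payload, num_future_indices=2))
```

A payload built for a different batch size than the stash expects fails the check, which is the kind of mismatch the limitation describes.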
The audit + design (full plan + invariant table + porting matrix +
implementation skeleton) is preserved at:
agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md
(2400 lines, full EAGLE V2 reference + frozen-KV-specific invariants)
V1 path verified intact (smoke test + 5/5 parity) after this PR.
Stack base: pyc/feat-gemma4-ultimate-v2 (PR #21).
Recommendation: V2 enablement gated behind env var until follow-up
work lands the proper composition-based worker.
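The None-safety fix in item 2 of the shipped list can be pictured as a guard of the following shape. This is a sketch only: `apply_penalties` and the orchestrator's `apply` method are hypothetical stand-ins, and the actual change lives inside V1's draft() and verify().

```python
from types import SimpleNamespace


def apply_penalties(batch, logits):
    """Apply penalties only when an orchestrator is attached to the batch."""
    sampling_info = getattr(batch, "sampling_info", None)
    # getattr on None falls through to the default, so both
    # sampling_info=None and penalizer_orchestrator=None end up here as None.
    orchestrator = getattr(sampling_info, "penalizer_orchestrator", None)
    if orchestrator is None:
        # Overlap path: sampling is deferred, nothing to apply in the worker.
        return logits
    return orchestrator.apply(logits)


# V2-style batch: no sampling info at all.
print(apply_penalties(SimpleNamespace(sampling_info=None), [1.0, 2.0]))

# V1-style batch: an orchestrator is present and is still used.
fake_orchestrator = SimpleNamespace(apply=lambda xs: [x - 1.0 for x in xs])
v1_batch = SimpleNamespace(
    sampling_info=SimpleNamespace(penalizer_orchestrator=fake_orchestrator)
)
print(apply_penalties(v1_batch, [1.0, 2.0]))
```

The V1 branch is untouched when an orchestrator exists, which matches the backwards-compatibility claim above.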
…eak fix)
Resolves two of the three V2 scheduler-side issues found during testing:
1. ferry size: bs is now batch.req_pool_indices.shape[0] (= future_indices
size), not verify_output.accept_tokens.shape[0] (= flat total).
accept_tokens contains all accepted tokens across reqs, so its shape
was wrong for the per-req bonus_tokens contract the stash kernel expects.
2. per-req bonus_tokens: pick the LAST accepted token per req using
cumsum(num_correct_drafts_per_req_cpu + 1) - 1 to index into the flat
accept_tokens. Mirrors EAGLE V2's fill_bonus_tokens.
3. accept_lens for batch_result: spec-V2 batch_result_processor
(_resolve_spec_overlap_tokens at batch_result_processor.py:488) asserts
result.accept_lens is a CPU tensor. Build it from
num_correct_drafts_per_req_cpu + 1.
After these fixes V2 works end-to-end through draft + verify + seed:
* boots cleanly
* routes prefill -> decode -> seed -> verify -> next iter correctly
* stash payload shapes pass the kernel contract
* request completes and returns the right output to the client
Remaining blocker (NOT fixed in this commit): on_idle invariant_checker
detects a 24-token KV pool leak per finished request. Likely caused by the
V1 verify path's free of rejected draft slots racing with the
schedule-stream tear-down of the batch under overlap. Documented as a known
limitation; V2 stays env-gated experimental until the leak is fixed
(requires architectural understanding of pool-lifecycle ownership across
overlap stream boundaries).
V1 path unchanged (still default; opt into V2 via
SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1).
Stack base: pyc/feat-gemma4-mtp-spec-v2
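The indexing behind fixes 1 to 3 can be sketched in plain Python. This is an illustration rather than the PR's code: the real path works on tensors, and the helper name `last_accepted_per_req` and the sample values are hypothetical. It assumes accept_tokens is the flat concatenation of each request's accepted tokens, with num_correct_drafts + 1 entries per request.

```python
from itertools import accumulate


def last_accepted_per_req(accept_tokens, num_correct_drafts_per_req):
    """Return (bonus_tokens, accept_lens) for a flat accept_tokens list.

    Assumption: request i contributes num_correct_drafts_per_req[i] + 1
    consecutive entries (its accepted drafts plus one bonus token).
    """
    # accept_lens[i] = accepted drafts + 1 for request i.
    accept_lens = [n + 1 for n in num_correct_drafts_per_req]
    assert sum(accept_lens) == len(accept_tokens), "flat length mismatch"
    # Inclusive cumulative sum minus 1 is the flat index of each
    # request's last accepted token.
    last_idx = [end - 1 for end in accumulate(accept_lens)]
    bonus_tokens = [accept_tokens[i] for i in last_idx]
    return bonus_tokens, accept_lens


# Hypothetical batch of 3 requests with 2, 0, and 1 accepted drafts.
flat = [11, 12, 13, 21, 31, 32]
bonus, lens = last_accepted_per_req(flat, [2, 0, 1])
print(bonus, lens)  # [13, 21, 32] [3, 1, 2]
```

The result has one entry per request (3 here) while the flat list has 6, which is the size mismatch that fix 1 addresses.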
Summary
Stacked on PR #21 (ULT v2). Adds FrozenKVMTPWorkerV2 spec-V2 infrastructure for FROZEN_KV_MTP behind an experimental env-var opt-in (SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1). Default behavior unchanged: V1 worker + disable_overlap_schedule=True.
State of the V2 implementation
* Inherited V1 methods (draft, verify, _run_assistant_seed_step, etc.)
* BaseSpecWorker interface (target_worker, draft_worker, clear_cache_pool)
* forward_batch_generation(batch, on_publish=None) with verify-end fence
* next_draft_input with correct shape ([B] bonus_tokens, [B, topk] topk_p/index, [B, H] hidden_states)
* accept_lens for batch result processor
* on_idle invariant_checker passes (KV pool leak check): not yet, see the known limitation
Bugs fixed during V2 development
* V1 draft() and verify() None-safety on batch.sampling_info.penalizer_orchestrator (real fix benefiting V1 too). Required because V2 defers sampling to the schedule stream; the worker side gets penalizer_orchestrator=None.
* V2 ferry size: was using verify_output.accept_tokens.shape[0] (= flat total across all reqs); now uses batch.req_pool_indices.shape[0] (= per-req batch size).
* V2 per-req bonus token: was passing the flat accept_tokens; now picks the last accepted token per req via cumsum(num_correct_drafts_per_req_cpu + 1) - 1, mirroring EAGLE V2's fill_bonus_tokens kernel.
* V2 accept_lens in batch result: _resolve_spec_overlap_tokens at batch_result_processor.py:488 asserts result.accept_lens.is_cpu. Now populated.
Known limitation (NOT fixed)
The on_idle invariant_checker detects a 24-token KV pool leak per finished request: 24 ~= 20 (max_tokens) + 4 (speculative_num_draft_tokens), suggesting the V1 verify path's free of rejected draft slots races with the schedule-stream tear-down of the batch under overlap.
Fixing this requires architectural understanding of SGLang's pool-lifecycle ownership across overlap stream boundaries. The audit identifies a next-step direction (agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md section 14.1 "The target KV pool swap must NOT race with overlap"). The proper fix is likely to either:
What this PR ships (value)
* V1 None-safety fixes: frozen_kv_mtp_worker.py + eagle_info.py
* FrozenKVMTPWorkerV2 with correct ferry contract
* Replaces the NotImplementedError stub; next implementer can focus exclusively on the KV-leak fix
* Audit: agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md
Reproducer
Production recommendation
Use SGLang no-MTP (per PR #21). The MTP gap is structural and this PR does not close it yet.
Stack base
pyc/feat-gemma4-ultimate-v2 (PR #21)
Files
* python/sglang/srt/speculative/frozen_kv_mtp_worker_v2.py (+363 lines vs original stub)
* python/sglang/srt/speculative/frozen_kv_mtp_worker.py (+8/-3 None-safety)
* python/sglang/srt/speculative/eagle_info.py (+11/-4 None-safety)
* python/sglang/srt/speculative/spec_info.py (+13/-3 dispatcher)
* python/sglang/srt/arg_groups/speculative_hook.py (+18/-7 opt-in)
* agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md (2409-line audit)