feat(spec-v2): Frozen-KV MTP overlap-scheduling worker (experimental opt-in) + V1 None-safety fixes by pyc96 · Pull Request #23 · pyc96/sglang

pyc96 · 2026-05-26T04:12:12Z

Summary

Stacked on PR #21 (ULT v2). Adds FrozenKVMTPWorkerV2 spec-V2 infrastructure for FROZEN_KV_MTP behind an experimental env-var opt-in (SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1). Default behavior unchanged: V1 worker + disable_overlap_schedule=True.

State of the V2 implementation

Component	Status
Boots cleanly with overlap enabled	OK
Dispatcher routes to V2 worker	OK
Inherits all V1 correctness-tested methods (`draft`, `verify`, `_run_assistant_seed_step`, etc.)	OK
Implements `BaseSpecWorker` interface (`target_worker`, `draft_worker`, `clear_cache_pool`)	OK
`forward_batch_generation(batch, on_publish=None)` with verify-end fence	OK
Ferries `next_draft_input` with correct shape (`[B]` bonus_tokens, `[B, topk]` topk_p/index, `[B, H]` hidden_states)	OK
Produces `accept_lens` for batch result processor	OK
Routes prefill -> decode -> seed -> verify -> next iter end-to-end	OK
Request completes and returns correct output to client	OK (single request validated)
`on_idle` `invariant_checker` passes (KV pool leak check)	FAIL -- 24-token leak per finished req

Bugs fixed during V2 development

V1 draft() and verify() None-safety on batch.sampling_info.penalizer_orchestrator (real fix benefiting V1 too). Required because V2 defers sampling to the schedule stream; the worker side gets penalizer_orchestrator=None.
V2 ferry size: was using verify_output.accept_tokens.shape[0] (= flat total across all reqs); now uses batch.req_pool_indices.shape[0] (= per-req batch size).
V2 per-req bonus token: was passing the flat accept_tokens; now picks the last accepted token per req via cumsum(num_correct_drafts_per_req_cpu + 1) - 1, mirroring EAGLE V2's fill_bonus_tokens kernel.
V2 accept_lens in batch result: _resolve_spec_overlap_tokens at batch_result_processor.py:488 asserts result.accept_lens.is_cpu. The field is now populated as a CPU tensor built from num_correct_drafts_per_req_cpu + 1.
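
The ferry-size, bonus-token, and accept_lens fixes above share one indexing rule, sketched here with plain Python lists instead of tensors. The function name and the sample token values are made up for illustration; only the cumsum(num_correct_drafts_per_req_cpu + 1) - 1 rule comes from this PR.

```python
from itertools import accumulate


def last_accepted_per_req(accept_tokens, num_correct_drafts_per_req):
    """Return (bonus_tokens, accept_lens) for a flat list of accepted tokens.

    Illustrative helper, not the worker's actual code.
    """
    # Each request contributes its correct drafts plus one token from the target.
    accept_lens = [n + 1 for n in num_correct_drafts_per_req]
    # Inclusive cumulative sum minus 1 is the flat index of each request's last token.
    last_idx = [end - 1 for end in accumulate(accept_lens)]
    bonus_tokens = [accept_tokens[i] for i in last_idx]
    return bonus_tokens, accept_lens


# Three requests with 2, 0, and 1 correct drafts give 3 + 1 + 2 = 6 flat tokens.
flat = [11, 12, 13, 21, 31, 32]
bonus, lens = last_accepted_per_req(flat, [2, 0, 1])
print(bonus)  # [13, 21, 32]
print(lens)   # [3, 1, 2]
```

The example also shows why the old ferry size was wrong: the flat list has 6 entries, while the per-req contract needs B = 3.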

Known limitation (NOT fixed)

The on_idle invariant_checker detects a 24-token KV pool leak per finished request:

ValueError: pool memory leak detected! [full] total=106600, available=106570,
evictable=0, protected=6, session_held=0, uncached=0

24 = 20 (max_tokens) + 4 (speculative_num_draft_tokens), which suggests that the V1 verify path's free of rejected draft slots races with the schedule-stream tear-down of the batch under overlap.
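
One way to recover the 24 from the error message itself is sketched below. It assumes that any slot not reported as available, evictable, protected, or session-held is unaccounted for; that accounting rule is an assumption made for this example, not a statement of how the invariant checker computes its result.

```python
import re

LOG = (
    "pool memory leak detected! [full] total=106600, available=106570, "
    "evictable=0, protected=6, session_held=0, uncached=0"
)


def unaccounted_tokens(message):
    """Subtract every accounted-for category from the pool total."""
    fields = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", message)}
    # Assumed accounting: these four categories cover all legitimately held slots.
    accounted = sum(
        fields[key] for key in ("available", "evictable", "protected", "session_held")
    )
    return fields["total"] - accounted


print(unaccounted_tokens(LOG))  # 24
```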

Fixing this requires architectural understanding of SGLang's pool-lifecycle ownership across overlap stream boundaries. The audit identifies a next-step direction (agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md section 14.1, "The target KV pool swap must NOT race with overlap"). The proper fix is likely to either:

Wrap the verify-side free in a stream-event sync that ensures completion before iter N+1's schedule-side cleanup runs
Or migrate to the full composition-based V2 pattern (BaseDraftWorker split per the audit's section 15 skeleton, ~600 LOC)
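
The first option is an ordering constraint: the free must be complete before the cleanup inspects the pool. The sketch below is a CPU-only analogy using threading.Event, not CUDA code and not SGLang's allocator; ToyPool and both function names are hypothetical. On the GPU side the corresponding primitives would presumably be a recorded torch.cuda.Event and Stream.wait_event.

```python
import threading


class ToyPool:
    """Toy stand-in for a KV slot pool (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.size = size
        self.free_slots = set(range(size))
        self.lock = threading.Lock()

    def alloc(self, n):
        with self.lock:
            return [self.free_slots.pop() for _ in range(n)]

    def free(self, slots):
        with self.lock:
            self.free_slots.update(slots)

    def leaked(self):
        with self.lock:
            return self.size - len(self.free_slots)


pool = ToyPool(64)
rejected = pool.alloc(24)
free_done = threading.Event()


def verify_side():
    pool.free(rejected)
    free_done.set()  # analogous to recording an event right after the free


def schedule_side_cleanup(result):
    free_done.wait()  # analogous to waiting on that event before tear-down
    result.append(pool.leaked())


result = []
cleanup = threading.Thread(target=schedule_side_cleanup, args=(result,))
verify = threading.Thread(target=verify_side)
cleanup.start()  # started first to show that the wait enforces the order
verify.start()
cleanup.join()
verify.join()
print(result[0])  # 0
```

Without the wait, the cleanup thread could observe the pool before the 24 slots are returned, which is the shape of the race suspected above.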

What this PR ships (value)

Item	Value
V1 None-safety in `frozen_kv_mtp_worker.py` + `eagle_info.py`	Real bug fixes for V1 paths under future scenarios where penalizer becomes None
`FrozenKVMTPWorkerV2` with correct ferry contract	Removes the `NotImplementedError` stub; next implementer can focus exclusively on the KV-leak fix
Dispatcher + opt-in plumbing	Once the leak is fixed, flip the default by removing the env-var check
2409-line audit at `agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md`	Comprehensive EAGLE V2 walkthrough + frozen-KV invariants + porting matrix
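
The dispatcher and opt-in plumbing can be pictured roughly as follows. This is a simplified sketch, not the code in spec_info.py or speculative_hook.py: resolve_overlap and pick_worker are hypothetical names, and comparing the env var against "1" is an assumption (the PR only states that the behavior depends on whether the variable is set).

```python
import os

ENV_FLAG = "SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2"


def resolve_overlap(env=None):
    """Return True when overlap scheduling stays enabled (experimental opt-in)."""
    env = os.environ if env is None else env
    # Assumption: only the literal value "1" opts in.
    return env.get(ENV_FLAG) == "1"


def pick_worker(enable_overlap):
    """Mirror the dispatch rule: V2 when overlap is enabled, V1 otherwise."""
    return "FrozenKVMTPWorkerV2" if enable_overlap else "FrozenKVMTPWorker"


print(pick_worker(resolve_overlap({})))               # FrozenKVMTPWorker
print(pick_worker(resolve_overlap({ENV_FLAG: "1"})))  # FrozenKVMTPWorkerV2
```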

Reproducer

SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1 python -m sglang.launch_server \
  --model-path google/gemma-4-31B-it --dtype bfloat16 --trust-remote-code --tp-size 2 \
  --host 0.0.0.0 --port 30000 --attention-backend triton \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-31B-it-assistant \
  --speculative-num-steps 3 --speculative-num-draft-tokens 4 --speculative-eagle-topk 1 \
  --max-running-requests 80 --random-seed 1

# Single request works:
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"google/gemma-4-31B-it","messages":[{"role":"user","content":"What is 2+2?"}],"temperature":0,"max_tokens":15}'

# After the request finishes, the server crashes on on_idle invariant check.
# That is the documented known limitation -- the 24-token KV leak.

Production recommendation

Use SGLang without MTP (per PR #21). The MTP gap is structural and this PR does not close it yet.

Stack base

pyc/feat-gemma4-ultimate-v2 (PR #21)

Files

python/sglang/srt/speculative/frozen_kv_mtp_worker_v2.py (+363 lines vs original stub)
python/sglang/srt/speculative/frozen_kv_mtp_worker.py (+8/-3 None-safety)
python/sglang/srt/speculative/eagle_info.py (+11/-4 None-safety)
python/sglang/srt/speculative/spec_info.py (+13/-3 dispatcher)
python/sglang/srt/arg_groups/speculative_hook.py (+18/-7 opt-in)
agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md (2409-line audit)

… + V1 None-safety fixes

Wires the spec-V2 (overlap-scheduling) infrastructure for FROZEN_KV_MTP behind an experimental env-var opt-in (SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1). Default behavior unchanged: V1 worker + disable_overlap_schedule=True.

Why:

The MTP gap audit traced the SGLang-vs-vLLM MTP gap (-50% tok/s on both 26B and 31B) to the fact that FROZEN_KV_MTP forces disable_overlap_schedule=True because FrozenKVMTPWorkerV2 was a NotImplementedError stub. Each scheduler step blocks on draft + verify + seed (= num_steps + 2 = 5 GPU forwards) with zero overlap with the next iter's CPU prep. vLLM's async scheduler overlaps these and gets ~2x tok/s at the same nominal max-running-requests.

What this PR ships:

1. FrozenKVMTPWorkerV2 (frozen_kv_mtp_worker_v2.py, +250 lines)
   * Inherits FrozenKVMTPWorker (reuses every correctness-tested method: draft/verify/seed/forward_target_extend/init_cuda_graphs).
   * Implements BaseSpecWorker (target_worker / draft_worker / clear_cache_pool properties + setter for the V1 ctor's self.target_worker = ... assignment).
   * Adds draft_runner alias for kv_cache_builder.get_draft_kv_pool's EAGLE-V2-style accessor.
   * Adds forward_batch_generation(batch, on_publish=None) that fires the on_publish callback at the verify-end fence point (so the scheduler's future_map publishes new seq_lens BEFORE the seed blocks the forward stream).
   * Ferries next_draft_input across iterations (synthesizes an idle FrozenKVMTPDraftInput when all reqs finish so future_map.stash doesn't crash on uninitialized topk_p/topk_index).

2. None-safety fixes in V1 (frozen_kv_mtp_worker.py + eagle_info.py)
   * V1 draft() and verify() now handle batch.sampling_info=None and sampling_info.penalizer_orchestrator=None correctly. Required for the V2 overlap path, where sampling is deferred to the schedule stream. Backwards-compatible: V1 keeps working identically when the orchestrator is present.

3. spec_info.py dispatcher
   * supports_spec_v2() now returns True for FROZEN_KV_MTP (was False).
   * create_worker() dispatches to V2 when enable_overlap=True.

4. arg_groups/speculative_hook.py
   * disable_overlap_schedule=True is now conditional on the env var SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2 NOT being set.

Default behavior preserved (V1 + overlap disabled); experimental V2 opt-in available for further development.

Known limitations of the V2 worker (why it's experimental):

* The synthesized idle FrozenKVMTPDraftInput at end-of-decode hits a scheduler-side stash assertion (output_tokens_buf shape mismatch for empty bonus_tokens). Need to also synthesize empty bonus_tokens with the right shape (B == future_indices.indices.shape[0]).
* The per-step draft loop still runs as one block; full overlap potential requires the composition pattern (a la EAGLEWorkerV2's BaseDraftWorker split) rather than inheritance.
* CUDA graph capture on the V2 path is not verified.

The audit + design (full plan + invariant table + porting matrix + implementation skeleton) is preserved at: agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md (2400 lines, full EAGLE V2 reference + frozen-KV-specific invariants)

V1 path verified intact (smoke test + 5/5 parity) after this PR.

Stack base: pyc/feat-gemma4-ultimate-v2 (PR #21).

Recommendation: V2 enablement gated behind env var until follow-up work lands the proper composition-based worker.
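
The None-safety fix in item 2 amounts to guarding each link of the batch.sampling_info.penalizer_orchestrator chain. A minimal sketch follows, using SimpleNamespace objects in place of the real batch classes; get_penalizer, penalties_required, and the is_required attribute as used here are illustrative assumptions, not the PR's code.

```python
from types import SimpleNamespace


def get_penalizer(batch):
    """Return the penalizer orchestrator, or None if any link is missing."""
    sampling_info = getattr(batch, "sampling_info", None)
    if sampling_info is None:
        return None
    return getattr(sampling_info, "penalizer_orchestrator", None)


def penalties_required(batch):
    orchestrator = get_penalizer(batch)
    # `is_required` is treated as a hypothetical attribute for this illustration.
    return orchestrator is not None and bool(getattr(orchestrator, "is_required", False))


v1_batch = SimpleNamespace(
    sampling_info=SimpleNamespace(
        penalizer_orchestrator=SimpleNamespace(is_required=True)
    )
)
v2_batch = SimpleNamespace(sampling_info=SimpleNamespace(penalizer_orchestrator=None))
no_info = SimpleNamespace(sampling_info=None)

print(penalties_required(v1_batch))  # True
print(penalties_required(v2_batch))  # False
print(penalties_required(no_info))   # False
```

The V1 case behaves as before when the orchestrator is present, which matches the backwards-compatibility claim above.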

…eak fix)

Resolves two of the three V2 scheduler-side issues found during testing:

1. ferry size: bs is now batch.req_pool_indices.shape[0] (= future_indices size), not verify_output.accept_tokens.shape[0] (= flat total). accept_tokens contains all accepted tokens across reqs, so its shape was wrong for the per-req bonus_tokens contract the stash kernel expects.
2. per-req bonus_tokens: pick the LAST accepted token per req using cumsum(num_correct_drafts_per_req_cpu + 1) - 1 to index into the flat accept_tokens. Mirrors EAGLE V2's fill_bonus_tokens.
3. accept_lens for batch_result: spec-V2 batch_result_processor (_resolve_spec_overlap_tokens at batch_result_processor.py:488) asserts result.accept_lens is a CPU tensor. Build it from num_correct_drafts_per_req_cpu + 1.

After these fixes V2 works end-to-end through draft + verify + seed:

* boots cleanly
* routes prefill -> decode -> seed -> verify -> next iter correctly
* stash payload shapes pass the kernel contract
* request completes and returns the right output to the client

Remaining blocker (NOT fixed in this commit): on_idle invariant_checker detects a 24-token KV pool leak per finished request. Likely caused by the V1 verify path's free of rejected draft slots racing with the schedule-stream tear-down of the batch under overlap. Documented as a known limitation; V2 stays env-gated experimental until the leak is fixed (requires architectural understanding of pool-lifecycle ownership across overlap stream boundaries).

V1 path unchanged (still default; opt into V2 via SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1).

Stack base: pyc/feat-gemma4-mtp-spec-v2

github-actions Bot added the speculative-decoding label May 26, 2026

pyc96 mentioned this pull request May 26, 2026

feat(spec-v2): Option-alpha scaffolding -- FROZEN_KV_MTP via EAGLE V2 (experimental, env-gated, NOT yet functional) #26

Draft

pyc96 wants to merge 2 commits into pyc/feat-gemma4-ultimate-v2 from pyc/feat-gemma4-mtp-spec-v2
