feat(spec-v2): Frozen-KV MTP overlap-scheduling worker (experimental opt-in) + V1 None-safety fixes #23

Draft
pyc96 wants to merge 2 commits into pyc/feat-gemma4-ultimate-v2 from pyc/feat-gemma4-mtp-spec-v2

@pyc96 pyc96 commented May 26, 2026

Summary

Stacked on PR #21 (ULT v2). Adds FrozenKVMTPWorkerV2 spec-V2 infrastructure for FROZEN_KV_MTP behind an experimental env-var opt-in (SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1). Default behavior unchanged: V1 worker + disable_overlap_schedule=True.
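
The opt-in gate can be pictured roughly as follows. This is a minimal sketch, not the code in speculative_hook.py: the helper name `resolve_overlap_settings` and its dict return value are hypothetical, and treating only the literal value "1" as enabled is an assumption. Only the environment-variable name, the two worker names, and the default come from this PR.

```python
import os

ENV_FLAG = "SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2"  # name taken from this PR


def resolve_overlap_settings(environ=None):
    """Hypothetical helper: pick the worker and overlap setting from the env flag."""
    environ = os.environ if environ is None else environ
    # Assumption: only the literal value "1" opts in to the experimental path.
    experimental_v2 = environ.get(ENV_FLAG) == "1"
    return {
        "worker": "FrozenKVMTPWorkerV2" if experimental_v2 else "FrozenKVMTPWorker",
        # Default (flag unset): V1 worker with overlap scheduling disabled.
        "disable_overlap_schedule": not experimental_v2,
    }


default = resolve_overlap_settings({})
opted_in = resolve_overlap_settings({ENV_FLAG: "1"})
```

With the flag unset, the sketch keeps the V1 worker and leaves overlap scheduling disabled, which matches the "default behavior unchanged" claim.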

State of the V2 implementation

| Component | Status |
| --- | --- |
| Boots cleanly with overlap enabled | OK |
| Dispatcher routes to V2 worker | OK |
| Inherits all V1 correctness-tested methods (draft, verify, _run_assistant_seed_step, etc.) | OK |
| Implements BaseSpecWorker interface (target_worker, draft_worker, clear_cache_pool) | OK |
| forward_batch_generation(batch, on_publish=None) with verify-end fence | OK |
| Ferries next_draft_input with correct shape ([B] bonus_tokens, [B, topk] topk_p/index, [B, H] hidden_states) | OK |
| Produces accept_lens for batch result processor | OK |
| Routes prefill -> decode -> seed -> verify -> next iter end-to-end | OK |
| Request completes and returns correct output to client | OK (single request validated) |
| on_idle invariant_checker passes (KV pool leak check) | FAIL -- 24-token leak per finished req |

Bugs fixed during V2 development

  1. V1 draft() and verify() None-safety on batch.sampling_info.penalizer_orchestrator (real fix benefiting V1 too). Required because V2 defers sampling to the schedule stream; the worker side gets penalizer_orchestrator=None.

  2. V2 ferry size: was using verify_output.accept_tokens.shape[0] (= flat total across all reqs); now uses batch.req_pool_indices.shape[0] (= per-req batch size).

  3. V2 per-req bonus token: was passing the flat accept_tokens; now picks the last accepted token per req via cumsum(num_correct_drafts_per_req_cpu + 1) - 1, mirroring EAGLE V2's fill_bonus_tokens kernel.

  4. V2 accept_lens in batch result: _resolve_spec_overlap_tokens at batch_result_processor.py:488 asserts result.accept_lens.is_cpu. Now populated.
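
Fixes 2 to 4 amount to a small indexing rule, sketched below in plain Python so it runs without a GPU. It assumes the flat accept_tokens buffer stores each request's accepted tokens contiguously and in request order; the function name `per_req_bonus_tokens` is hypothetical, and the worker itself operates on tensors rather than lists.

```python
from itertools import accumulate


def per_req_bonus_tokens(flat_accept_tokens, num_correct_drafts_per_req):
    """Pick the last accepted token of each request from the flat token list."""
    # Each request accepts its correct drafts plus one bonus token.
    accept_lens = [n + 1 for n in num_correct_drafts_per_req]
    if sum(accept_lens) != len(flat_accept_tokens):
        raise ValueError("flat token count does not match per-request lengths")
    # cumsum(num_correct_drafts + 1) - 1 gives the index of each request's last token.
    last_idx = [end - 1 for end in accumulate(accept_lens)]
    return [flat_accept_tokens[i] for i in last_idx], accept_lens


# Three requests accepting 2, 0 and 1 draft tokens (plus one bonus token each).
flat = [11, 12, 13, 21, 31, 32]
bonus, accept_lens = per_req_bonus_tokens(flat, [2, 0, 1])
# bonus == [13, 21, 32], accept_lens == [3, 1, 2]
```

The result has one entry per request (length B), whereas the flat buffer has six entries here, which is the shape mismatch that fix 2 addresses.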

Known limitation (NOT fixed)

The on_idle invariant_checker detects a 24-token KV pool leak per finished request:

ValueError: pool memory leak detected! [full] total=106600, available=106570,
evictable=0, protected=6, session_held=0, uncached=0

24 ~= 20 (max_tokens) + 4 (speculative_num_draft_tokens), suggesting the V1 verify path's free of rejected draft slots races with the schedule-stream tear-down of the batch under overlap.
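
The 24-token figure can be recovered from the counters in the error message, assuming the leak is whatever part of the pool is neither available nor held in one of the listed categories. That accounting rule is an assumption made for illustration, not something read from the invariant checker.

```python
# Counters copied from the ValueError above.
counters = {
    "total": 106600,
    "available": 106570,
    "evictable": 0,
    "protected": 6,
    "session_held": 0,
    "uncached": 0,
}


def unaccounted_tokens(c):
    """Tokens that are neither available nor attributed to a listed category."""
    accounted = (
        c["available"] + c["evictable"] + c["protected"]
        + c["session_held"] + c["uncached"]
    )
    return c["total"] - accounted


leaked = unaccounted_tokens(counters)  # 106600 - 106576 = 24
```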

Fixing this requires an architectural understanding of how SGLang assigns pool-lifecycle ownership across overlap stream boundaries. The audit identifies the next-step direction (agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md section 14.1, "The target KV pool swap must NOT race with overlap"). The proper fix is likely one of:

  • Wrap the verify-side free in a stream-event sync that ensures completion before iter N+1's schedule-side cleanup runs
  • Or migrate to the full composition-based V2 pattern (BaseDraftWorker split per the audit's section 15 skeleton, ~600 LOC)
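
The ordering that the first option asks for can be illustrated with a CPU-only analogy. The sketch below uses Python threads and threading.Event in place of CUDA streams and events, and a toy SlotPool class in place of SGLang's allocator, so every name in it is hypothetical. It shows only the fence (cleanup waits until the verify-side free has completed), not the actual pool code. On the GPU side, the comparable primitives would be torch.cuda.Event with record() and Stream.wait_event().

```python
import threading


class SlotPool:
    """Toy stand-in for a KV slot pool; not SGLang's allocator."""

    def __init__(self, size):
        self.free_slots = set(range(size))
        self.lock = threading.Lock()

    def alloc(self, n):
        with self.lock:
            return [self.free_slots.pop() for _ in range(n)]

    def free(self, slots):
        with self.lock:
            self.free_slots.update(slots)


order = []  # records which side ran first


def verify_side(pool, rejected, done):
    pool.free(rejected)          # return rejected draft slots
    order.append("verify_free")
    done.set()                   # signal that the verify-side free completed


def schedule_side(pool, remaining, done):
    done.wait()                  # fence: no tear-down before the free is visible
    order.append("schedule_cleanup")
    pool.free(remaining)


pool = SlotPool(8)
slots = pool.alloc(6)
done = threading.Event()
threads = [
    threading.Thread(target=schedule_side, args=(pool, slots[2:], done)),
    threading.Thread(target=verify_side, args=(pool, slots[:2], done)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
leaked = 8 - len(pool.free_slots)
```

Even though the schedule-side thread is started first, it cannot run its cleanup until the event is set, so every slot is returned exactly once.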

What this PR ships (value)

| Item | Value |
| --- | --- |
| V1 None-safety in frozen_kv_mtp_worker.py + eagle_info.py | Real bug fixes for V1 paths under future scenarios where penalizer becomes None |
| FrozenKVMTPWorkerV2 with correct ferry contract | Removes the NotImplementedError stub; next implementer can focus exclusively on the KV-leak fix |
| Dispatcher + opt-in plumbing | Once the leak is fixed, flip the default by removing the env-var check |
| 2409-line audit at agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md | Comprehensive EAGLE V2 walkthrough + frozen-KV invariants + porting matrix |

Reproducer

SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1 python -m sglang.launch_server \
  --model-path google/gemma-4-31B-it --dtype bfloat16 --trust-remote-code --tp-size 2 \
  --host 0.0.0.0 --port 30000 --attention-backend triton \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-31B-it-assistant \
  --speculative-num-steps 3 --speculative-num-draft-tokens 4 --speculative-eagle-topk 1 \
  --max-running-requests 80 --random-seed 1

# Single request works:
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"google/gemma-4-31B-it","messages":[{"role":"user","content":"What is 2+2?"}],"temperature":0,"max_tokens":15}'

# After the request finishes, the server crashes on on_idle invariant check.
# That is the documented known limitation -- the 24-token KV leak.

Production recommendation

Use SGLang no-MTP (per PR #21). The MTP gap is structural and this PR does not close it yet.

Stack base

pyc/feat-gemma4-ultimate-v2 (PR #21)

Files

  • python/sglang/srt/speculative/frozen_kv_mtp_worker_v2.py (+363 lines vs original stub)
  • python/sglang/srt/speculative/frozen_kv_mtp_worker.py (+8/-3 None-safety)
  • python/sglang/srt/speculative/eagle_info.py (+11/-4 None-safety)
  • python/sglang/srt/speculative/spec_info.py (+13/-3 dispatcher)
  • python/sglang/srt/arg_groups/speculative_hook.py (+18/-7 opt-in)
  • agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md (2409-line audit)

… + V1 None-safety fixes

Wires the spec-V2 (overlap-scheduling) infrastructure for FROZEN_KV_MTP
behind an experimental env-var opt-in (SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1).
Default behavior unchanged: V1 worker + disable_overlap_schedule=True.

Why -- The MTP gap audit traced the SGLang-vs-vLLM MTP gap (-50% tok/s
on both 26B and 31B) to the fact that FROZEN_KV_MTP forces
disable_overlap_schedule=True because FrozenKVMTPWorkerV2 was a
NotImplementedError stub.  Each scheduler step blocks on draft + verify
+ seed (= num_steps + 2 = 5 GPU forwards) with zero overlap with the
next iter's CPU prep.  vLLM's async scheduler overlaps these and gets
~2x tok/s at the same nominal max-running-requests.

What this PR ships:

1. FrozenKVMTPWorkerV2 (frozen_kv_mtp_worker_v2.py, +250 lines)
   * Inherits FrozenKVMTPWorker (reuses every correctness-tested
     method: draft/verify/seed/forward_target_extend/init_cuda_graphs).
   * Implements BaseSpecWorker (target_worker / draft_worker /
     clear_cache_pool properties + setter for the V1 ctor's
     self.target_worker = ... assignment).
   * Adds draft_runner alias for kv_cache_builder.get_draft_kv_pool's
     EAGLE-V2-style accessor.
   * Adds forward_batch_generation(batch, on_publish=None) that fires
     the on_publish callback at the verify-end fence point (so the
     scheduler's future_map publishes new seq_lens BEFORE the seed
     blocks the forward stream).
   * Ferries next_draft_input across iterations (synthesizes an idle
     FrozenKVMTPDraftInput when all reqs finish so future_map.stash
     doesn't crash on uninitialized topk_p/topk_index).

2. None-safety fixes in V1 (frozen_kv_mtp_worker.py + eagle_info.py)
   * V1 draft() and verify() now handle batch.sampling_info=None and
     sampling_info.penalizer_orchestrator=None correctly.  Required
     for V2 overlap path where sampling is deferred to schedule stream.
     Backwards-compatible: V1 keeps working identically when
     orchestrator is present.

3. spec_info.py dispatcher
   * supports_spec_v2() now returns True for FROZEN_KV_MTP (was False).
   * create_worker() dispatches to V2 when enable_overlap=True.

4. arg_groups/speculative_hook.py
   * disable_overlap_schedule=True is now conditional on the env var
     SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2 NOT being set.  Default
     behavior preserved (V1 + overlap disabled); experimental V2
     opt-in available for further development.
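
The None-safety change in item 2 follows a simple guard pattern, sketched here with stand-in dataclasses. The classes, the helper `needs_penalties`, and the `is_required` field are hypothetical illustrations rather than the types in frozen_kv_mtp_worker.py or eagle_info.py; the point is only that both sampling_info and penalizer_orchestrator may be None on the V2 overlap path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Orchestrator:
    """Hypothetical stand-in for the penalizer orchestrator."""
    is_required: bool = False


@dataclass
class SamplingInfo:
    penalizer_orchestrator: Optional[Orchestrator] = None


@dataclass
class Batch:
    sampling_info: Optional[SamplingInfo] = None


def needs_penalties(batch: Batch) -> bool:
    """Return True only when an orchestrator exists and asks for penalties."""
    info = batch.sampling_info
    if info is None:
        return False  # sampling deferred: nothing to apply on the worker side
    orchestrator = info.penalizer_orchestrator
    return orchestrator is not None and orchestrator.is_required


no_info = needs_penalties(Batch())
no_orchestrator = needs_penalties(Batch(SamplingInfo()))
with_orchestrator = needs_penalties(Batch(SamplingInfo(Orchestrator(is_required=True))))
```

When an orchestrator is present the guard falls through to the existing check, which is why the V1 path keeps behaving as before.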

Known limitations of the V2 worker (why it's experimental):

* The synthesized idle FrozenKVMTPDraftInput at end-of-decode hits a
  scheduler-side stash assertion (output_tokens_buf shape mismatch
  for empty bonus_tokens).  Need to also synthesize empty bonus_tokens
  with the right shape (B == future_indices.indices.shape[0]).
* The per-step draft loop still runs as one block; full overlap
  potential requires composition pattern (a la EAGLEWorkerV2's
  BaseDraftWorker split) rather than inheritance.
* Cuda graph capture on the V2 path is not verified.

The audit + design (full plan + invariant table + porting matrix +
implementation skeleton) is preserved at:
  agent-pod/runs/20260525_mtp_v2/analysis/eagle_v2_pattern.md
  (2400 lines, full EAGLE V2 reference + frozen-KV-specific invariants)

V1 path verified intact (smoke test + 5/5 parity) after this PR.

Stack base: pyc/feat-gemma4-ultimate-v2 (PR #21).
Recommendation: V2 enablement gated behind env var until follow-up
work lands the proper composition-based worker.
…eak fix)

Resolves three of the four V2 scheduler-side issues found during testing:

1. ferry size: bs is now batch.req_pool_indices.shape[0] (= future_indices
   size), not verify_output.accept_tokens.shape[0] (= flat total).
   accept_tokens contains all accepted tokens across reqs, so its shape
   was wrong for the per-req bonus_tokens contract the stash kernel
   expects.

2. per-req bonus_tokens: pick the LAST accepted token per req using
   cumsum(num_correct_drafts_per_req_cpu + 1) - 1 to index into the
   flat accept_tokens. Mirrors EAGLE V2's fill_bonus_tokens.

3. accept_lens for batch_result: spec-V2 batch_result_processor
   (_resolve_spec_overlap_tokens at batch_result_processor.py:488)
   asserts result.accept_lens is a CPU tensor. Build it from
   num_correct_drafts_per_req_cpu + 1.

After these fixes V2 works end-to-end through draft + verify + seed:
* boots cleanly
* routes prefill -> decode -> seed -> verify -> next iter correctly
* stash payload shapes pass the kernel contract
* request completes and returns the right output to the client

Remaining blocker (NOT fixed in this commit): on_idle invariant_checker
detects a 24-token KV pool leak per finished request. Likely caused by
the V1 verify path's free of rejected draft slots racing with the
schedule-stream tear-down of the batch under overlap. Documented as a
known limitation; V2 stays env-gated experimental until the leak is
fixed (requires architectural understanding of pool-lifecycle ownership
across overlap stream boundaries).

V1 path unchanged (still default; opt into V2 via
SGLANG_FROZEN_KV_MTP_EXPERIMENTAL_V2=1).

Stack base: pyc/feat-gemma4-mtp-spec-v2