[Spec] `FrozenKVMTP` fold assistant seed into captured draft graph by kpham-sgl · Pull Request #25539 · sgl-project/sglang

kpham-sgl · 2026-05-17T18:09:29Z

Motivation

Frozen-KV MTP ran a one-token eager assistant "seed" forward before the
captured recurrent draft loop. Seed and recurrent iters share the same
seq_lens - 1 rope position against frozen target KV, so splitting them
just costs an extra launch + sync per decode.
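
The launch-count argument above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `toy_step`, `draft_split`, and `draft_folded` are not functions from the PR, and the "launch" counter only mimics the eager-forward-plus-graph-replay split described in the text. Under those assumptions, folding the seed into iteration 0 leaves the drafted tokens unchanged while removing one launch:

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for one assistant forward pass: maps
# (token, hidden) to the next (token, hidden). Not the real model call.
Step = Callable[[int, int], Tuple[int, int]]


def toy_step(token: int, hidden: int) -> Tuple[int, int]:
    new_hidden = (hidden * 31 + token) % 1009
    return new_hidden % 97, new_hidden


def draft_split(step: Step, bonus_token: int, hidden: int,
                num_steps: int) -> Tuple[List[int], int]:
    """Old shape: an eager seed forward, then a separately launched loop."""
    launches = 1  # eager seed launch
    token, hidden = step(bonus_token, hidden)
    drafted = [token]
    if num_steps > 1:
        launches += 1  # one replay of the recurrent loop
    for _ in range(num_steps - 1):
        token, hidden = step(token, hidden)
        drafted.append(token)
    return drafted, launches


def draft_folded(step: Step, bonus_token: int, hidden: int,
                 num_steps: int) -> Tuple[List[int], int]:
    """New shape: the seed is iteration 0 of the same loop."""
    launches = 1  # a single replay covers seed and recurrent iterations
    token = bonus_token
    drafted: List[int] = []
    for _ in range(num_steps):
        token, hidden = step(token, hidden)
        drafted.append(token)
    return drafted, launches
```

Calling both variants with the same inputs, for example `draft_split(toy_step, 5, 7, 4)` and `draft_folded(toy_step, 5, 7, 4)`, yields the same token list; only the launch count differs.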

Modifications

- frozen_kv_mtp_worker.py: _run_assistant_seed_step now only stashes bonus_tokens / hidden_states; draft_forward runs the seed as iter 0 of the recurrent loop (topk > 1 replicate/slice done inline). Attention init is no longer gated on num_steps > 1. Drops the stale topk == 1 shortcut in _init_draft_attn_backend.
- frozen_kv_mtp_cuda_graph_runner.py: adds bonus_tokens to the input buffers; stops copying topk_p / topk_index (now produced by the captured seed iter). Wraps _replay() in a record_function span.
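
As a rough illustration of the buffer change, here is a pure-Python imitation of replaying captured work over fixed-size buffers. `ToyGraphRunner` and its arithmetic are assumptions made for the example, not the runner in this PR; the sketch only shows the discipline described above, where `bonus_tokens` is copied in as an input and `topk_index` is an output of the replay rather than something the caller copies in:

```python
from typing import List


class ToyGraphRunner:
    """Pure-Python imitation of replaying captured work over fixed buffers."""

    def __init__(self, max_bs: int, topk: int) -> None:
        self.max_bs = max_bs
        self.topk = topk
        # Inputs read by the replay; bonus_tokens is the added buffer.
        self.bonus_tokens: List[int] = [0] * max_bs
        self.hidden_states: List[int] = [0] * max_bs
        # Output of the (toy) seed iteration; the caller never copies into it.
        self.topk_index: List[List[int]] = [[0] * topk for _ in range(max_bs)]

    def _replay(self) -> None:
        # Stand-in for the captured seed iteration: derive top-k candidates
        # from the values currently held in the static input buffers.
        for i in range(self.max_bs):
            base = self.bonus_tokens[i] + self.hidden_states[i]
            self.topk_index[i] = [base + k for k in range(self.topk)]

    def replay(self, bonus_tokens: List[int],
               hidden_states: List[int]) -> List[List[int]]:
        bs = len(bonus_tokens)
        if bs > self.max_bs or len(hidden_states) != bs:
            raise ValueError("batch does not fit the captured buffers")
        # Copy into the fixed buffers in place; their length never changes.
        self.bonus_tokens[:bs] = bonus_tokens
        self.hidden_states[:bs] = hidden_states
        self._replay()
        # Return only the rows that belong to the live batch.
        return [row[:] for row in self.topk_index[:bs]]
```

A real CUDA graph replay reads device tensors captured at fixed addresses; the list slices here only mirror the "copy in, replay, slice out" order of operations.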

No behavior change on verify / accept / KV-write paths.

Accuracy Tests

Tested with #24552 on the 31B model.

| topk | num_draft_tokens | GSM8K score | threshold | avg_spec_accept_length | result |
| --- | --- | --- | --- | --- | --- |
| 1 | 6 | 0.8150 | 0.7750 | 4.4767 | PASS |
| 3 | 12 | 0.8000 | 0.7750 | 5.0259 | PASS |

Speed Tests and Profiling

Before

After

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ✅ Run #26061044934
Latest PR Test (Extra): ⚠️ Not enabled -- add run-ci-extra label to opt in.

gemini-code-assist

Code Review

This pull request refactors the Frozen-KV MTP (Multi-Token Prediction) implementation by moving the assistant seed forward pass into the captured CUDA graph within draft_forward. Key changes include adding bonus_tokens to the input buffers, updating the worker to stash seed inputs for the graph runner, and enforcing the use of the Triton attention backend. Feedback suggests improving the API by removing unused arguments from the _run_assistant_seed_step signature and optimizing performance by using torch.empty instead of torch.zeros for tensors that are immediately overwritten.

gemini-code-assist · 2026-05-17T18:12:22Z

        mm_input_embeds: Optional[torch.Tensor] = None,
        draft_input: Optional[FrozenKVMTPDraftInput] = None,
    ) -> None:
-        """Run the one-token assistant seed step against frozen target KV."""
+        """Stash seed inputs on ``batch.spec_info``; the forward runs inside
+        the captured draft graph (see ``draft_forward``'s seed iter)."""
+        del seq_lens_cpu, mm_input_embeds, draft_input


The arguments seq_lens_cpu, mm_input_embeds, and draft_input are no longer used in _run_assistant_seed_step because the assistant forward pass has been moved into the captured draft graph (executed within draft_forward). While they are currently being explicitly deleted to avoid unused variable warnings, it would be cleaner to remove them from the function signature and update the callers (forward_draft_extend and forward_draft_extend_after_decode) accordingly to maintain a clean API.
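
A minimal sketch of what the trimmed signature might look like, assuming the function only needs the batch plus the two values it stashes. `ToySpecInfo`, `ToyBatch`, and `stash_seed_inputs` are hypothetical names for illustration; the real method, its remaining parameters, and its callers are not shown in this thread:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToySpecInfo:
    # Hypothetical container standing in for batch.spec_info.
    bonus_tokens: Optional[List[int]] = None
    hidden_states: Optional[List[float]] = None


@dataclass
class ToyBatch:
    spec_info: ToySpecInfo


def stash_seed_inputs(batch: ToyBatch, bonus_tokens: List[int],
                      hidden_states: List[float]) -> None:
    """Stash seed inputs only; no forward pass runs here.

    The parameters that were previously accepted and then deleted are
    simply absent, so callers cannot pass values that get ignored.
    """
    batch.spec_info.bonus_tokens = list(bonus_tokens)
    batch.spec_info.hidden_states = list(hidden_states)
```

The trade-off is that every caller has to be updated in the same change, which is why the `del` approach can be a reasonable interim step.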

gemini-code-assist · 2026-05-17T18:12:22Z

+        stashed.topk_p = torch.zeros(
+            (bs, self.topk), device=device, dtype=torch.float32
+        )
+        stashed.topk_index = torch.zeros(
+            (bs, self.topk), device=device, dtype=torch.int64
+        )


Since topk_p and topk_index are only initialized here for compatibility with filter_batch/merge_batch and are overwritten by the captured seed iteration in draft_forward, you can use torch.empty instead of torch.zeros to avoid unnecessary zero-initialization overhead on the GPU.

Suggested change

-        stashed.topk_p = torch.zeros(
-            (bs, self.topk), device=device, dtype=torch.float32
-        )
-        stashed.topk_index = torch.zeros(
-            (bs, self.topk), device=device, dtype=torch.int64
-        )
+        stashed.topk_p = torch.empty(
+            (bs, self.topk), device=device, dtype=torch.float32
+        )
+        stashed.topk_index = torch.empty(
+            (bs, self.topk), device=device, dtype=torch.int64
+        )

kpham-sgl · 2026-05-17T18:45:59Z

/tag-and-rerun-ci

Co-authored-by: Cursor <cursoragent@cursor.com>

kpham-sgl requested review from Qiaolin-Yu and hnyls2002 as code owners May 17, 2026 18:09

gemini-code-assist Bot reviewed May 17, 2026

github-actions Bot added the run-ci label May 17, 2026

kpham-sgl mentioned this pull request May 17, 2026

[Spec] Add trtllm_mha support for Gemma 4 MTP draft attention backend #25545

Open

5 tasks

kpham-sgl force-pushed the kp/gemma4-mtp-remove-seed branch from 903a193 to 4d9203d Compare May 17, 2026 22:14

kpham-sgl and others added 3 commits May 18, 2026 21:15

remove unnecessary seed step

347aa8c

remove stale topk=1 path

6c331e9

trim verbose comments

ee7c80c

Co-authored-by: Cursor <cursoragent@cursor.com>

kpham-sgl force-pushed the kp/gemma4-mtp-remove-seed branch from 4d9203d to ee7c80c Compare May 18, 2026 21:18

kpham-sgl wants to merge 3 commits into main from kp/gemma4-mtp-remove-seed