[Gemma 4] Adding MTP support #24436
Merged
Qiaolin-Yu
reviewed
May 6, 2026
Collaborator
Qiaolin-Yu
left a comment
Since it's implemented as a separate spec algo, I think we can merge it safely anyway. We can do some refactoring/cleanup later if possible.
Collaborator
Author
Added a new stage-b CI test for the E4B variant. Will add the bigger-model test to nightly in a separate PR.
Collaborator
Author
/rerun-test test/registered/spec/test_frozen_kv_mtp.py
Contributor
✅
Collaborator
Author
/tag-and-rerun-ci
Collaborator
Author
/rerun-test test/registered/spec/test_frozen_kv_mtp.py
Contributor
✅
Collaborator
Author
/rerun-failed-ci two
5 tasks
Qiaolin-Yu
approved these changes
May 7, 2026
Dogacel
pushed a commit
to Dogacel/sglang-fork
that referenced
this pull request
May 8, 2026
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
LLThomas
pushed a commit
to LLThomas/sglang
that referenced
this pull request
May 8, 2026
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
5 tasks
LucQueen
pushed a commit
to LucQueen/sglang
that referenced
this pull request
May 12, 2026
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
vroomfondel
added a commit
to vroomfondel/dgxarley
that referenced
this pull request
May 13, 2026
**Gemma-4 MTP (Frozen-KV) speculative-decoding patch**: the `0.5.11-gemma4-sm121` tag carries a cherry-pick of upstream [PR #24436](sgl-project/sglang#24436) ("[Gemma 4] Adding MTP support", merged 2026-05-07, after the v0.5.11 release tag). It adds the dedicated `Gemma4AssistantForCausalLM` model and a new `FROZEN_KV_MTP` speculative algorithm (recurrent hidden-state draft loop with frozen target KV cache). At runtime SGLang auto-promotes `--speculative-algorithm NEXTN` to `FROZEN_KV_MTP` once the drafter is detected. Without this patch the stock NEXTN/EAGLE worker crashes with `ValueError: No module or parameter named 'model.language_model' in TransformersMultiModalForCausalLM` during drafter weight-load. Verified working on the 4-node DGX Spark cluster; see the 31B-it TESTLOG, [Test 07 (`num_steps=2`, `num_draft_tokens=3`)](https://github.com/vroomfondel/dgxarley/blob/main/TESTLOGS/sglang_nn4_tp4_ep1/gemma-4-31b-it/TESTLOG_nv580.142_sglang-0.5.11_gemma-4-31b-it_4n.md#mtp-sweep-tests-711--partial-15-cases-done): +98 % at n=1 (10.49 → 20.83 tok/s), +76 % at n=4 (44.06 → 77.67 tok/s), drafter acceptance rate median ~0.68, 5/5 requests stopped on natural EOS. The 26B-A4B MoE sibling's MTP sweep is still in progress ([TESTLOG](https://github.com/vroomfondel/dgxarley/blob/main/TESTLOGS/sglang_nn4_tp4_ep1/gemma-4-26b-a4b-it/TESTLOG_nv580.142_sglang-0.5.11_gemma-4-26b-a4b-it_4n.md)).
Jiminator
added a commit
to Jiminator/sglang
that referenced
this pull request
May 15, 2026
…2c1034

Two findings appended to the bisect report:

1. PR sgl-project#25335 ("Fix gpt oss triton kernels and upgrade flashinfer back to 0.6.11.post1") re-bumped flashinfer past PR sgl-project#25310's revert. The one-line fix in fp4_utils.py:22 (cute-dsl -> cuda) is therefore no longer sufficient on latest main: experiment G reproduces the strict cuda-side check from fp4Quantize.cpp:64 ("globalScale should have shape [1] or [num_tokens]"), identical to experiment C. The proper fix is now at the call site in compressed_tensors_w4a4_nvfp4_moe.py:315: collapse layer.w13_input_scale_quant (shape [num_experts]) to scalar [1] or per-token [num_tokens] before passing as global_scale.

2. The TP8+MTP variant has its own separate pre-existing regression, bisected to d2c1034 ("[Gemma 4] Adding MTP support", PR sgl-project#24436). That PR added _resolve_speculative_algorithm_alias in server_args.py:318-342, which unconditionally calls AutoConfig.from_pretrained on the draft path to detect Gemma4 drafts. It crashes on any draft in Mistral native format (params.json, no HF config.json), even when --speculative-algorithm is already explicit EAGLE.

Empirical proof for (2):
- d2c1034 + TP8+MTP-only test: FAIL with "Unrecognized model in ...Eagle. Should have a model_type key in its config.json", total wall time 60.7s (crashes before model load).
- f1395af (parent of d2c1034) + same test: PASS, gsm8k 0.949.

Both with flashinfer 0.6.8.post1, sglang-kernel 0.4.2.post1+cu130, torch 2.11.0+cu130, SGLANG_IS_IN_CI=true, SGLANG_ENABLE_JIT_DEEPGEMM=0, SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1.

Minimal fix for (2): wrap the AutoConfig.from_pretrained call in _resolve_speculative_algorithm_alias with try/except, or short-circuit when speculative_algorithm is already explicit and the user did not request NEXTN aliasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
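The try/except variant of the minimal fix proposed for (2) can be sketched as below. This is an illustration under stated assumptions, not the code in server_args.py: `resolve_alias`, `fake_gemma_loader`, and `fake_mistral_native_loader` are hypothetical names, and the config lookup is injected as a callable (standing in for `AutoConfig.from_pretrained`) so the sketch runs without transformers or a real checkpoint. Treating an unreadable config as "not a Gemma 4 assistant" is also an assumption.

```python
from typing import Callable, Optional, Sequence

GEMMA4_ASSISTANT_ARCH = "Gemma4AssistantForCausalLM"


def resolve_alias(
    algorithm: Optional[str],
    draft_path: Optional[str],
    load_architectures: Callable[[str], Sequence[str]],
) -> Optional[str]:
    """Hypothetical stand-in for the alias resolution described above."""
    if algorithm is None or draft_path is None:
        return algorithm
    try:
        # Stand-in for the HF config lookup; it may raise ValueError for
        # drafts that ship params.json but no HF config.json.
        architectures = load_architectures(draft_path)
    except (ValueError, OSError):
        # Assumption: an unreadable config is not a Gemma 4 assistant,
        # so the user's explicit choice is kept instead of crashing.
        return algorithm
    if GEMMA4_ASSISTANT_ARCH not in architectures:
        return algorithm
    if algorithm == "EAGLE3":
        raise ValueError("EAGLE3 is rejected for Gemma 4 assistant drafts")
    if algorithm in ("NEXTN", "EAGLE"):
        return "FROZEN_KV_MTP"
    return algorithm


def fake_gemma_loader(_path: str) -> Sequence[str]:
    # Hypothetical loader that reports a Gemma 4 assistant draft.
    return [GEMMA4_ASSISTANT_ARCH]


def fake_mistral_native_loader(_path: str) -> Sequence[str]:
    # Hypothetical loader that mimics the failure on a Mistral-native draft.
    raise ValueError("Unrecognized model")


promoted = resolve_alias("NEXTN", "some/draft", fake_gemma_loader)
kept = resolve_alias("EAGLE", "some/draft", fake_mistral_native_loader)
print(promoted, kept)
```

With the guard in place, a Mistral-native draft keeps its explicit `EAGLE` setting, while a Gemma 4 assistant draft is still promoted. The alternative short-circuit mentioned in the commit message would avoid the config lookup entirely when no aliasing is requested.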
Jiminator
added a commit
to Jiminator/sglang
that referenced
this pull request
May 15, 2026
…5407

The Mistral-Large-3 B200 nightly partition has been red because of TWO independent regressions sharing the same job. Keeping them in one document is misleading: different root causes, different fixes, different PRs.

This split:
- Creates mistral_large3_tp8_mtp_b200_bisect_report.md with all TP8+MTP-specific content (root cause d2c1034 / PR sgl-project#24436, the _resolve_speculative_algorithm_alias crash on Mistral-native-format drafts, the AutoConfig.from_pretrained ValueError, the empirical one-commit bisect d2c1034 vs f1395af, the proposed try/except fix, the maintainer-ready server log block, and the CI-visibility table).
- Strips the same content out of mistral_large3_nvfp4_b200_bisect_report.md, replacing it with cross-references in the header, Open Items, follow-up note, and TL;DR.
- Adds a PR sgl-project#25407 verification section to BOTH documents (NVFP4 doc records that PR sgl-project#25407 fixes its issue with gsm8k 0.957; TP8+MTP doc records that PR sgl-project#25407 explicitly does NOT touch server_args.py and the failure remains identical).

Run summary on PR sgl-project#25407 head e3fb4ee (1574s wall time, 8x B200, flashinfer 0.6.11.post1, sglang-kernel 0.4.2.post2+cu130, torch 2.11.0):
- TP8: PASS, gsm8k 0.953
- TP8+MTP: FAIL, unchanged ValueError (server_args.py:329)
- NVFP4: PASS, gsm8k 0.957

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
The Gemma 4 MTP model has now been renamed to google/gemma-4-E4B-it-assistant. FYI.
Motivation
Add Multi-Token Prediction (MTP) speculative decoding for Gemma 4. Each Gemma 4 target ships with a small "assistant" checkpoint trained for MTP. This PR introduces a new speculative algorithm, `FROZEN_KV_MTP`, that runs the assistant against the target's KV cache (the assistant has no KV of its own and carries a recurrent hidden state across draft steps, so it does not fit cleanly under EAGLE/EAGLE3 or NEXTN). `NEXTN` and `EAGLE` are auto-promoted to `FROZEN_KV_MTP` when `--speculative-draft-model-path` resolves to a `Gemma4AssistantForCausalLM`. `EAGLE3` is rejected for this draft architecture.
Supported Targets:
Both `Gemma4ForCausalLM` (text) and `Gemma4ForConditionalGeneration` (multimodal) targets are supported.
Installation
Usage
Launch Server
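The original launch command is not preserved on this page. As a rough sketch only, the Python below assembles an argument list from options named elsewhere in this thread (`--speculative-algorithm`, `--speculative-draft-model-path`). Several parts are assumptions: the target id `google/gemma-4-E4B-it` is a guess at the matching target, the draft id is the renamed assistant mentioned in the comments, the values 2 and 3 are taken from the downstream test log (`num_steps=2`, `num_draft_tokens=3`), and the `--speculative-num-steps` / `--speculative-num-draft-tokens` spellings should be checked against the installed SGLang version. `build_launch_argv` is a hypothetical helper.

```python
import shlex
from typing import List


def build_launch_argv(
    target: str,
    draft: str,
    num_steps: int = 2,
    num_draft_tokens: int = 3,
) -> List[str]:
    """Hypothetical helper: build an SGLang server command line as a list."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", target,
        # NEXTN is expected to be promoted to FROZEN_KV_MTP for this draft.
        "--speculative-algorithm", "NEXTN",
        "--speculative-draft-model-path", draft,
        "--speculative-num-steps", str(num_steps),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]


argv = build_launch_argv(
    target="google/gemma-4-E4B-it",            # assumed target model id
    draft="google/gemma-4-E4B-it-assistant",   # renamed assistant from the thread
)
print(shlex.join(argv))
```

Building the command as a list keeps each flag paired with its value and lets the same list be passed to `subprocess.run` without shell quoting concerns.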
When `FROZEN_KV_MTP` is active, the overlap scheduler and mixed chunked prefill are force-disabled, and `--max-running-requests` defaults to 48 if unset.
Accuracy Tests
MTP matches the target-only baseline across GPQA Diamond (CoT / generative) and GSM8K (CoT, 0-shot):
Topk = 1
Speed Tests and Profiling
Will follow up with `FROZEN_KV_MTP` speed benchmark and profiling later.
Modifications
- `gemma4_mtp.py` (`Gemma4AssistantForCausalLM`): target-embed + recurrent hidden via pre/post projection, owns its own `lm_head`, optional centroid-ordered logits head.
- `FROZEN_KV_MTP` in `speculative/spec_info.py` (with `FROZEN_KV_MTP_DRAFT` / `FROZEN_KV_MTP_VERIFY` spec input types).
- `frozen_kv_mtp_worker.py`, `frozen_kv_mtp_info.py`, `frozen_kv_mtp_utils.py`, `frozen_kv_mtp_cuda_graph_runner.py` (supports `topk=1` and tree verify for `topk>1`, with full CUDA graph capture).
- `kv_shared_layer_index` two-hop into a direct assistant-logical → target-physical layer map. Assistant KV writes are suppressed.
- `server_args.py`: alias resolution promotes `NEXTN` / `EAGLE` → `FROZEN_KV_MTP` for Gemma 4 assistant drafts and rejects `EAGLE3`. Force-disables overlap scheduler and mixed chunked prefill.
- `hf_transformers/config.py`: recognize `model_type == "gemma4_assistant"` for SWA attribute remap.
- `gemma4_causal.py` / `gemma4_mm.py`: expose `get_embed_and_head()` so the assistant can rebind to the target's input embedding at load time.
- `test/manual/models/test_gemma4_mtp.py` (target-only baseline → `topk=1` MTP → `topk>1` MTP).
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci