[Gemma4] Add test for MTP models by kpham-sgl · Pull Request #24552 · sgl-project/sglang

kpham-sgl · 2026-05-06T21:44:45Z

Motivation

Add test for PR #24436. Requires transformers v5.8.0.

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #26135220858
Latest PR Test (Extra): ❌ Run #26135220756

gemini-code-assist

Code Review

This pull request implements Frozen-KV MTP (Multi-Token Prediction) speculative decoding for Gemma 4 models, introducing a new assistant model class, worker, and CUDA graph runner. The review highlights critical issues regarding Tensor Parallelism compatibility in embedding and linear layers, suggests avoiding torch.cuda.empty_cache() to prevent performance degradation, recommends copying server_args for isolation, and advises using specific exceptions for better error handling.

I am having trouble creating individual review comments.

python/sglang/srt/models/gemma4_mtp.py (234)

Using torch.nn.functional.embedding directly on self.target_embed_weight will fail when Tensor Parallelism (TP) is enabled (TP > 1). In SGLang, target_embed_weight is typically a partitioned weight from a VocabParallelEmbedding. A direct lookup will only work for tokens within the local GPU's vocabulary range and will not perform the necessary communication (all-reduce) to produce the full embedding vector on all GPUs. This will lead to incorrect results or IndexError on GPUs where the token IDs are out of the local partition range.
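The failure mode described in this comment can be reproduced without GPUs. The following is a minimal sketch, assuming the vocabulary is sharded by contiguous row ranges across ranks. The helper names (`shard_table`, `naive_lookup`, `masked_lookup`, `all_reduce_sum`) are hypothetical and only simulate what a vocab-parallel embedding does on plain Python lists; they are not SGLang APIs.

```python
# Simulates a vocab-parallel embedding on plain Python lists (no GPUs needed).
# All helper names here are hypothetical and are not SGLang APIs.

def shard_table(table, tp_size):
    """Split embedding rows into contiguous per-rank shards."""
    per_rank = len(table) // tp_size  # assumes vocab size is divisible by tp_size
    return [table[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]

def naive_lookup(shard, token_ids):
    """Direct lookup on a local shard using global token ids (the pattern flagged above)."""
    return [shard[t] for t in token_ids]

def masked_lookup(shard, rank, token_ids):
    """Local lookup that emits zeros for ids owned by another rank."""
    per_rank = len(shard)
    start = rank * per_rank
    dim = len(shard[0])
    out = []
    for t in token_ids:
        if start <= t < start + per_rank:
            out.append(list(shard[t - start]))  # shift global id to local row
        else:
            out.append([0.0] * dim)
    return out

def all_reduce_sum(per_rank_outputs):
    """Stand-in for the collective: element-wise sum across ranks."""
    n_tokens = len(per_rank_outputs[0])
    dim = len(per_rank_outputs[0][0])
    return [
        [sum(out[i][j] for out in per_rank_outputs) for j in range(dim)]
        for i in range(n_tokens)
    ]

table = [[float(v), float(v) + 0.5] for v in range(8)]  # vocab=8, dim=2
shards = shard_table(table, tp_size=2)
token_ids = [1, 6]

partials = [masked_lookup(s, r, token_ids) for r, s in enumerate(shards)]
full = all_reduce_sum(partials)  # matches a lookup on the unsharded table
```

In this toy setup, `naive_lookup(shards[0], token_ids)` raises `IndexError` because token 6 lies outside rank 0's four rows, while `naive_lookup(shards[1], [1])` silently returns the row for token 5. Both outcomes correspond to the "incorrect results or IndexError" risk named in the review.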

python/sglang/srt/models/gemma4_mtp.py (111)

Using nn.Linear for lm_head bypasses Tensor Parallelism. For large vocabulary sizes (e.g., Gemma's 256k), this replicates a large weight matrix (approx. 1.5GB for bf16) on every GPU, which is memory-inefficient and inconsistent with SGLang's standard use of ColumnParallelLinear for output heads. This also applies to the centroids layer on line 125.
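The memory figure quoted above can be checked with simple arithmetic. This is a rough sketch, not a measurement: `VOCAB_SIZE` and `HIDDEN_SIZE` are assumptions (the hidden size is chosen only because it reproduces the roughly 1.5 GB estimate) and are not read from any Gemma 4 config.

```python
# Back-of-the-envelope memory estimate for an output head weight.
# VOCAB_SIZE and HIDDEN_SIZE are assumptions, not values read from a model config.
VOCAB_SIZE = 256 * 1024   # the "256k" vocabulary mentioned in the review
HIDDEN_SIZE = 3072        # assumed; chosen because it reproduces ~1.5 GB
BYTES_PER_BF16 = 2
GIB = 1024 ** 3

def head_bytes_per_gpu(vocab_size, hidden_size, tp_size, sharded):
    """Bytes of head weight held by each GPU."""
    total = vocab_size * hidden_size * BYTES_PER_BF16
    # A column-parallel head splits the vocab dimension across ranks;
    # a plain replicated linear layer keeps the full matrix on every rank.
    return total // tp_size if sharded else total

replicated = head_bytes_per_gpu(VOCAB_SIZE, HIDDEN_SIZE, tp_size=2, sharded=False)
sharded = head_bytes_per_gpu(VOCAB_SIZE, HIDDEN_SIZE, tp_size=2, sharded=True)
print(f"replicated: {replicated / GIB:.2f} GiB per GPU")
print(f"sharded (TP=2): {sharded / GIB:.2f} GiB per GPU")
```

Under these assumptions the replicated head costs 1.50 GiB on every GPU, and sharding at TP=2 halves that to 0.75 GiB per GPU.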

python/sglang/srt/models/gemma4_mtp.py (211)

Calling torch.cuda.empty_cache() inside a model method is generally discouraged as it can cause significant performance overhead due to GPU synchronization and fragmentation. It is better to let the high-level memory manager or the user handle cache clearing if necessary.

python/sglang/srt/speculative/frozen_kv_mtp_worker.py (117)

Modifying server_args.context_length in-place can have unintended side effects if the server_args object is shared with other components (like the target worker). It is safer to create a copy of the arguments for the draft worker to ensure isolation.

        import copy

        # Shallow-copy so the draft worker's override does not leak into the shared object.
        server_args = copy.copy(server_args)
        server_args.context_length = target_worker.model_runner.model_config.context_len
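The side effect this suggestion guards against can be shown in isolation. The sketch below uses a hypothetical `FakeServerArgs` dataclass as a stand-in for the real arguments object; the two helper functions are also illustrative and do not exist in SGLang.

```python
import copy
from dataclasses import dataclass

@dataclass
class FakeServerArgs:
    """Hypothetical stand-in for the real server args object."""
    context_length: int = 8192

def make_draft_args_in_place(server_args, draft_len):
    """Overrides the field on the caller's object (the pattern flagged above)."""
    server_args.context_length = draft_len
    return server_args

def make_draft_args_copied(server_args, draft_len):
    """Overrides the field on a shallow copy, leaving the caller's object alone."""
    draft_args = copy.copy(server_args)
    draft_args.context_length = draft_len
    return draft_args

shared = FakeServerArgs()

make_draft_args_copied(shared, 4096)
after_copy = shared.context_length      # unchanged: the override went to the copy

make_draft_args_in_place(shared, 4096)
after_in_place = shared.context_length  # changed: every holder of `shared` sees it
```

Note that `copy.copy` is shallow: it isolates reassigned attributes such as `context_length`, but nested mutable fields (lists, dicts) would still be shared between the original and the copy.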

python/sglang/srt/server_args.py (3465)

Hardcoding max_running_requests to 48 when using FROZEN_KV_MTP seems arbitrary and might be too restrictive for some hardware configurations. Consider making this a configurable default or providing a more detailed rationale for this specific limit.
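One way to read "configurable default" is to apply the cap only when the user has not set a value. The following is a hedged sketch of that idea; `resolve_max_running_requests` and `FROZEN_KV_MTP_DEFAULT_CAP` are hypothetical names, and the actual logic in `server_args.py` may be structured differently.

```python
from typing import Optional

# Value quoted in the review; the rationale for it is not given there.
FROZEN_KV_MTP_DEFAULT_CAP = 48

def resolve_max_running_requests(
    user_value: Optional[int], uses_frozen_kv_mtp: bool
) -> Optional[int]:
    """Hypothetical helper: prefer an explicit user setting over the default cap."""
    if user_value is not None:
        return user_value  # an explicit setting always wins
    # Fall back to the cap only for the frozen-KV MTP path; otherwise leave unset.
    return FROZEN_KV_MTP_DEFAULT_CAP if uses_frozen_kv_mtp else None
```

With this shape, `resolve_max_running_requests(None, True)` yields the cap of 48, while `resolve_max_running_requests(128, True)` keeps the user's 128.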

python/sglang/srt/speculative/frozen_kv_mtp_cuda_graph_runner.py (146-149)

Raising a generic Exception is discouraged. Using a more specific exception like RuntimeError is preferred for better error handling and clarity.

        except RuntimeError as e:
            raise RuntimeError(
                f"Capture frozen-KV MTP cuda graph failed: {e}\n"
                f"{CUDA_GRAPH_CAPTURE_FAILED_MSG}"
            ) from e
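The `from e` clause in the suggestion preserves the original error as the cause of the new one. A minimal, self-contained illustration follows; `capture_graph` and `capture_with_context` are hypothetical placeholders rather than functions from the CUDA graph runner.

```python
def capture_graph():
    """Hypothetical placeholder that fails the way a graph capture might."""
    raise RuntimeError("out of memory")

def capture_with_context():
    try:
        capture_graph()
    except RuntimeError as e:
        # Re-raise with extra context; `from e` records the original as __cause__.
        raise RuntimeError(f"Capture failed: {e}") from e

try:
    capture_with_context()
except RuntimeError as err:
    message = str(err)     # the wrapped message with added context
    cause = err.__cause__  # the original exception, kept for tracebacks
```

One trade-off worth noting: narrowing the `except` clause to `RuntimeError`, as the suggestion does, means other exception types raised during capture propagate unwrapped and without the added hint message.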

Per #25197, stage-shaped per-commit CUDA suites must register via stage=/runner_config= kwargs; the legacy suite= form is reserved for nightly/stress/weekly and non-stage backends.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kpham-sgl · 2026-05-19T18:13:58Z

/tag-run-ci-label

The per-commit CUDA suites were renamed from stage-* to base-* on main; test_frozen_kv_mtp was still registering stage="stage-b", which fails validate_all_suites with "Tests registered to invalid suites".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
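The kind of check that rejects a stale suite name can be sketched as an allowlist lookup. Everything below is an assumption for illustration: the suite names in `VALID_SUITES`, the `find_invalid_suites` helper, and the `test_other.py` entry are hypothetical, and the real `validate_all_suites` may work differently.

```python
# Hypothetical allowlist; the real set of suite names is not given in this thread.
VALID_SUITES = {"base-a", "base-b", "extra-a"}

def find_invalid_suites(registrations):
    """Return {test_file: suite} for registrations whose suite is not recognized."""
    return {
        name: suite
        for name, suite in registrations.items()
        if suite not in VALID_SUITES
    }

registrations = {
    "test_frozen_kv_mtp.py": "stage-b",  # stale name from before the rename
    "test_other.py": "base-b",           # hypothetical, already migrated
}
invalid = find_invalid_suites(registrations)
if invalid:
    print(f"Tests registered to invalid suites: {invalid}")
```

A check like this fails fast at registration time, so a renamed suite surfaces as an explicit error instead of a test that silently never runs.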

These tests are gemma4-specific and only relevant for PRs touching that model path. Move them off the nightly schedule onto extra-a so they:
- gate on per-commit signal when a PR opts in with run-ci-extra label
- stop consuming nightly slot for changes unrelated to gemma4

Both fit extra-a-test-2-gpu-large (TP=2, ~720s each).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eval/ skews nightly (its other 2 files are nightly=True); these two are now extra-a per-commit speculative-decoding tests. Their sibling test_frozen_kv_mtp.py already lives in spec/, and spec/ uses the _extra suffix to mark extra-* registrations (test_spec_ngram_extra.py, test_spec_standalone_extra.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kpham-sgl · 2026-05-20T02:58:34Z

/rerun-failed-ci

kpham-sgl requested review from Qiaolin-Yu, Ying1123, hnyls2002 and merrymercy as code owners May 6, 2026 21:44

kpham-sgl changed the base branch from main to gemma4-mtp-fin May 6, 2026 21:44

gemini-code-assist Bot reviewed May 6, 2026

Base automatically changed from gemma4-mtp-fin to main May 7, 2026 21:08

test

b780757

kpham-sgl force-pushed the gemma4-mtp-add-test branch from bced3e0 to b780757 Compare May 10, 2026 08:03

kpham-sgl mentioned this pull request May 17, 2026

[Spec] FrozenKVMTP fold assistant seed into captured draft graph #25539

Open

5 tasks

kpham-sgl changed the title ~~[WIP][Gemma4] Add test for MTP models~~ [Gemma4] Add test for MTP models May 19, 2026

github-actions Bot added the run-ci label May 19, 2026

kpham-sgl and others added 5 commits May 19, 2026 15:31

Merge branch 'main' into gemma4-mtp-add-test

eb52e41

Merge branch 'main' into gemma4-mtp-add-test

4f4af21

kpham-sgl wants to merge 7 commits into main from gemma4-mtp-add-test
