fix(gemma4): register image/video/audio token_regex for HF-expanded prompts by BiggieW · Pull Request #26320 · sgl-project/sglang

BiggieW · 2026-05-25T19:22:55Z

Motivation

Gemma4SGLangProcessor registers only image_token_id/video_token_id/audio_token_id on MultimodalSpecialTokens. With no *_token_regex provided, parse_regex() falls back to re.escape(token_str), which only matches a single bare placeholder.

This breaks clients that send prompts in the format produced by transformers.models.gemma4.processing_gemma4, where each multimodal item is expanded into a BOI + N×patch_token + EOI block:

| Modality | Expanded form (per HF Gemma 4 processor) |
| --- | --- |
| image | `<\|image>` + N × `<\|image\|>` + `<image\|>` |
| video | `<\|image>` + N × `<\|video\|>` + `<image\|>` (reuses image BOI/EOI) |
| audio | `<\|audio>` + N × `<\|audio\|>` + `<audio\|>` |
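As a rough illustration of the table above, the expanded form can be built with a few lines of Python. The `EXPANSIONS` mapping and `expand_placeholder` helper are hypothetical names used only for this sketch; they are not part of transformers or sglang, and the real HF processor derives N from the actual media.

```python
# Hypothetical helper: mirrors the table above, not the HF processor's real code.
EXPANSIONS = {
    # modality: (begin marker, per-patch token, end marker)
    "image": ("<|image>", "<|image|>", "<image|>"),
    "video": ("<|image>", "<|video|>", "<image|>"),  # reuses image BOI/EOI
    "audio": ("<|audio>", "<|audio|>", "<audio|>"),
}


def expand_placeholder(modality: str, n_patches: int) -> str:
    """Return BOI + n_patches * patch_token + EOI for one multimodal item."""
    if n_patches < 1:
        raise ValueError("n_patches must be >= 1")
    boi, patch, eoi = EXPANSIONS[modality]
    return boi + patch * n_patches + eoi


print(expand_placeholder("image", 2))
# <|image><|image|><|image|><image|>
```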

With only the bare-token fallback, base_processor.legacy_load_mm_data counts N markers per modality (one per patch token) but receives only one data entry, then raises:

RuntimeError: An exception occurred while loading multimodal data:
 Mismatch: More 'IMAGE' tokens found than corresponding data provided.
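The count mismatch can be reproduced with the standard library alone. The sketch below assumes the fallback pattern is simply `re.escape(token_str)`, as described above; it does not call into sglang, and `image_data` is a stand-in list rather than a real payload.

```python
import re

token_str = "<|image|>"
# Assumed fallback: a pattern that matches only the bare placeholder.
fallback = re.compile(re.escape(token_str))

# HF-expanded prompt for one image: BOI + 256 patch tokens + EOI.
prompt = "<|image>" + token_str * 256 + "<image|>"
image_data = ["one-image"]  # stand-in for a single image payload

markers = fallback.findall(prompt)
print(len(markers), len(image_data))  # 256 1
```

Every patch token is counted as its own marker, so 256 markers are paired against one data entry.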

gemma3.py and qwen_vl.py already set image_token_regex to match the expanded block as a single marker; gemma4 was missed when this processor was added in #21952.

Minimal reproduction (no server / no model weights)

import asyncio
from types import SimpleNamespace
from transformers import AutoConfig, AutoProcessor
from sglang.srt.multimodal.processors.gemma4 import Gemma4SGLangProcessor

hf_cfg = AutoConfig.from_pretrained("google/gemma-4-E2B-it")
hf_proc = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
proc = Gemma4SGLangProcessor(
    hf_cfg,
    SimpleNamespace(mm_process_config={}, skip_tokenizer_init=False),
    hf_proc,
    "default",
)

# HF-expanded prompt for ONE image: BOI + 256 patches + EOI
prompt = "<|image>" + "<|image|>" * 256 + "<image|>"
print("regex matches:", len(proc.mm_tokens.combined_regex.findall(prompt)))
# Before this PR: 256 (one per patch token)  →  mismatch with len(image_data)=1
# After this PR: 1 (one per full expanded block)

Running this on the current main prints regex matches: 256. Feeding the prompt + 1 image into process_mm_data_async(...) then raises the RuntimeError above.

Modifications

In Gemma4SGLangProcessor.__init__, add image_token / video_token / audio_token string literals and matching *_token_regex patterns to MultimodalSpecialTokens(...):

self.mm_tokens = MultimodalSpecialTokens(
    image_token="<|image|>",
    image_token_id=hf_config.image_token_id,
    image_token_regex=re.compile(
        r"<\|image>(?:<\|image\|>)+<image\|>|<\|image\|>"
    ),
    video_token="<|video|>",
    video_token_id=hf_config.video_token_id,
    video_token_regex=re.compile(
        r"<\|image>(?:<\|video\|>)+<image\|>|<\|video\|>"
    ),
    audio_token="<|audio|>",
    audio_token_id=hf_config.audio_token_id,
    audio_token_regex=re.compile(
        r"<\|audio>(?:<\|audio\|>)+<audio\|>|<\|audio\|>"
    ),
).build(_processor)

Each regex:

First alternative matches a full HF-expanded block (BOI + ≥1 patches + EOI) as one marker.
Second alternative (|<|...|>) preserves backward compatibility for clients that already send a single bare placeholder.
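The two alternatives can be checked in isolation with plain `re`. The patterns below are copied from the snippet above; the prompts are illustrative, and the sketch does not exercise MultimodalSpecialTokens or how sglang combines the per-modality patterns.

```python
import re

# Patterns copied from the PR; prompts below are made up for illustration.
image_re = re.compile(r"<\|image>(?:<\|image\|>)+<image\|>|<\|image\|>")
video_re = re.compile(r"<\|image>(?:<\|video\|>)+<image\|>|<\|video\|>")
audio_re = re.compile(r"<\|audio>(?:<\|audio\|>)+<audio\|>|<\|audio\|>")

expanded = "<|image>" + "<|image|>" * 256 + "<image|>"
bare = "Describe <|image|> and <|image|>."
mixed = "<|image><|video|><|video|><image|> then <|audio><|audio|><audio|>"

print(len(image_re.findall(expanded)))  # 1: the whole block is one marker
print(len(image_re.findall(bare)))      # 2: bare placeholders still match
print(len(video_re.findall(mixed)), len(audio_re.findall(mixed)))  # 1 1
```

Because the video block reuses the image BOI/EOI, it is worth noting that `image_re` finds nothing in `mixed`: its first alternative requires at least one `<|image|>` patch token after the BOI.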

No changes outside __init__. Clients that already send single-token markers see no behavioral change, because the second alternative of each regex still matches a bare placeholder.

CI States

Latest PR Test (Base): ❌ Run #26795494947
Latest PR Test (Extra): ❌ Run #26795494883

gemini-code-assist · 2026-05-25T19:22:59Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

BiggieW · 2026-05-26T13:35:50Z

Hi maintainers, thanks for taking a look at this PR! It seems the CI checks are blocked because I don't yet have permission to add the run-ci label. Would someone mind helping trigger CI for this PR when you get a chance?

Really appreciate it!

kpham-sgl · 2026-05-26T19:55:17Z

/tag-and-rerun-ci

BiggieW · 2026-05-27T09:15:24Z

/rerun-failed-ci

BiggieW · 2026-05-27T12:24:02Z

/rerun-failed-ci

BiggieW · 2026-05-27T14:59:10Z

Hey @kpham-sgl, thanks for the review. I see that some tests are failing, but they're not related to my change:

AMD stage-b: Local docker registry pull failed (10.245.143.50:5000 timeout) + detokenizer Health check failed — runner infra issues.
NPU multimodal-gen: wan2_1_t2v_1.3b diffusion E2E perf regression (39s vs 36.7s threshold) on Ascend, mostly cold-start (TextEncoding 5s, denoise step 0 = 7.9s).
PR Test Extra gate: missing run-ci-extra label (I can't add it).

BiggieW · 2026-06-01T09:44:39Z

Hey @kpham-sgl, just following up on this PR. It was already approved earlier, and the remaining CI failures seem unrelated to my changes (infra issues / existing perf regression / missing run-ci-extra label).

I re-requested review just in case. If everything looks good on your side, would you be comfortable merging it when you get a chance?

Thanks!

fix(gemma4): add image_token_regex for pre-expanded prompts

149511b

BiggieW requested review from JustinTong0323, kpham-sgl, mickqian, yhyang201 and yuan-luo as code owners May 25, 2026 19:22

kpham-sgl approved these changes May 26, 2026

github-actions Bot added the run-ci label May 26, 2026

Merge branch 'main' into fix/gemma4-image-token-regex

3d3212f

hongboshi1234 mentioned this pull request May 30, 2026

[Bug] Gemma-4 mm: single non-RGB image crashes vision tower and kills scheduler (mat1 256 vs 768) #26751

Open

BiggieW requested a review from kpham-sgl June 1, 2026 09:42

kpham-sgl self-assigned this Jun 1, 2026

kpham-sgl added 2 commits June 1, 2026 08:41

Merge branch 'main' into fix/gemma4-image-token-regex

118219a

Merge branch 'main' into fix/gemma4-image-token-regex

2e056ff

kpham-sgl requested a review from pyc96 as a code owner June 2, 2026 02:56

pyc96 approved these changes Jun 2, 2026

BiggieW wants to merge 4 commits into sgl-project:main from BiggieW:fix/gemma4-image-token-regex

3 participants