
fix(gemma4): register image/video/audio token_regex for HF-expanded prompts #26320

Open
BiggieW wants to merge 4 commits into
sgl-project:mainfrom
BiggieW:fix/gemma4-image-token-regex

Conversation

@BiggieW

@BiggieW BiggieW commented May 25, 2026

Motivation

Gemma4SGLangProcessor registers only image_token_id/video_token_id/audio_token_id on MultimodalSpecialTokens. With no *_token_regex provided, parse_regex() falls back to re.escape(token_str), which only matches a single bare placeholder.
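
The fallback described above can be illustrated with the standard library alone. This is a minimal sketch that assumes the fallback is equivalent to compiling `re.escape` of the bare token string; it does not call SGLang's `parse_regex()`, and the variable names are illustrative only.

```python
import re

# Bare placeholder string, as quoted in this PR description.
IMAGE_TOKEN = "<|image|>"

# Assumed equivalent of the fallback: a literal match on the token string.
fallback = re.compile(re.escape(IMAGE_TOKEN))

# A client that sends a single bare placeholder.
bare_prompt = "Describe this picture: " + IMAGE_TOKEN

# A client that sends the HF-expanded form for one image (256 patches).
expanded_prompt = "<|image>" + IMAGE_TOKEN * 256 + "<image|>"

bare_count = len(fallback.findall(bare_prompt))
expanded_count = len(fallback.findall(expanded_prompt))

print("bare prompt markers:", bare_count)
print("expanded prompt markers:", expanded_count)
```

The literal pattern sees one marker in the bare prompt but one marker per patch token in the expanded prompt, which is the over-count this PR addresses.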

This breaks clients that send prompts in the format produced by transformers.models.gemma4.processing_gemma4, where each multimodal item is expanded into a BOI + N×patch_token + EOI block:

Modality   Expanded form (per HF Gemma 4 processor)
image      <|image> + N × <|image|> + <image|>
video      <|image> + N × <|video|> + <image|>   (reuses image BOI/EOI)
audio      <|audio> + N × <|audio|> + <audio|>
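
To make the expanded forms concrete, the sketch below builds them from the token strings in the table. `EXPANSION` and `expand_placeholder` are hypothetical names introduced for illustration; they are not part of transformers or SGLang, and the patch counts are arbitrary.

```python
# Hypothetical lookup: modality -> (begin token, patch token, end token),
# copied from the table in this PR description.
EXPANSION = {
    "image": ("<|image>", "<|image|>", "<image|>"),
    "video": ("<|image>", "<|video|>", "<image|>"),  # video reuses image BOI/EOI
    "audio": ("<|audio>", "<|audio|>", "<audio|>"),
}


def expand_placeholder(modality: str, num_patches: int) -> str:
    """Return BOI + num_patches patch tokens + EOI for one multimodal item."""
    if num_patches < 1:
        raise ValueError("num_patches must be at least 1")
    begin, patch, end = EXPANSION[modality]
    return begin + patch * num_patches + end


image_block = expand_placeholder("image", 3)
video_block = expand_placeholder("video", 2)
print(image_block)
print(video_block)
```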

With only the bare-token fallback, base_processor.legacy_load_mm_data counts N markers per modality (one per patch token) but receives only one data entry, then raises:

RuntimeError: An exception occurred while loading multimodal data:
 Mismatch: More 'IMAGE' tokens found than corresponding data provided.
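
The failure mode can be sketched as a simple count comparison. `check_marker_count` below is a hypothetical stand-in written for this illustration; it is not the code in `base_processor.legacy_load_mm_data`, and it only mimics the wording of the error.

```python
import re


def check_marker_count(prompt, pattern, data, modality="IMAGE"):
    """Hypothetical check: fail when the prompt has more markers than data entries."""
    markers = pattern.findall(prompt)
    if len(markers) > len(data):
        raise RuntimeError(
            f"Mismatch: More '{modality}' tokens found than corresponding data provided."
        )
    return len(markers)


# One HF-expanded image block, but only one data entry.
prompt = "<|image>" + "<|image|>" * 256 + "<image|>"
image_data = ["image-0"]  # placeholder standing in for real image data

# Literal bare-token pattern, assumed equivalent to the fallback.
bare_only = re.compile(re.escape("<|image|>"))

try:
    check_marker_count(prompt, bare_only, image_data)
    failed = False
    message = ""
except RuntimeError as exc:
    failed = True
    message = str(exc)

print("check failed:", failed)
print(message)
```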

gemma3.py and qwen_vl.py already set image_token_regex to match the expanded block as a single marker; gemma4 was missed when this processor was added in #21952.

Minimal reproduction (no server / no model weights)

import asyncio
from types import SimpleNamespace
from transformers import AutoConfig, AutoProcessor
from sglang.srt.multimodal.processors.gemma4 import Gemma4SGLangProcessor

hf_cfg = AutoConfig.from_pretrained("google/gemma-4-E2B-it")
hf_proc = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
proc = Gemma4SGLangProcessor(
    hf_cfg,
    SimpleNamespace(mm_process_config={}, skip_tokenizer_init=False),
    hf_proc,
    "default",
)

# HF-expanded prompt for ONE image: BOI + 256 patches + EOI
prompt = "<|image>" + "<|image|>" * 256 + "<image|>"
print("regex matches:", len(proc.mm_tokens.combined_regex.findall(prompt)))
# Before this PR: 256 (one per patch token)  →  mismatch with len(image_data)=1
# After this PR: 1 (one per full expanded block)

Running this on the current main prints regex matches: 256. Feeding the prompt + 1 image into process_mm_data_async(...) then raises the RuntimeError above.

Modifications

In Gemma4SGLangProcessor.__init__, add image_token / video_token / audio_token string literals and matching *_token_regex patterns to MultimodalSpecialTokens(...):

self.mm_tokens = MultimodalSpecialTokens(
    image_token="<|image|>",
    image_token_id=hf_config.image_token_id,
    image_token_regex=re.compile(
        r"<\|image>(?:<\|image\|>)+<image\|>|<\|image\|>"
    ),
    video_token="<|video|>",
    video_token_id=hf_config.video_token_id,
    video_token_regex=re.compile(
        r"<\|image>(?:<\|video\|>)+<image\|>|<\|video\|>"
    ),
    audio_token="<|audio|>",
    audio_token_id=hf_config.audio_token_id,
    audio_token_regex=re.compile(
        r"<\|audio>(?:<\|audio\|>)+<audio\|>|<\|audio\|>"
    ),
).build(_processor)

Each regex:

  • First alternative matches a full HF-expanded block (BOI + ≥1 patches + EOI) as one marker.
  • Second alternative (|<\|...\|>) preserves backward compatibility for clients that already send a single bare placeholder.

No changes outside __init__. No behavioral change for clients that already send single-token markers; the second alternative of each regex still matches them.
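
Both claims can be checked without SGLang, using only the three patterns from the snippet above. This is a sketch: the `count` helper is illustrative, and the patch counts (256, 64, 32) are arbitrary values, not what the HF processor would emit for any particular input.

```python
import re

# Patterns copied verbatim from the MultimodalSpecialTokens(...) snippet.
IMAGE_RE = re.compile(r"<\|image>(?:<\|image\|>)+<image\|>|<\|image\|>")
VIDEO_RE = re.compile(r"<\|image>(?:<\|video\|>)+<image\|>|<\|video\|>")
AUDIO_RE = re.compile(r"<\|audio>(?:<\|audio\|>)+<audio\|>|<\|audio\|>")


def count(pattern, prompt):
    """Number of non-overlapping markers the pattern finds in the prompt."""
    return len(pattern.findall(prompt))


# HF-expanded blocks (arbitrary patch counts for illustration).
image_block = "<|image>" + "<|image|>" * 256 + "<image|>"
video_block = "<|image>" + "<|video|>" * 64 + "<image|>"
audio_block = "<|audio>" + "<|audio|>" * 32 + "<audio|>"

# First alternative: each expanded block counts as one marker.
expanded_counts = (
    count(IMAGE_RE, image_block),
    count(VIDEO_RE, video_block),
    count(AUDIO_RE, audio_block),
)

# Second alternative: a single bare placeholder still counts as one marker.
bare_counts = (
    count(IMAGE_RE, "look: <|image|>"),
    count(VIDEO_RE, "watch: <|video|>"),
    count(AUDIO_RE, "listen: <|audio|>"),
)

# Two expanded image blocks in one prompt give two markers.
two_images = count(IMAGE_RE, image_block + " and " + image_block)

# The image pattern does not claim a video block, despite the shared BOI/EOI.
image_on_video = count(IMAGE_RE, video_block)

print(expanded_counts, bare_counts, two_images, image_on_video)
```

The last check matters because video reuses the image BOI/EOI tokens: the patch token inside the block is what keeps the two modalities apart.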


CI States

Latest PR Test (Base): ❌ Run #26795494947
Latest PR Test (Extra): ❌ Run #26795494883

@BiggieW
Author

BiggieW commented May 26, 2026

Hi maintainers, thanks for taking a look at this PR! It seems the CI checks are blocked because I don't yet have permission to add the run-ci label. Would someone mind helping trigger CI for this PR when you get a chance?

Really appreciate it!

@kpham-sgl
Collaborator

/tag-and-rerun-ci

@BiggieW
Author

BiggieW commented May 27, 2026

/rerun-failed-ci

@BiggieW
Author

BiggieW commented May 27, 2026

Hey @kpham-sgl, thanks for the review. I see that some tests are failing, but they're not related to my change:

  • AMD stage-b: Local docker registry pull failed (10.245.143.50:5000 timeout) + detokenizer Health check failed — runner infra issues.
  • NPU multimodal-gen: wan2_1_t2v_1.3b diffusion E2E perf regression (39s vs 36.7s threshold) on Ascend, mostly cold-start (TextEncoding 5s, denoise step 0 = 7.9s).
  • PR Test Extra gate: missing run-ci-extra label (I can't add it).

@BiggieW
Author

BiggieW commented Jun 1, 2026

Hey @kpham-sgl, just following up on this PR. It was already approved earlier, and the remaining CI failures seem unrelated to my changes (infra issues / existing perf regression / missing run-ci-extra label).

I re-requested review just in case. If everything looks good on your side, would you be comfortable merging it when you get a chance?

Thanks!

@kpham-sgl kpham-sgl self-assigned this Jun 1, 2026
@kpham-sgl kpham-sgl requested a review from pyc96 as a code owner June 2, 2026 02:56