fix(gemma4): register image/video/audio token_regex for HF-expanded prompts #26320
BiggieW wants to merge 4 commits into
Hi maintainers, thanks for taking a look at this PR! It seems the CI checks are blocked because I don't yet have permission to add the run-ci label. Would someone mind helping trigger CI for this PR when you get a chance? Really appreciate it! |
|
/tag-and-rerun-ci |
|
/rerun-failed-ci |
|
Hey @kpham-sgl, thanks for the review. I see that some tests are failing, but they're not related to my change:
|
|
Hey @kpham-sgl, just following up on this PR. It was already approved earlier, and the remaining CI failures seem unrelated to my changes (infra issues / existing perf regression / missing run-ci-extra label). I re-requested review just in case. If everything looks good on your side, would you be comfortable merging it when you get a chance? Thanks! |
Motivation
Gemma4SGLangProcessor registers only image_token_id / video_token_id / audio_token_id on MultimodalSpecialTokens. With no *_token_regex provided, parse_regex() falls back to re.escape(token_str), which only matches a single bare placeholder.

This breaks clients that send prompts in the format produced by transformers.models.gemma4.processing_gemma4, where each multimodal item is expanded into a BOI + N×patch_token + EOI block:

- <|image> + N × <|image|> + <image|>
- <|image> + N × <|video|> + <image|> (reuses image BOI/EOI)
- <|audio> + N × <|audio|> + <audio|>

With only the bare-token fallback, base_processor.legacy_load_mm_data counts N markers per modality (one per patch token) but receives only one data entry, then raises a RuntimeError.

gemma3.py and qwen_vl.py already set image_token_regex to match the expanded block as a single marker; gemma4 was missed when this processor was added in #21952.

Minimal reproduction (no server / no model weights)
Running this on the current main prints regex matches: 256. Feeding the prompt + 1 image into process_mm_data_async(...) then raises the RuntimeError above.
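For readers without the original snippet, the marker miscount can be illustrated with a standalone sketch. It uses only the standard re module and does not import SGLang; the helper name build_expanded_prompt and the choice N = 256 are assumptions made for illustration, and the fallback pattern simply mirrors the re.escape(token_str) behavior described in Motivation.

```python
import re

# Token strings taken from the Motivation section.
IMAGE_BOI = "<|image>"
IMAGE_PATCH = "<|image|>"
IMAGE_EOI = "<image|>"

# Assumed patch count for one image; chosen to match the 256 reported above.
N = 256


def build_expanded_prompt(n: int) -> str:
    """Hypothetical helper: build a prompt with one HF-style expanded image block."""
    return f"Describe this picture: {IMAGE_BOI}{IMAGE_PATCH * n}{IMAGE_EOI}"


# With no image_token_regex, the fallback is the escaped bare placeholder.
fallback_regex = re.compile(re.escape(IMAGE_PATCH))

prompt = build_expanded_prompt(N)
matches = fallback_regex.findall(prompt)

# One marker is found per patch token, although only one image is attached.
print("regex matches:", len(matches))
```

The BOI and EOI strings are not counted because neither contains the full <|image|> placeholder, so the count equals the number of patch tokens.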
Modifications
In Gemma4SGLangProcessor.__init__, add image_token / video_token / audio_token string literals and matching *_token_regex patterns to MultimodalSpecialTokens(...):
Each regex:
No changes outside __init__. No behavioral change for clients that already send single-token markers (verified in the second alternative of each regex).
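A minimal sketch of what such two-alternative patterns could look like, assuming the first alternative matches a whole expanded block and the second matches a bare placeholder. The helper expanded_or_bare and the constant names are hypothetical, and the patterns in the actual diff may differ.

```python
import re


def expanded_or_bare(boi: str, patch: str, eoi: str) -> re.Pattern:
    """Hypothetical helper: match BOI + one or more patch tokens + EOI as a
    single marker, or else a single bare patch token."""
    esc = re.escape
    return re.compile(f"{esc(boi)}(?:{esc(patch)})+{esc(eoi)}|{esc(patch)}")


# Token strings from the Motivation section; video reuses the image BOI/EOI.
IMAGE_RE = expanded_or_bare("<|image>", "<|image|>", "<image|>")
VIDEO_RE = expanded_or_bare("<|image>", "<|video|>", "<image|>")
AUDIO_RE = expanded_or_bare("<|audio>", "<|audio|>", "<audio|>")

expanded_image = "<|image>" + "<|image|>" * 256 + "<image|>"
expanded_video = "<|image>" + "<|video|>" * 64 + "<image|>"

mixed_prompt = f"Compare {expanded_image} with {expanded_video}."
bare_prompt = "Describe <|image|> briefly."

# Each expanded block counts as one marker for its own modality only.
image_hits = IMAGE_RE.findall(mixed_prompt)
video_hits = VIDEO_RE.findall(mixed_prompt)

# A client that sends a single bare placeholder still gets one marker.
bare_hits = IMAGE_RE.findall(bare_prompt)

print(len(image_hits), len(video_hits), len(bare_hits))
```

Putting the expanded-block alternative first matters: at the BOI position the regex consumes the whole block in one match, so the bare alternative never gets a chance to count individual patch tokens inside it.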
CI States
Latest PR Test (Base): ❌ Run #26795494947
Latest PR Test (Extra): ❌ Run #26795494883