
feat: add Gemma 4 multimodal model support #268

Merged
waybarrios merged 4 commits into waybarrios:main from jackneil:pr/gemma4-model-support
Apr 10, 2026

Conversation

@jackneil (Contributor) commented Apr 9, 2026

Summary

  • Adds Gemma 4 multimodal model support (loading, inference, vision)
  • Fixes BatchKVCache for Gemma 4's RotatingKVCache architecture
  • Adds Gemma 4 reasoning parser for channel-based thinking
  • Fixes RotatingKVCache isinstance check in MLLM batch generator
  • Fixes missing return statement in tokenizer load_model_with_fallback
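
The channel-based thinking extraction can be sketched roughly as follows; the regex, function name, and return shape here are illustrative stand-ins, not the PR's actual gemma4 parser:

```python
import re

# Matches a thought block between the two transition formats mentioned
# in the commit messages: <|channel>thought ... <channel|>
THINK_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)

def split_reasoning(text):
    """Return (thinking, response) from channel-marked model output."""
    thinking = "".join(m.strip() for m in THINK_RE.findall(text))
    # Drop the thought block(s), then strip the response channel prefix.
    response = THINK_RE.sub("", text)
    response = response.replace("<|channel>response", "").strip()
    return thinking, response
```

This is a non-streaming sketch; the PR's parser also handles these markers incrementally across streamed chunks.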

Test plan

  • Model loads and generates text
  • Multimodal (image) input works
  • Reasoning parser extracts thinking blocks

🤖 Generated with Claude Code

yiheng chen and others added 3 commits April 9, 2026 14:25
Add detection and inference support for Google's Gemma 4 models
(e.g. mlx-community/gemma-4-e2b-it-mxfp4) which include vision
and audio capabilities via mlx-vlm >= 0.4.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Patch gemma4 Attention to snapshot cache.offset before mutation
  (mx.array.__iadd__ is in-place, causes wrong RoPE positions)
- Add Gemma 4 reasoning parser with channel name stripping
  (strips "thought"/"response" prefixes, supports both <channel|>
  and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…okenizer

- Accept RotatingKVCache (used by Gemma 4) in batch cache validation
- Add missing return statement in load_model_with_fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@waybarrios (Owner)

@Thump604 @janhilgard I am planning to review this PR carefully. Any insight so far?

@janhilgard (Collaborator)

We've been running essentially this exact code in production on gemma-4-26b-a4b-it (mixed 6/8-bit, port 1236) for about a week now. A few notes:

What works well:

  • BatchKVCache offset patch (gemma4_mllm.py) — this was the critical fix. Without it, Gemma 4 produces repetition (TheTheThe...) after ~3 tokens in continuous batching due to mx.array.__iadd__ mutating the offset in-place. The defensive off + 0 copy is the right approach.
  • Reasoning parser handles <|channel>thought...<channel|> cleanly in both streaming and non-streaming
  • generation_config.json EOS token reading is important — Gemma 4 has <turn|> (106) and <|tool_response> (50) as additional EOS tokens that weren't being picked up before
  • RotatingKVCache trim fix prevents shape mismatch on long prompts
  • Error handling in _next() prevents infinite retry loops on bad requests
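
The `off + 0` snapshot pattern can be demonstrated with NumPy, whose arrays share mx.array's in-place `+=` semantics; the cache class below is an illustrative stand-in, not the PR's BatchKVCache:

```python
import numpy as np

class FakeCache:
    """Stand-in for a KV cache whose .offset is a shared mutable array."""
    def __init__(self):
        self.offset = np.array([0])

def rope_positions(cache, n_new):
    # Snapshot first: `off + 0` materializes a fresh array, so the in-place
    # `cache.offset += n_new` below cannot retroactively shift `positions`
    # (array.__iadd__ mutates the shared buffer rather than rebinding it).
    off = cache.offset + 0
    positions = off + np.arange(n_new)
    cache.offset += n_new
    return positions
```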

One thing to watch: The mlx-vlm>=0.4.3 bump is necessary (Gemma 4 model type was added in 0.4.3). No compatibility issues observed with existing models.

This is production-tested and ready from our side. +1 for merge.

@waybarrios (Owner) commented Apr 10, 2026

@janhilgard could you paste the benchmark results you have so far on your hardware? I am curious about tokens/second, time to first token, and so on.

@janhilgard (Collaborator)

@waybarrios Here are benchmark results from Apple M3 Ultra (256 GB unified memory), running gemma-4-26b-a4b-it in mixed 6/8-bit quantization (~25 GB footprint), with --continuous-batching --max-num-seqs 16 --kv-cache-quantization --kv-cache-quantization-bits 8:

Single request (streaming)

| Test | Completion tokens | TTFT | Generation | Total |
|---|---|---|---|---|
| Short prompt (~12 tok) | 68 | 689 ms | 85.9 tok/s | 1.48 s |
| Medium prompt (~50 tok) | 977 | 776 ms | 79.2 tok/s | 13.1 s |
| Reasoning (step-by-step) | 409 | 753 ms | 68.3 tok/s | 6.74 s |

Concurrent requests (2 simultaneous, continuous batching)

| Request | Tokens | TTFT | Per-request | Total |
|---|---|---|---|---|
| A | 284 | 807 ms | 73.1 tok/s | 4.69 s |
| B | 69 | 807 ms | 24.1 tok/s | 3.66 s |
| Aggregate | 353 | n/a | 75.2 tok/s | 4.69 s wall |

Summary

  • TTFT: 689–807 ms (MoE routing + prefill for ~4B active params)
  • Single-stream generation: 68–86 tok/s depending on context length
  • Concurrent batching: scales well, aggregate throughput maintained
  • Memory: ~25 GB active (mixed 6/8-bit), no OOM issues
  • No repetition bug — the BatchKVCache offset patch is working correctly, generation is coherent across all tests
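
Numbers of this shape can be reproduced against any streaming token source with a small harness along these lines; the generator here is a stub, and this is not the script used for the tables above:

```python
import time

def benchmark(stream):
    """Measure TTFT and generation throughput over a token iterator."""
    start = time.perf_counter()
    ttft = None
    n = 0
    for _tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n += 1
    total = time.perf_counter() - start
    # Generation rate excludes the prefill window before the first token.
    tps = (n - 1) / (total - ttft) if n > 1 and total > ttft else float("nan")
    return {"ttft_s": ttft, "tokens": n, "tok_per_s": tps, "total_s": total}
```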

@Thump604 (Collaborator)

My read: this PR is bundling several pieces that have already proven out independently on our side, and I do not see a blocker in the combined shape.

The load-bearing parts are:

  • gemma4_mllm.py offset snapshot (off + 0) for BatchKVCache / continuous batching
  • Gemma 4 parser registration (reasoning_parser=gemma4, tool_call_parser=gemma4)
  • additional EOS tokens from generation_config.json (<turn|>, <|tool_response>)
  • RotatingKVCache handling on the MLLM path
  • the tokenizer happy-path return fix in load_model_with_fallback()
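
The generation_config.json read amounts to something like this (Hugging Face-style `eos_token_id` field, which may be a single int or a list; the path and helper name are illustrative):

```python
import json

def load_extra_eos(path):
    """Read eos_token_id from a model's generation_config.json.
    Normalizes the HF-style field, which may be an int or a list of ints."""
    with open(path) as f:
        cfg = json.load(f)
    eos = cfg.get("eos_token_id", [])
    return eos if isinstance(eos, list) else [eos]
```

For Gemma 4 this is what surfaces the additional stop ids (106 and 50) discussed above, on top of the tokenizer's default EOS.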

That last one is the same bug already covered by #243 / #215, and the RotatingKVCache guard is the same issue keegoid flagged on #256 that I split into follow-up #273. So from a review perspective this is more of an integration bundle than a brand-new feature.

If you want the lowest-risk path, the pieces above split cleanly into separate PRs along those same lines.

If you prefer the integrated route, this PR looks technically sound to me and lines up with what we've been running locally.

Error responses with token=0 were falling through to the detokenizer and decoding garbage text. Now they skip decoding and set the request status to FINISHED_ABORTED. Added a test for this case. Also ran black on batched.py to fix CI.
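
That guard amounts to something like the following; the request shape, status constant, and function name are illustrative, not the project's exact API:

```python
FINISHED_ABORTED = "finished_aborted"

def handle_step(request, token, is_error, detokenize):
    """Decode one generation step, skipping decode entirely on error."""
    if is_error:
        # The error placeholder token (0) previously fell through to the
        # detokenizer and produced garbage text; abort the request instead.
        request["status"] = FINISHED_ABORTED
        return ""
    return detokenize(token)
```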