
feat: add Gemma 4 multimodal model support #268

Merged
waybarrios merged 4 commits into waybarrios:main from jackneil:pr/gemma4-model-support
Apr 10, 2026

Conversation

@jackneil (Contributor) commented Apr 9, 2026

Summary

  • Adds Gemma 4 multimodal model support (loading, inference, vision)
  • Fixes BatchKVCache for Gemma 4's RotatingKVCache architecture
  • Adds Gemma 4 reasoning parser for channel-based thinking
  • Fixes RotatingKVCache isinstance check in MLLM batch generator
  • Fixes missing return statement in tokenizer load_model_with_fallback
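
The channel-based thinking extraction can be sketched roughly as follows; the regex, function name, and return shape here are illustrative stand-ins, not the PR's actual gemma4 parser:

```python
import re

# Matches a thought block between the two transition formats mentioned
# in the commit messages: <|channel>thought ... <channel|>
THINK_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)

def split_reasoning(text):
    """Return (thinking, response) from channel-marked model output."""
    thinking = "".join(m.strip() for m in THINK_RE.findall(text))
    # Drop the thought block(s), then strip the response channel prefix.
    response = THINK_RE.sub("", text)
    response = response.replace("<|channel>response", "").strip()
    return thinking, response
```

This is a non-streaming sketch; the PR's parser also handles these markers incrementally across streamed chunks.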

Test plan

  • Model loads and generates text
  • Multimodal (image) input works
  • Reasoning parser extracts thinking blocks

🤖 Generated with Claude Code

yiheng chen and others added 3 commits April 9, 2026 14:25
Add detection and inference support for Google's Gemma 4 models
(e.g. mlx-community/gemma-4-e2b-it-mxfp4) which include vision
and audio capabilities via mlx-vlm >= 0.4.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Patch gemma4 Attention to snapshot cache.offset before mutation
  (mx.array.__iadd__ is in-place, causes wrong RoPE positions)
- Add Gemma 4 reasoning parser with channel name stripping
  (strips "thought"/"response" prefixes, supports both <channel|>
  and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…okenizer

- Accept RotatingKVCache (used by Gemma 4) in batch cache validation
- Add missing return statement in load_model_with_fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@waybarrios (Owner)

@Thump604 @janhilgard I am planning to review this PR carefully. Any insight so far?

@janhilgard (Collaborator)

We've been running essentially this exact code in production on gemma-4-26b-a4b-it (mixed 6/8-bit, port 1236) for about a week now. A few notes:

What works well:

  • BatchKVCache offset patch (gemma4_mllm.py) — this was the critical fix. Without it, Gemma 4 produces repetition (TheTheThe...) after ~3 tokens in continuous batching due to mx.array.__iadd__ mutating the offset in-place. The defensive off + 0 copy is the right approach.
  • Reasoning parser handles <|channel>thought...<channel|> cleanly in both streaming and non-streaming
  • generation_config.json EOS token reading is important — Gemma 4 has <turn|> (106) and <|tool_response> (50) as additional EOS tokens that weren't being picked up before
  • RotatingKVCache trim fix prevents shape mismatch on long prompts
  • Error handling in _next() prevents infinite retry loops on bad requests
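
The `off + 0` snapshot pattern can be demonstrated with NumPy, whose arrays share mx.array's in-place `+=` semantics; the cache class below is an illustrative stand-in, not the PR's BatchKVCache:

```python
import numpy as np

class FakeCache:
    """Stand-in for a KV cache whose .offset is a shared mutable array."""
    def __init__(self):
        self.offset = np.array([0])

def rope_positions(cache, n_new):
    # Snapshot first: `off + 0` materializes a fresh array, so the in-place
    # `cache.offset += n_new` below cannot retroactively shift `positions`
    # (array.__iadd__ mutates the shared buffer rather than rebinding it).
    off = cache.offset + 0
    positions = off + np.arange(n_new)
    cache.offset += n_new
    return positions
```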

One thing to watch: The mlx-vlm>=0.4.3 bump is necessary (Gemma 4 model type was added in 0.4.3). No compatibility issues observed with existing models.

This is production-tested and ready from our side. +1 for merge.

@waybarrios (Owner) commented Apr 10, 2026

@janhilgard could you paste the benchmark results you have so far on your hardware? I am curious about tokens/second, time to first token, and so on.

@janhilgard (Collaborator)

@waybarrios Here are benchmark results from Apple M3 Ultra (256 GB unified memory), running gemma-4-26b-a4b-it in mixed 6/8-bit quantization (~25 GB footprint), with --continuous-batching --max-num-seqs 16 --kv-cache-quantization --kv-cache-quantization-bits 8:

Single request (streaming)

| Test | Completion tokens | TTFT | Generation | Total |
|---|---|---|---|---|
| Short prompt (~12 tok) | 68 | 689 ms | 85.9 tok/s | 1.48 s |
| Medium prompt (~50 tok) | 977 | 776 ms | 79.2 tok/s | 13.1 s |
| Reasoning (step-by-step) | 409 | 753 ms | 68.3 tok/s | 6.74 s |

Concurrent requests (2 simultaneous, continuous batching)

| Request | Tokens | TTFT | Per-request | Total |
|---|---|---|---|---|
| A | 284 | 807 ms | 73.1 tok/s | 4.69 s |
| B | 69 | 807 ms | 24.1 tok/s | 3.66 s |
| Aggregate | 353 | n/a | 75.2 tok/s | 4.69 s wall |

Summary

  • TTFT: 689–807 ms (MoE routing + prefill for ~4B active params)
  • Single-stream generation: 68–86 tok/s depending on context length
  • Concurrent batching: scales well, aggregate throughput maintained
  • Memory: ~25 GB active (mixed 6/8-bit), no OOM issues
  • No repetition bug — the BatchKVCache offset patch is working correctly, generation is coherent across all tests
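
Numbers of this shape can be reproduced against any streaming token source with a small harness along these lines; the generator here is a stub, and this is not the script used for the tables above:

```python
import time

def benchmark(stream):
    """Measure TTFT and generation throughput over a token iterator."""
    start = time.perf_counter()
    ttft = None
    n = 0
    for _tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n += 1
    total = time.perf_counter() - start
    # Generation rate excludes the prefill window before the first token.
    tps = (n - 1) / (total - ttft) if n > 1 and total > ttft else float("nan")
    return {"ttft_s": ttft, "tokens": n, "tok_per_s": tps, "total_s": total}
```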

@Thump604 (Collaborator)

My read: this PR is bundling several pieces that have already proven out independently on our side, and I do not see a blocker in the combined shape.

The load-bearing parts are:

  • gemma4_mllm.py offset snapshot (off + 0) for BatchKVCache / continuous batching
  • Gemma 4 parser registration (reasoning_parser=gemma4, tool_call_parser=gemma4)
  • additional EOS tokens from generation_config.json (<turn|>, <|tool_response>)
  • RotatingKVCache handling on the MLLM path
  • the tokenizer happy-path return fix in load_model_with_fallback()
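
The generation_config.json read amounts to something like this (Hugging Face-style `eos_token_id` field, which may be a single int or a list; the path and helper name are illustrative):

```python
import json

def load_extra_eos(path):
    """Read eos_token_id from a model's generation_config.json.
    Normalizes the HF-style field, which may be an int or a list of ints."""
    with open(path) as f:
        cfg = json.load(f)
    eos = cfg.get("eos_token_id", [])
    return eos if isinstance(eos, list) else [eos]
```

For Gemma 4 this is what surfaces the additional stop ids (106 and 50) discussed above, on top of the tokenizer's default EOS.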

That last one is the same bug already covered by #243 / #215, and the RotatingKVCache guard is the same issue keegoid flagged on #256 that I split into follow-up #273. So from a review perspective this is more of an integration bundle than a brand-new feature.

If you want the lowest-risk path, the pieces above split cleanly into separate PRs along those same lines.

If you prefer the integrated route, this PR looks technically sound to me and lines up with what we've been running locally.

Error responses with token=0 were falling through to the detokenizer and decoding garbage text. Now they skip decoding and set the request status to FINISHED_ABORTED. Added a test for this case. Also ran black on batched.py to fix CI.
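
That guard amounts to something like the following; the request shape, status constant, and function name are illustrative, not the project's exact API:

```python
FINISHED_ABORTED = "finished_aborted"

def handle_step(request, token, is_error, detokenize):
    """Decode one generation step, skipping decode entirely on error."""
    if is_error:
        # The error placeholder token (0) previously fell through to the
        # detokenizer and produced garbage text; abort the request instead.
        request["status"] = FINISHED_ABORTED
        return ""
    return detokenize(token)
```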