feat: add Gemma 4 multimodal model support #268
Conversation
Add detection and inference support for Google's Gemma 4 models (e.g. mlx-community/gemma-4-e2b-it-mxfp4), which include vision and audio capabilities via mlx-vlm >= 0.4.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
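A minimal sketch of what config-based detection might look like; the helper name `is_gemma4` and the exact config fields checked are assumptions for illustration, not the PR's actual code:

```python
import json
from pathlib import Path

def is_gemma4(model_path: str) -> bool:
    """Heuristic: decide whether a local checkpoint is a Gemma 4 model."""
    config = json.loads((Path(model_path) / "config.json").read_text())
    # Multimodal checkpoints often nest the text model's type under
    # "text_config" (an assumption here, following common HF layouts).
    model_type = config.get("model_type") or config.get("text_config", {}).get(
        "model_type", ""
    )
    return model_type.startswith("gemma4")
```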
- Patch gemma4 Attention to snapshot cache.offset before mutation (mx.array.__iadd__ is in-place and causes wrong RoPE positions; see the sketch below)
- Add Gemma 4 reasoning parser with channel name stripping (strips "thought"/"response" prefixes, supports both <channel|> and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
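For context, a minimal sketch of the aliasing problem behind the first bullet, assuming (per the commit message) that mx.array.__iadd__ mutates the array in place; the actual gemma4 attention code is not reproduced here:

```python
import mlx.core as mx

# Per the commit message, `+=` on an mx.array mutates it in place, so a bare
# reference taken before the cache update is an alias, not a snapshot.
offset = mx.array(0)
alias = offset             # alias of the same underlying array
offset += 4                # in-place mutation via mx.array.__iadd__
# alias now reads 4: RoPE computed from `alias` would use the
# post-update position instead of the pre-update one.

# The fix: snapshot the offset as a plain Python int before the KV cache
# mutates it, then hand the snapshot to RoPE.
offset = mx.array(0)
snapshot = int(offset)     # detached copy of the pre-update position
offset += 4
# snapshot is still 0 — the correct RoPE position for the new tokens.
```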
…okenizer
- Accept RotatingKVCache (used by Gemma 4) in batch cache validation (sketched below)
- Add missing return statement in load_model_with_fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
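A sketch of the shape of both fixes; `validate_batch_cache` and the loader arguments are illustrative names rather than the PR's actual code, and the cache classes are assumed to be the standard mlx_lm ones:

```python
from mlx_lm.models.cache import KVCache, RotatingKVCache

def validate_batch_cache(cache_layers):
    """Reject cache types the batching engine cannot handle."""
    for layer_cache in cache_layers:
        # Gemma 4's sliding-window layers use RotatingKVCache, which a
        # KVCache-only isinstance check would reject.
        if not isinstance(layer_cache, (KVCache, RotatingKVCache)):
            raise TypeError(
                f"unsupported cache type for batching: {type(layer_cache).__name__}"
            )

def load_model_with_fallback(path, primary_loader, fallback_loader):
    # Illustrative shape of the second fix: the fallback branch built a model
    # but never returned it, so callers received None.
    try:
        return primary_loader(path)
    except Exception:
        return fallback_loader(path)  # the missing `return` was here
```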
@Thump604 @janhilgard I am planning to review this PR carefully. Any insight so far?
We've been running essentially this exact code in production on […].

What works well:
One thing to watch: the […].

This is production-tested and ready from our side. +1 for merge.
@janhilgard could you paste here the benchmark results you got so far on your hardware? I am curious about tokens/second, time to first token, and so on.
@waybarrios Here are benchmark results from an Apple M3 Ultra (256 GB unified memory), running […].

Single request (streaming)
Concurrent requests (2 simultaneous, continuous batching)
Summary
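For anyone reproducing these numbers, a minimal sketch of how time-to-first-token and decode throughput can be probed against an OpenAI-compatible streaming endpoint; the URL, port, prompt, and the one-token-per-SSE-chunk assumption are placeholders, not part of this PR:

```python
import json, time, urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
payload = {
    "model": "mlx-community/gemma-4-e2b-it-mxfp4",
    "messages": [{"role": "user", "content": "Explain KV caches briefly."}],
    "stream": True,
    "max_tokens": 256,
}
req = urllib.request.Request(
    URL, json.dumps(payload).encode(), {"Content-Type": "application/json"}
)

start = time.perf_counter()
first_token_at, chunks = None, 0
with urllib.request.urlopen(req) as resp:
    for raw in resp:  # SSE lines arrive as the server streams them
        line = raw.decode().strip()
        if not line.startswith("data:") or "[DONE]" in line:
            continue
        delta = json.loads(line[5:])["choices"][0].get("delta", {})
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1

elapsed = time.perf_counter() - start
if first_token_at is None:
    print("no tokens received")
else:
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.3f}s")
    # One SSE chunk typically carries one token, so chunk rate
    # approximates decode tokens/second.
    print(f"decode rate: {chunks / max(elapsed - ttft, 1e-9):.1f} chunks/s")
```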
My read: this PR is bundling several pieces that have already proven out independently on our side, and I do not see a blocker in the combined shape. The load-bearing parts are:

[…]
That last one is the same bug already covered by #243 / #215, and the RotatingKVCache guard is the same issue keegoid flagged on #256 that I split into follow-up #273. So from a review perspective this is more of an integration bundle than a brand-new feature. If you want the lowest-risk path, the pieces split cleanly as:

[…]
If you prefer the integrated route, this PR looks technically sound to me and lines up with what we've been running locally.
Error responses with token=0 were falling through to the detokenizer and decoding garbage text. Now they skip decoding and set the request status to FINISHED_ABORTED. Added a test for this case. Also ran black on batched.py to fix CI.
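A sketch of the control flow this commit describes; Request, RequestStatus, and handle_token are illustrative stand-ins for the real batched.py structures, with FINISHED_ABORTED taken from the commit message:

```python
from dataclasses import dataclass
from enum import Enum, auto

class RequestStatus(Enum):
    RUNNING = auto()
    FINISHED_ABORTED = auto()  # status named in the commit message

@dataclass
class Request:
    status: RequestStatus = RequestStatus.RUNNING

def handle_token(request, token, is_error, detokenizer):
    if is_error and token == 0:
        # Error responses used to fall through to the detokenizer and decode
        # garbage text; now they skip decoding entirely.
        request.status = RequestStatus.FINISHED_ABORTED
        return
    detokenizer.add_token(token)  # normal path: decode as usual
```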
feat: add Gemma 4 multimodal model support
Summary
Test plan
🤖 Generated with Claude Code