Skip to content

Merge: fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal#7

Merged
lubauss merged 1 commit intomainfrom
patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m
Jan 20, 2026
Merged

Merge: fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal#7
lubauss merged 1 commit intomainfrom
patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m

Conversation

@lubauss
Copy link
Copy Markdown
Owner

@lubauss lubauss commented Jan 20, 2026

Merging local patches to main

Synced from local patches in .venv-vllm-mlx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lubauss lubauss merged commit 42563fe into main Jan 20, 2026
@lubauss lubauss deleted the patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m branch January 20, 2026 01:09
lubauss pushed a commit that referenced this pull request Jan 20, 2026
lubauss added a commit that referenced this pull request Jan 20, 2026
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
lubauss added a commit that referenced this pull request Jan 20, 2026
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
lubauss added a commit that referenced this pull request Jan 20, 2026
* Fix --api-key argument for serve command (fixes #7)

* Document --api-key, --rate-limit and --timeout options in CLI reference

* fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2)

* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* fix: disable skip_prompt_processing for multimodal to prevent garbled output

For MLLM with images, skip_prompt_processing cannot be used because:
- Vision encoder must run each time to provide visual context
- The skip path only calls language_model() which has no vision
- Using it produces garbled output like 'TheTheTheThe...'

Text-only caching still works with 6x+ speedup.
Multimodal correctly gets no speedup but produces coherent output.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant