Merge: fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal by lubauss · Pull Request #7 · lubauss/vllm-mlx

lubauss · 2026-01-20T01:09:09Z

Merging local patches to main

Synced from local patches in .venv-vllm-mlx Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* Fix --api-key argument for serve command (fixes #7) * Document --api-key, --rate-limit and --timeout options in CLI reference * fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2) * fix: Enable vision and streaming for MLLM models This patch fixes two critical issues with multimodal language models (MLLM): ## Vision Fix (server.py, simple.py) - Preserve original messages when calling MLLM models - The engine was passing only the prompt string, losing image data - Now passes full message objects with images to MLLM.chat() ## Streaming Fix (mllm.py, simple.py) - Add stream_chat() method to MLLMMultimodalLM class - Uses mlx_vlm.stream_generate() for true token-by-token streaming - Update engine to call stream_chat() for MLLM models - Properly yields GenerationOutput with new_text for SSE streaming Tested with: - mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit - Text streaming: 5 tokens streamed correctly - Vision streaming: Image analysis works with streaming Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat: Add Gemma 3 to MLLM detection patterns Gemma 3 models are multimodal but weren't being detected as VLMs. This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx correctly loads them with vision support via mlx-vlm. Tested with mlx-community/gemma-3-27b-it-4bit: - Vision: ✅ Working (cat, Kali, Ganesha images) - Streaming: ✅ Working (40 chunks) - Long context: ✅ Up to ~5K tokens Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * docs: Add Gemma 3 support section with long context patch instructions - Document Gemma 3 MLLM detection (already patched in utils.py) - Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var - Include benchmark results showing 5x improvement (10K → 50K tokens) - Explain Metal GPU timeout limitation and workaround --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> * fix: disable skip_prompt_processing for multimodal to prevent garbled output For MLLM with images, skip_prompt_processing cannot be used because: - Vision encoder must run each time to provide visual context - The skip path only calls language_model() which has no vision - Using it produces garbled output like 'TheTheTheThe...' Text-only caching still works with 6x+ speedup. Multimodal correctly gets no speedup but produces coherent output. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Wayner Barrios <waybarrios@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal

3c3347e

Synced from local patches in .venv-vllm-mlx Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

lubauss merged commit 42563fe into main Jan 20, 2026

lubauss deleted the patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m branch January 20, 2026 01:09

lubauss pushed a commit that referenced this pull request Jan 20, 2026

Fix --api-key argument for serve command (fixes #7)

92aa655

lubauss added a commit that referenced this pull request Jan 20, 2026

fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal (#7)

3bd175c

Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

lubauss added a commit that referenced this pull request Jan 20, 2026

fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal (#7)

069e91c

Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

lubauss pushed a commit that referenced this pull request Jan 20, 2026

Fix --api-key argument for serve command (fixes #7)

a3dece0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge: fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal#7

Merge: fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal#7
lubauss merged 1 commit intomainfrom
patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m

lubauss commented Jan 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lubauss commented Jan 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant