fix: Enable vision and streaming for MLLM models + Gemma 3 support #2

Merged

lubauss merged 3 commits into main from fix/mllm-vision-and-streaming on Jan 19, 2026

fix: Enable vision and streaming for MLLM models + Gemma 3 support#2
lubauss merged 3 commits intomainfrom
fix/mllm-vision-and-streaming

Conversation

@lubauss (Owner) commented on Jan 19, 2026

Summary

  • Enable vision and streaming for MLLM models
  • Add Gemma 3 to MLLM detection patterns
  • Add documentation for Gemma 3 long context patch

🤖 Generated with Claude Code

lubauss and others added 3 commits on January 19, 2026 at 08:28

* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLMs):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat() (see the sketch below)
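
A minimal sketch of the fix's shape (names like `request.messages` and the engine wiring here are illustrative; the real change lives in server.py and simple.py):

```python
# Keep the structured messages alongside the rendered prompt so the image
# parts survive all the way to the multimodal model. Assumes vllm-mlx's
# MLLMMultimodalLM class is importable in this scope.
def run_request(model, tokenizer, request):
    prompt = tokenizer.apply_chat_template(
        request.messages, add_generation_prompt=True, tokenize=False
    )
    if isinstance(model, MLLMMultimodalLM):
        # Before the fix the engine passed only `prompt`, a flat string
        # that had already dropped every image attachment.
        return model.chat(request.messages)  # full messages, images intact
    return model.generate(prompt)
```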

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming (sketched below)
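
A hedged sketch of what `stream_chat()` looks like, per the commit description; the exact `mlx_vlm.stream_generate` signature and the fields on its yielded chunks are assumptions to verify against your mlx-vlm version:

```python
from mlx_vlm.utils import stream_generate  # import path is an assumption

class MLLMMultimodalLM:
    # ... model/processor loading elided ...

    def stream_chat(self, messages, images, max_tokens=512):
        """Yield token-by-token output for SSE, per this PR's description."""
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )
        for chunk in stream_generate(
            self.model, self.processor, prompt, images, max_tokens=max_tokens
        ):
            # Assumes each chunk exposes the newly decoded text; the engine
            # wraps it in GenerationOutput so the API layer can emit one
            # SSE delta per chunk.
            yield GenerationOutput(new_text=chunk.text)
```

On the wire, each `new_text` then becomes one `data:` chunk in the OpenAI-compatible streaming format.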

Tested with mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit:
- Text streaming: 5 tokens streamed correctly
- Vision streaming: image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.
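
The pattern change itself is small. A sketch of the utils.py list (the pre-existing entries shown here are illustrative, not the repo's exact list, and the helper name is hypothetical):

```python
MLLM_PATTERNS = [
    "llava", "qwen2-vl", "qwen3-vl", "pixtral",  # illustrative existing entries
    "gemma-3", "gemma3",                         # added by this commit
]

def is_mllm(model_id: str) -> bool:
    # Detection is a case-insensitive substring match on the model id.
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)
```

With this, `is_mllm("mlx-community/gemma-3-27b-it-4bit")` returns True and the model is loaded through mlx-vlm with vision support.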

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var (sketched below)
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
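
A sketch of the documented patch's effect; where exactly mlx-vlm reads this value is an assumption here, so follow the repo docs for the precise edit:

```python
import os

def gemma3_sliding_window(config_default: int) -> int:
    # e.g. export GEMMA3_SLIDING_WINDOW=4096 before starting the server;
    # falls back to the model config's default when the variable is unset.
    return int(os.environ.get("GEMMA3_SLIDING_WINDOW", config_default))
```
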
lubauss merged commit 7472185 into main on Jan 19, 2026
lubauss deleted the fix/mllm-vision-and-streaming branch on Jan 19, 2026 at 19:48
lubauss added commits that referenced this pull request on Jan 20, 2026
* Fix --api-key argument for serve command (fixes #7)

* Document --api-key, --rate-limit and --timeout options in CLI reference

* fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2)


* fix: disable skip_prompt_processing for multimodal to prevent garbled output

For MLLMs with images, skip_prompt_processing cannot be used because:
- Vision encoder must run each time to provide visual context
- The skip path only calls language_model() which has no vision
- Using it produces garbled output like 'TheTheTheThe...'

Text-only caching still works with a 6x+ speedup.
Multimodal requests correctly get no speedup but produce coherent output (see the gating sketch below).
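
A sketch of the gating logic; the helper names and the message shape are illustrative:

```python
def request_has_images(messages) -> bool:
    # OpenAI-style content parts; the exact message shape is an assumption.
    return any(
        isinstance(m.get("content"), list)
        and any(part.get("type") == "image_url" for part in m["content"])
        for m in messages
    )

def should_skip_prompt_processing(messages, cache_hit: bool) -> bool:
    if request_has_images(messages):
        # The skip path only runs the language model, never the vision
        # encoder, so skipping with images degenerates into repeated tokens.
        return False
    return cache_hit  # text-only requests keep the 6x+ cached-prompt speedup
```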

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>