feat(batched): add hybrid multimodal support for MLLM image processing by lubauss · Pull Request #10 · waybarrios/vllm-mlx

lubauss · 2026-01-20T00:35:09Z

Summary

Synced from local development patches.

Files changed

vllm_mlx/engine/batched.py
vllm_mlx/api/utils.py

🤖 Generated with Claude Code

* fix: Enable vision and streaming for MLLM models

  This patch fixes two critical issues with multimodal language models (MLLM):

  ## Vision Fix (server.py, simple.py)
  - Preserve original messages when calling MLLM models
  - The engine was passing only the prompt string, losing image data
  - Now passes full message objects with images to MLLM.chat()

  ## Streaming Fix (mllm.py, simple.py)
  - Add stream_chat() method to MLLMMultimodalLM class
  - Uses mlx_vlm.stream_generate() for true token-by-token streaming
  - Update engine to call stream_chat() for MLLM models
  - Properly yields GenerationOutput with new_text for SSE streaming

  Tested with:
  - mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
  - Text streaming: 5 tokens streamed correctly
  - Vision streaming: Image analysis works with streaming

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

  Gemma 3 models are multimodal but weren't being detected as VLMs. This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx correctly loads them with vision support via mlx-vlm.

  Tested with mlx-community/gemma-3-27b-it-4bit:
  - Vision: ✅ Working (cat, Kali, Ganesha images)
  - Streaming: ✅ Working (40 chunks)
  - Long context: ✅ Up to ~5K tokens

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

  - Document Gemma 3 MLLM detection (already patched in utils.py)
  - Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
  - Include benchmark results showing 5x improvement (10K → 50K tokens)
  - Explain Metal GPU timeout limitation and workaround

* feat: Enable continuous batching for MLLM models

  This patch enables continuous batching (with prefix caching) for multimodal LLM models like Qwen3-VL and Gemma 3.

  Changes:
  - Add MLLMModelWrapper to extract logits from LanguageModelOutput
  - Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
  - Fix tokenizer.decode to use nested tokenizer for processors
  - Fix _get_stop_tokens to check both processor and tokenizer

  Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
  - First request (cache miss): ~22s for 17K tokens
  - Subsequent requests (cache hit): ~0.8-1.2s
  - Speedup: 10-28x faster with prefix caching

  Multi-turn conversation (6 turns, 90K char document):
  - 90.7% faster on average
  - 10.76x speedup vs uncached

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
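The detection change described above (adding "gemma-3" and "gemma3" to MLLM_PATTERNS) can be illustrated with a minimal sketch. Only the name `MLLM_PATTERNS` and the two Gemma entries come from the commit message; the other pattern entries, the helper name `is_mllm_model`, and the substring-matching strategy are assumptions for illustration and may not match what `vllm_mlx/api/utils.py` actually does.

```python
# Hypothetical sketch: only MLLM_PATTERNS and the two Gemma entries are taken
# from the commit message; everything else is an illustrative assumption.
MLLM_PATTERNS = (
    "qwen3-vl",  # assumed entry, based on the model tested in this PR
    "gemma-3",   # added by this change
    "gemma3",    # added by this change
)


def is_mllm_model(model_name: str) -> bool:
    """Return True if the model name contains a known multimodal pattern."""
    name = model_name.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)
```

With a check like this, `mlx-community/gemma-3-27b-it-4bit` would be routed to the vision-capable loader, while a text-only model name with no matching pattern would not.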


…ching (#4) Gemma 3's model __call__() requires pixel_values as a positional argument, unlike Qwen2-VL which makes it optional. This caused "missing required positional argument: 'pixel_values'" errors when using continuous batching with text-only requests. The MLLMModelWrapper now injects pixel_values=None for text-only requests, enabling Gemma 3 to work with continuous batching and prefix caching. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
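A rough sketch of the wrapper behavior described above, assuming the wrapper simply fills in `pixel_values=None` when the caller omits it and unwraps a `logits` attribute when one is present. The class body and the stand-in `FakeGemma3` model are hypothetical; the real `MLLMModelWrapper` in this repository may be structured differently.

```python
from types import SimpleNamespace


class MLLMModelWrapper:
    """Hypothetical sketch; not the repository's actual implementation."""

    def __init__(self, model):
        self._model = model

    def __call__(self, input_ids, *args, **kwargs):
        # Text-only requests carry no image tensor. Supplying None lets models
        # whose __call__ requires pixel_values accept the call anyway.
        kwargs.setdefault("pixel_values", None)
        output = self._model(input_ids, *args, **kwargs)
        # Return bare logits when the model wraps them in an output object.
        return getattr(output, "logits", output)


class FakeGemma3:
    """Stand-in model (assumption) that requires pixel_values, as described."""

    def __call__(self, input_ids, pixel_values, cache=None):
        return SimpleNamespace(logits=[len(input_ids)])


wrapped = MLLMModelWrapper(FakeGemma3())
text_only_logits = wrapped([1, 2, 3])                 # no pixel_values passed
image_logits = wrapped([1, 2], pixel_values="image")  # caller-supplied value kept
```

Using `setdefault` means a caller that does supply image data is never overridden; only the missing-argument case is patched.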

…batch mode (#5) Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Synced from local patches in .venv-vllm-mlx Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: engine state tracking and None guards for cache layers

Refactor streaming to use tested granular event builders instead of inline dict construction, fixing the gap where tested code wasn't production code (waybarrios#13). Fix text omission in completed events (waybarrios#6), add [DONE] sentinel (waybarrios#8), use typed output models to prevent cross-type field leakage (waybarrios#4, waybarrios#5), fix content join separator (waybarrios#10), remove dead code branches (waybarrios#9, waybarrios#11), and warn on unrecognized content types (waybarrios#7). Add Codex CLI setup guide. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
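To illustrate the idea of building stream events through small, testable functions and ending the stream with a [DONE] sentinel, here is a minimal sketch. The function names and the payload shape are assumptions made for illustration; they are not taken from the referenced commit.

```python
import json
from typing import Iterable, Iterator


def build_delta_event(text: str) -> str:
    """Format one text chunk as a server-sent event (payload shape assumed)."""
    payload = {"type": "delta", "text": text}
    return f"data: {json.dumps(payload)}\n\n"


def build_done_event() -> str:
    """Format the end-of-stream sentinel as a server-sent event."""
    return "data: [DONE]\n\n"


def stream_events(chunks: Iterable[str]) -> Iterator[str]:
    """Yield one event per chunk, then the [DONE] sentinel."""
    for chunk in chunks:
        yield build_delta_event(chunk)
    yield build_done_event()


events = list(stream_events(["Hi", " there"]))
```

Because the streaming path calls the same builders that unit tests exercise, the tested code and the production code cannot drift apart the way inline dict construction can.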

* fix: unify tool-enabled simple chat on streaming path
* fix: preserve simple chat contracts on streaming path
* fix: keep tool chat on the streaming execution path
* fix: preserve streamed completion token counts

lubauss and others added 5 commits January 19, 2026 20:45

feat(gemma3): auto-configure sliding window and fix pixel_values for …

d37cb11


feat(batched): add hybrid multimodal support for MLLM image processing

75fe43e


lubauss closed this Jan 20, 2026

lubauss deleted the patch/featbatched-add-hybrid-multimodal-support-for-mllm branch January 20, 2026 00:35

sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 3, 2026

Merge pull request waybarrios#10 from raullenchai/fix/review-cleanup

e717b06


This was referenced Apr 14, 2026

Security audit: authentication bypass, SSRF, and other vulnerabilities #68

Open

fix(audio): enforce endpoint resource limits #335

Merged

security: add audio upload and TTS input size limits #336

Closed


lubauss wants to merge 5 commits into waybarrios:main from lubauss:patch/featbatched-add-hybrid-multimodal-support-for-mllm
