
feat: Enable continuous batching for MLLM models#1

Merged
lubauss merged 4 commits into main from feat/mllm-continuous-batching on Jan 19, 2026

Conversation

lubauss (Owner) commented on Jan 19, 2026

Summary

  • Add MLLMModelWrapper to extract logits from LanguageModelOutput objects, making MLLM models compatible with BatchGenerator
  • Fix tokenizer handling in scheduler to work with processors (e.g., Qwen3VLProcessor) that wrap the actual tokenizer
  • Fix _get_stop_tokens() to check both processor and nested tokenizer for EOS tokens

Problem

Continuous batching (with prefix caching) was broken for multimodal LLM models like Qwen3-VL and Gemma 3:

  1. AttributeError: 'Qwen3VLProcessor' object has no attribute 'encode' - the scheduler called tokenizer.encode(), but MLLM processors do not expose an encode() method directly
  2. 'LanguageModelOutput' object is not subscriptable - BatchGenerator expected a raw logits array, but MLLM models return LanguageModelOutput(logits=...) objects (illustrated in the sketch below)
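
To make the second failure concrete, here is a minimal, purely illustrative sketch; the stand-in LanguageModelOutput below only mimics the shape of the real mlx-vlm class and is not the actual code:

```python
from dataclasses import dataclass

@dataclass
class LanguageModelOutput:      # illustrative stand-in for mlx-vlm's output type
    logits: list                # an mx.array in the real code

def plain_llm(input_ids):
    return [[0.1, 0.2, 0.3]]    # raw logits: BatchGenerator can index into this

def mllm_language_model(input_ids):
    return LanguageModelOutput(logits=[[0.1, 0.2, 0.3]])

out = mllm_language_model([1, 2, 3])
# out[-1]  -> TypeError: 'LanguageModelOutput' object is not subscriptable
print(out.logits[-1])           # the logits must be unwrapped first
```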

Solution

  1. Created MLLMModelWrapper, which wraps MLLM models and returns .logits from their output
  2. Added _get_actual_tokenizer() to extract the nested tokenizer from processors
  3. Added a _decode_tokens() helper that decodes with the actual tokenizer
  4. Updated the tokenize path to fall back to tokenizer.tokenizer.encode() when given a processor (rough sketch after this list)
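
A rough sketch of how these pieces fit together. The names match the PR description, but the bodies below are assumptions for illustration, not the actual diff:

```python
class MLLMModelWrapper:
    """Make an MLLM look like a plain LLM to BatchGenerator by returning raw logits."""

    def __init__(self, mllm_model):
        self._model = mllm_model

    def __call__(self, *args, **kwargs):
        out = self._model(*args, **kwargs)
        # MLLM models return LanguageModelOutput(logits=...); plain LLMs return the array directly.
        return out.logits if hasattr(out, "logits") else out

    def __getattr__(self, name):
        # Delegate everything else (config, layers, cache handling, ...) to the wrapped model.
        return getattr(self._model, name)


def _get_actual_tokenizer(tokenizer):
    """Processors such as Qwen3VLProcessor wrap the real tokenizer; unwrap it when present."""
    inner = getattr(tokenizer, "tokenizer", None)
    return inner if inner is not None and hasattr(inner, "decode") else tokenizer


def _decode_tokens(tokenizer, token_ids):
    """Decode with the actual tokenizer so processor-wrapped models work in the scheduler."""
    return _get_actual_tokenizer(tokenizer).decode(token_ids)
```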

Performance Results

Tested on M4 Max 128GB with mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit:

| Request    | Latency | Speedup |
|------------|---------|---------|
| 1 (cold)   | 21.7s   | -       |
| 2 (cached) | 1.15s   | 19x     |
| 3 (cached) | 0.79s   | 28x     |
| 4 (cached) | 0.78s   | 28x     |

Multi-turn conversation (6 turns with 17K token context):

  • Average latency improvement: 90.7%
  • Overall speedup: 10.76x

Test plan

  • Tested with Qwen3-VL-30B-A3B-Instruct-4bit model
  • Verified prefix caching works across multi-turn conversations
  • Confirmed standard LLM models still work (no regression)
  • Tested both streaming and non-streaming modes
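
For reference, the multi-turn / prefix-caching check can be reproduced with a short client script like the one below. This assumes the server exposes an OpenAI-compatible endpoint on localhost:8000; the base URL, port, and API key handling are assumptions about the local setup, not part of this PR:

```python
from openai import OpenAI

# Assumed local endpoint; adjust base_url/port to however the server was launched.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = [{"role": "user", "content": "Summarize this document:\n" + open("doc.txt").read()}]
for turn in range(3):
    resp = client.chat.completions.create(
        model="mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
        messages=messages,
    )
    answer = resp.choices[0].message.content
    print(f"turn {turn}: {len(answer)} chars")
    # Re-sending the growing history is what lets the prefix cache skip earlier turns.
    messages += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Expand on your previous answer."},
    ]
```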

🤖 Generated with Claude Code

lubauss and others added 4 commits January 19, 2026 08:28
This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
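
A hedged sketch of the streaming path added in this commit. The mlx_vlm.stream_generate keyword arguments, the yielded chunk type, and the GenerationOutput fields are assumptions; recent mlx-vlm versions yield result objects with a .text attribute, older ones yield plain strings:

```python
from dataclasses import dataclass
from mlx_vlm import stream_generate

@dataclass
class GenerationOutput:          # stand-in for the engine's real output type (fields assumed)
    new_text: str

def stream_chat(model, processor, prompt, images=None, max_tokens=512):
    """Yield incremental text chunks for SSE streaming instead of one final string."""
    for chunk in stream_generate(model, processor, prompt,
                                 image=images, max_tokens=max_tokens):
        # Handle both string chunks and result objects, depending on mlx-vlm version.
        text = chunk if isinstance(chunk, str) else getattr(chunk, "text", "")
        yield GenerationOutput(new_text=text)
```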
Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
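
A minimal sketch of the detection change; only the "gemma-3"/"gemma3" entries come from this commit, the surrounding patterns and the helper name are illustrative assumptions:

```python
# Illustrative version of the MLLM detection check; not the literal utils.py code.
MLLM_PATTERNS = [
    "qwen2-vl", "qwen3-vl", "llava",   # examples of pre-existing patterns (assumed)
    "gemma-3", "gemma3",               # added so Gemma 3 loads with vision support via mlx-vlm
]

def is_mllm(model_name: str) -> bool:
    name = model_name.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)

# e.g. is_mllm("mlx-community/gemma-3-27b-it-4bit") -> True
```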
- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
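
Purely for illustration, the documented env-var override might be set like this before launching the server; the value shown is an example, not a recommendation, and how the patched mlx-vlm consumes the variable is described in the docs section this commit adds:

```python
import os

# Hypothetical usage: widen Gemma 3's effective sliding window before the model is loaded.
os.environ.setdefault("GEMMA3_SLIDING_WINDOW", "32768")
```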
This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
lubauss merged commit f590f7e into main on Jan 19, 2026
lubauss deleted the feat/mllm-continuous-batching branch on January 19, 2026 at 19:45
lubauss added a commit that referenced this pull request Jan 20, 2026
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>