
feat: Enable continuous batching for MLLM models#1

Merged
lubauss merged 4 commits into main from feat/mllm-continuous-batching on Jan 19, 2026

Conversation

lubauss (Owner) commented on Jan 19, 2026

Summary

  • Add MLLMModelWrapper to extract logits from LanguageModelOutput objects, making MLLM models compatible with BatchGenerator
  • Fix tokenizer handling in scheduler to work with processors (e.g., Qwen3VLProcessor) that wrap the actual tokenizer
  • Fix _get_stop_tokens() to check both processor and nested tokenizer for EOS tokens

Problem

Continuous batching (with prefix caching) was broken for multimodal LLM models like Qwen3-VL and Gemma 3:

  1. AttributeError: 'Qwen3VLProcessor' object has no attribute 'encode' - the scheduler called tokenizer.encode(), but MLLM processors do not expose an encode() method directly
  2. 'LanguageModelOutput' object is not subscriptable - BatchGenerator expected a raw logits array, but MLLM models return LanguageModelOutput(logits=...) objects (illustrated in the sketch below)
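
To make the second failure concrete, here is a minimal, purely illustrative sketch; the stand-in LanguageModelOutput below only mimics the shape of the real mlx-vlm class and is not the actual code:

```python
from dataclasses import dataclass

@dataclass
class LanguageModelOutput:      # illustrative stand-in for mlx-vlm's output type
    logits: list                # an mx.array in the real code

def plain_llm(input_ids):
    return [[0.1, 0.2, 0.3]]    # raw logits: BatchGenerator can index into this

def mllm_language_model(input_ids):
    return LanguageModelOutput(logits=[[0.1, 0.2, 0.3]])

out = mllm_language_model([1, 2, 3])
# out[-1]  -> TypeError: 'LanguageModelOutput' object is not subscriptable
print(out.logits[-1])           # the logits must be unwrapped first
```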

Solution

  1. Created MLLMModelWrapper, which wraps MLLM models and returns .logits from their output
  2. Added _get_actual_tokenizer() to extract the nested tokenizer from processors
  3. Added a _decode_tokens() helper that decodes with the actual tokenizer
  4. Updated the tokenize path to fall back to tokenizer.tokenizer.encode() when given a processor (rough sketch after this list)
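
A rough sketch of how these pieces fit together. The names match the PR description, but the bodies below are assumptions for illustration, not the actual diff:

```python
class MLLMModelWrapper:
    """Make an MLLM look like a plain LLM to BatchGenerator by returning raw logits."""

    def __init__(self, mllm_model):
        self._model = mllm_model

    def __call__(self, *args, **kwargs):
        out = self._model(*args, **kwargs)
        # MLLM models return LanguageModelOutput(logits=...); plain LLMs return the array directly.
        return out.logits if hasattr(out, "logits") else out

    def __getattr__(self, name):
        # Delegate everything else (config, layers, cache handling, ...) to the wrapped model.
        return getattr(self._model, name)


def _get_actual_tokenizer(tokenizer):
    """Processors such as Qwen3VLProcessor wrap the real tokenizer; unwrap it when present."""
    inner = getattr(tokenizer, "tokenizer", None)
    return inner if inner is not None and hasattr(inner, "decode") else tokenizer


def _decode_tokens(tokenizer, token_ids):
    """Decode with the actual tokenizer so processor-wrapped models work in the scheduler."""
    return _get_actual_tokenizer(tokenizer).decode(token_ids)
```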

Performance Results

Tested on M4 Max 128GB with mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit:

| Request    | Latency | Speedup |
|------------|---------|---------|
| 1 (cold)   | 21.7s   | -       |
| 2 (cached) | 1.15s   | 19x     |
| 3 (cached) | 0.79s   | 28x     |
| 4 (cached) | 0.78s   | 28x     |

Multi-turn conversation (6 turns with 17K token context):

  • Average latency improvement: 90.7%
  • Overall speedup: 10.76x

Test plan

  • Tested with Qwen3-VL-30B-A3B-Instruct-4bit model
  • Verified prefix caching works across multi-turn conversations
  • Confirmed standard LLM models still work (no regression)
  • Tested both streaming and non-streaming modes
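
For reference, the multi-turn / prefix-caching check can be reproduced with a short client script like the one below. This assumes the server exposes an OpenAI-compatible endpoint on localhost:8000; the base URL, port, and API key handling are assumptions about the local setup, not part of this PR:

```python
from openai import OpenAI

# Assumed local endpoint; adjust base_url/port to however the server was launched.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = [{"role": "user", "content": "Summarize this document:\n" + open("doc.txt").read()}]
for turn in range(3):
    resp = client.chat.completions.create(
        model="mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
        messages=messages,
    )
    answer = resp.choices[0].message.content
    print(f"turn {turn}: {len(answer)} chars")
    # Re-sending the growing history is what lets the prefix cache skip earlier turns.
    messages += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Expand on your previous answer."},
    ]
```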

🤖 Generated with Claude Code

lubauss and others added 4 commits January 19, 2026 08:28
This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
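
A hedged sketch of the streaming path added in this commit. The mlx_vlm.stream_generate keyword arguments, the yielded chunk type, and the GenerationOutput fields are assumptions; recent mlx-vlm versions yield result objects with a .text attribute, older ones yield plain strings:

```python
from dataclasses import dataclass
from mlx_vlm import stream_generate

@dataclass
class GenerationOutput:          # stand-in for the engine's real output type (fields assumed)
    new_text: str

def stream_chat(model, processor, prompt, images=None, max_tokens=512):
    """Yield incremental text chunks for SSE streaming instead of one final string."""
    for chunk in stream_generate(model, processor, prompt,
                                 image=images, max_tokens=max_tokens):
        # Handle both string chunks and result objects, depending on mlx-vlm version.
        text = chunk if isinstance(chunk, str) else getattr(chunk, "text", "")
        yield GenerationOutput(new_text=text)
```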
Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
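
A minimal sketch of the detection change; only the "gemma-3"/"gemma3" entries come from this commit, the surrounding patterns and the helper name are illustrative assumptions:

```python
# Illustrative version of the MLLM detection check; not the literal utils.py code.
MLLM_PATTERNS = [
    "qwen2-vl", "qwen3-vl", "llava",   # examples of pre-existing patterns (assumed)
    "gemma-3", "gemma3",               # added so Gemma 3 loads with vision support via mlx-vlm
]

def is_mllm(model_name: str) -> bool:
    name = model_name.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)

# e.g. is_mllm("mlx-community/gemma-3-27b-it-4bit") -> True
```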
- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
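
Purely for illustration, the documented env-var override might be set like this before launching the server; the value shown is an example, not a recommendation, and how the patched mlx-vlm consumes the variable is described in the docs section this commit adds:

```python
import os

# Hypothetical usage: widen Gemma 3's effective sliding window before the model is loaded.
os.environ.setdefault("GEMMA3_SLIDING_WINDOW", "32768")
```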
This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
lubauss merged commit f590f7e into main on Jan 19, 2026
lubauss deleted the feat/mllm-continuous-batching branch on January 19, 2026 at 19:45
lubauss added a commit that referenced this pull request Jan 20, 2026
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>