fix(mllm): concatenate text parts instead of overwriting in chat() #12
Closed
lubauss wants to merge 7 commits into waybarrios:main
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs. This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
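The continuous-batching changes above describe two adapters: one that pulls logits out of a structured model output, and one that reaches through a processor to its nested tokenizer. The following is only a rough sketch of that idea. `FakeLanguageModelOutput`, `LogitsExtractingWrapper`, and `encode_text` are hypothetical names invented for illustration, and the assumption that a processor exposes its tokenizer as a `.tokenizer` attribute is not confirmed by this page.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FakeLanguageModelOutput:
    """Hypothetical stand-in for a VLM output object that carries logits."""

    logits: List[float]


class LogitsExtractingWrapper:
    """Hypothetical wrapper that always hands bare logits to the caller."""

    def __init__(self, model: Any) -> None:
        self._model = model

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        out = self._model(*args, **kwargs)
        # Unwrap structured outputs; pass plain values through untouched.
        return getattr(out, "logits", out)


def encode_text(tokenizer_or_processor: Any, text: str) -> List[int]:
    """Encode with the nested tokenizer when given a processor-like object.

    Assumption: a processor exposes its tokenizer as `.tokenizer`.
    """
    inner = getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)
    return inner.encode(text)
```

The point of the wrapper is that a batching loop written for text-only models can keep treating the model's return value as logits, regardless of which model family is loaded.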
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs. This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
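The Gemma 3 commit above adds two strings to a pattern list used to decide whether a model should be loaded as a VLM. A minimal sketch of that kind of check follows. `EXAMPLE_MLLM_PATTERNS` and `looks_like_mllm` are hypothetical names, the tuple holds only the two patterns named in the commit message, and case-insensitive substring matching is an assumption about how the real `MLLM_PATTERNS` is consulted.

```python
from typing import Iterable, Tuple

# Illustrative subset only: just the two patterns named in the commit message.
EXAMPLE_MLLM_PATTERNS: Tuple[str, ...] = ("gemma-3", "gemma3")


def looks_like_mllm(
    model_name: str, patterns: Iterable[str] = EXAMPLE_MLLM_PATTERNS
) -> bool:
    """Return True if any pattern occurs in the model name.

    Assumption: matching is a case-insensitive substring test.
    """
    name = model_name.lower()
    return any(pattern in name for pattern in patterns)
```

For example, `looks_like_mllm("mlx-community/gemma-3-27b-it-4bit")` matches on `"gemma-3"`, while a name containing neither pattern falls through to the text-only path.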
…ching (#4)

Gemma 3's model __call__() requires pixel_values as a positional argument, unlike Qwen2-VL, which makes it optional. This caused "missing required positional argument: 'pixel_values'" errors when using continuous batching with text-only requests.

The MLLMModelWrapper now injects pixel_values=None for text-only requests, enabling Gemma 3 to work with continuous batching and prefix caching.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
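The behaviour described in this commit can be sketched as a thin wrapper that supplies `pixel_values=None` whenever the caller did not pass one. This is an illustration under assumptions, not the PR's actual `MLLMModelWrapper`: `PixelValuesDefaultingWrapper` and `strict_model` are hypothetical names, and `strict_model` only mimics a model whose call signature requires `pixel_values`.

```python
from typing import Any, Callable


class PixelValuesDefaultingWrapper:
    """Hypothetical wrapper that fills in pixel_values for text-only calls."""

    def __init__(self, model: Callable[..., Any]) -> None:
        self._model = model

    def __call__(self, input_ids: Any, *args: Any, **kwargs: Any) -> Any:
        # Inject only when the caller gave no pixel_values, either
        # positionally (second argument) or by keyword.
        if not args and "pixel_values" not in kwargs:
            kwargs["pixel_values"] = None
        return self._model(input_ids, *args, **kwargs)


def strict_model(input_ids: Any, pixel_values: Any, cache: Any = None) -> dict:
    """Stand-in for a model whose signature requires pixel_values."""
    return {"input_ids": input_ids, "has_image": pixel_values is not None}
```

Calling `strict_model([1, 2, 3])` directly raises a `TypeError` for the missing argument, whereas the wrapped call succeeds and reports no image.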
…batch mode (#5) Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
#6) Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
…ats (#12)

- Added StreamOptions model with include_usage: bool
- Added stream_options to ChatCompletionRequest
- Added usage field to ChatCompletionChunk (optional, for final chunk)
- Modified stream_chat_completion() to track tokens and send usage in final chunk

This enables clients to request token usage statistics in streaming responses, matching OpenAI API behavior.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
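As a rough illustration of the streaming-usage behaviour this commit describes, the sketch below counts tokens while streaming and emits one extra chunk carrying usage only when `include_usage` is set. It uses plain dataclasses and dicts so that it stays self-contained. The chunk shape is simplified, and `stream_chunks` is a hypothetical helper, not the server's `stream_chat_completion()`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional


@dataclass
class StreamOptions:
    """Simplified stand-in for the StreamOptions model named in the commit."""

    include_usage: bool = False


def stream_chunks(
    tokens: Iterable[str],
    prompt_tokens: int,
    stream_options: Optional[StreamOptions] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one chunk per token, then an optional final usage chunk."""
    completion_tokens = 0
    for token in tokens:
        completion_tokens += 1
        # Ordinary content chunks carry no usage information.
        yield {"delta": token, "usage": None}
    if stream_options is not None and stream_options.include_usage:
        yield {
            "delta": "",
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        }
```

Keeping usage off by default means clients that never send `stream_options` see the same chunk sequence as before.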
sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 7, 2026
* eval: upgrade coding/reasoning/general suites with standard benchmarks

Replace easy eval questions (score clustering at 90-100%) with harder standard benchmark problems for better model differentiation:
- Coding: 10 HumanEval+ problems (medium to hard), max_tokens 800→1200
- Reasoning: 10 MATH-500 problems (levels 2-5) with fraction support
- General: 10 MMLU-Pro multiple choice questions with answer_letter check

Delete old result JSONs since scores are incomparable across suite versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cache clearing between eval suites, selective enable_thinking

- Add POST /v1/cache/clear endpoint to server (clears SimpleEngine prompt cache)
- Eval calls cache/clear between suites to prevent KV cache pollution
- Only disable thinking for general suite (MCQ); coding/reasoning keep thinking enabled
- Bump general max_tokens to 2048 for thinking models
- Re-run all 11 models with corrected eval pipeline
- Update scorecard methodology text

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* eval: re-run all 11 models with enable_thinking=false globally

Disable thinking mode for ALL eval suites (not just general) to get accurate scores without thinking tokens eating into max_tokens. Update methodology text and regenerate scorecard.

Key improvements:
- Qwen3.5-122B-8bit: Reasoning 20%→90%, Coding 80%→90%
- Qwen3.5-35B-8bit: Coding 20%→90%, Reasoning 10%→80%
- GLM-4.7-Flash: Coding 10%→100%, Reasoning 40%→90%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add TODOs for anomalous eval scores to investigate

- MiniMax-M2.5: 10% coding despite strong scores elsewhere
- GLM-4.7-Flash: 50% general despite 100% coding / 90% reasoning
- GPT-OSS-20B: 17% tools / 20% reasoning with minimax parser

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Apr 14, 2026
Summary
Synced from local development patches.
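The PR title names the fix as concatenating text parts instead of overwriting them in chat(). The patch itself is not shown on this page, so the following is only a sketch of the general idea, assuming OpenAI-style message content (either a string or a list of typed parts). `collect_text` is a hypothetical helper, and joining with a newline is an assumed separator.

```python
from typing import Any, Dict, List, Union

Content = Union[str, List[Dict[str, Any]]]


def collect_text(content: Content) -> str:
    """Gather every text part of a message instead of keeping only the last."""
    if isinstance(content, str):
        return content
    texts: List[str] = []
    for part in content:
        if part.get("type") == "text":
            # Append rather than assign, so earlier text parts survive.
            texts.append(part.get("text", ""))
    # Assumption: parts are joined with a newline.
    return "\n".join(texts)
```

With an assignment in place of the append, a message holding text, an image, and more text would keep only the trailing text, which is the overwrite the title refers to.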
Files changed
- vllm_mlx/engine/batched.py
- vllm_mlx/api/utils.py
- vllm_mlx/models/mllm.py

🤖 Generated with Claude Code