
fix(mllm): concatenate text parts instead of overwriting in chat() #12

Closed

lubauss wants to merge 7 commits into waybarrios:main from
lubauss:patch/fixmllm-concatenate-text-parts-instead-of-overwrit

Conversation

@lubauss
Contributor

@lubauss lubauss commented Jan 20, 2026

Summary

Synced from local development patches.

Files changed

  • vllm_mlx/engine/batched.py
  • vllm_mlx/api/utils.py
  • vllm_mlx/models/mllm.py

🤖 Generated with Claude Code
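The PR title describes the underlying fix: when a chat message's content is a list of parts, each text part was overwriting the previous one, so only the last text segment reached the model. A rough sketch of the intended behavior follows; the helper name and structure are illustrative, not the actual code in vllm_mlx/models/mllm.py.

```python
def extract_text_and_images(content):
    """Collect all text parts and image URLs from an OpenAI-style content list."""
    if isinstance(content, str):
        return content, []

    text_parts = []
    images = []
    for part in content:
        if part.get("type") == "text":
            # Before the fix: text = part["text"]  (each part overwrote the last).
            # After the fix: accumulate and join, so no text part is lost.
            text_parts.append(part["text"])
        elif part.get("type") == "image_url":
            images.append(part["image_url"]["url"])
    return "\n".join(text_parts), images
```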

lubauss and others added 7 commits January 19, 2026 20:45
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming
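A minimal sketch of the stream_chat() flow described above. The mlx_vlm.stream_generate call and the new_text field follow these bullets; the exact stream_generate signature and the chunk's attribute names may differ between mlx_vlm versions, so treat this as illustrative rather than the committed code.

```python
from dataclasses import dataclass

import mlx_vlm  # stream_generate is the API the commit message names


@dataclass
class GenerationOutput:
    """Simplified stand-in for the engine's streaming output type."""
    new_text: str
    finished: bool = False


def stream_chat(model, processor, prompt, images=None, max_tokens=512):
    # Each chunk is assumed to expose only the newly decoded text as .text,
    # so it can be forwarded directly as an SSE delta.
    for chunk in mlx_vlm.stream_generate(
        model, processor, prompt, image=images, max_tokens=max_tokens
    ):
        yield GenerationOutput(new_text=chunk.text)
    yield GenerationOutput(new_text="", finished=True)
```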

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.
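For reference, a sketch of the pattern-based detection this commit extends. Only the two Gemma entries come from the commit message; the surrounding patterns and the helper name are illustrative.

```python
MLLM_PATTERNS = [
    "qwen2-vl", "qwen3-vl", "llava", "pixtral",  # illustrative existing entries
    "gemma-3", "gemma3",                          # added so Gemma 3 loads via mlx-vlm
]


def is_mllm(model_id: str) -> bool:
    """Treat a model as multimodal if its id matches any known VLM pattern."""
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)
```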

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
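A hedged sketch of what the documented mlx-vlm patch amounts to: read GEMMA3_SLIDING_WINDOW from the environment and use it in place of the hard-coded sliding-window size. The variable name comes from this commit; the function name and default value below are placeholders, not the actual mlx-vlm code.

```python
import os


def gemma3_sliding_window(default: int = 1024) -> int:
    """Return the sliding-window override, falling back to the model default."""
    value = os.environ.get("GEMMA3_SLIDING_WINDOW")
    return int(value) if value else default
```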

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer
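A rough sketch of the wrapper idea from the first bullet, assuming the batched engine expects a bare logits array while mlx-vlm's language models return a LanguageModelOutput-style object. The attribute names (.language_model, .logits) are assumptions rather than the verified vllm_mlx code.

```python
class MLLMModelWrapper:
    """Present an mlx-vlm multimodal model to the batched engine as a plain LM."""

    def __init__(self, model):
        self.model = model

    def __call__(self, inputs, cache=None):
        out = self.model.language_model(inputs, cache=cache)
        # Unwrap LanguageModelOutput so the scheduler sees raw logits.
        return out.logits if hasattr(out, "logits") else out
```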

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ching (#4)

Gemma 3's model __call__() requires pixel_values as a positional argument, unlike Qwen2-VL, which makes it optional. This caused "missing required positional argument: 'pixel_values'" errors when using continuous batching with text-only requests.

The MLLMModelWrapper now injects pixel_values=None for text-only requests,
enabling Gemma 3 to work with continuous batching and prefix caching.
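A minimal sketch of that injection, assuming the wrapper calls the full multimodal model directly; everything apart from the explicit pixel_values=None keyword is illustrative.

```python
class MLLMModelWrapper:
    def __init__(self, model):
        self.model = model

    def __call__(self, input_ids, cache=None):
        # Text-only continuous-batching requests carry no image tensor, but
        # Gemma 3's __call__ still requires the pixel_values argument, so
        # pass None explicitly instead of omitting it.
        return self.model(input_ids, pixel_values=None, cache=cache)
```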

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…batch mode (#5)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
#6)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lubauss lubauss closed this Jan 20, 2026
@lubauss lubauss deleted the patch/fixmllm-concatenate-text-parts-instead-of-overwrit branch January 20, 2026 01:09
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
…ats (#12)

- Added StreamOptions model with include_usage: bool
- Added stream_options to ChatCompletionRequest
- Added usage field to ChatCompletionChunk (optional, for final chunk)
- Modified stream_chat_completion() to track tokens and send usage in final chunk

This enables clients to request token usage statistics in streaming responses,
matching OpenAI API behavior.
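A hedged sketch of the schema additions listed above, in the Pydantic style of the OpenAI-compatible API layer. The field names follow the commit message; the surrounding fields and the UsageInfo shape are illustrative.

```python
from typing import Optional

from pydantic import BaseModel


class StreamOptions(BaseModel):
    include_usage: bool = False


class UsageInfo(BaseModel):  # assumed usage payload shape
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


class ChatCompletionRequest(BaseModel):
    model: str
    messages: list
    stream: bool = False
    stream_options: Optional[StreamOptions] = None  # new field


class ChatCompletionChunk(BaseModel):
    id: str
    choices: list
    usage: Optional[UsageInfo] = None  # populated only on the final chunk
```

With include_usage set, the server tracks prompt and completion tokens during streaming and attaches a usage object to the final chunk only, matching OpenAI's behavior.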

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 7, 2026
* eval: upgrade coding/reasoning/general suites with standard benchmarks

Replace easy eval questions (score clustering at 90-100%) with harder
standard benchmark problems for better model differentiation:

- Coding: 10 HumanEval+ problems (medium to hard), max_tokens 800→1200
- Reasoning: 10 MATH-500 problems (levels 2-5) with fraction support
- General: 10 MMLU-Pro multiple choice questions with answer_letter check

Delete old result JSONs since scores are incomparable across suite versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cache clearing between eval suites, selective enable_thinking

- Add POST /v1/cache/clear endpoint to server (clears SimpleEngine prompt cache)
- Eval calls cache/clear between suites to prevent KV cache pollution
- Only disable thinking for general suite (MCQ); coding/reasoning keep thinking enabled
- Bump general max_tokens to 2048 for thinking models
- Re-run all 11 models with corrected eval pipeline
- Update scorecard methodology text
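A minimal FastAPI-style sketch of the cache-clear endpoint described in the first bullet. The route path comes from the commit message; the in-memory dict here stands in for the real SimpleEngine prompt cache.

```python
from fastapi import FastAPI

app = FastAPI()
prompt_cache: dict = {}  # stand-in for the SimpleEngine prompt cache


@app.post("/v1/cache/clear")
async def clear_cache():
    # Dropping cached prefixes keeps one eval suite's KV entries from
    # bleeding into the next suite's runs.
    prompt_cache.clear()
    return {"status": "ok"}
```

The eval harness then POSTs to /v1/cache/clear between suites, as the second bullet describes.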

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* eval: re-run all 11 models with enable_thinking=false globally

Disable thinking mode for ALL eval suites (not just general) to get
accurate scores without thinking tokens eating into max_tokens.
Update methodology text and regenerate scorecard.

Key improvements:
- Qwen3.5-122B-8bit: Reasoning 20%→90%, Coding 80%→90%
- Qwen3.5-35B-8bit: Coding 20%→90%, Reasoning 10%→80%
- GLM-4.7-Flash: Coding 10%→100%, Reasoning 40%→90%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add TODOs for anomalous eval scores to investigate

- MiniMax-M2.5: 10% coding despite strong scores elsewhere
- GLM-4.7-Flash: 50% general despite 100% coding / 90% reasoning
- GPT-OSS-20B: 17% tools / 20% reasoning with minimax parser

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>