feat: MLLM prefix caching with 3x speedup#13

Closed
lubauss wants to merge 8 commits into waybarrios:main from lubauss:patch/feat-mllm-prefix-caching-with-3x-speedup

Conversation

@lubauss
Contributor

@lubauss lubauss commented Jan 20, 2026

Summary

Synced from local development patches.

Files changed

  • vllm_mlx/engine/batched.py
  • vllm_mlx/api/utils.py
  • vllm_mlx/models/mllm.py

🤖 Generated with Claude Code

lubauss and others added 8 commits January 19, 2026 20:45
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
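A rough sketch of the two fixes above, with illustrative names (GenerationOutput, MLLMMultimodalLM, and the message shape are assumptions for demonstration, not the actual vllm-mlx API): the engine forwards full message objects so image parts survive, and stream_chat() yields one output per token.

```python
# Illustrative sketch of the vision + streaming fixes. Class and field
# names here are assumptions, not the real vllm-mlx definitions.
from dataclasses import dataclass

@dataclass
class GenerationOutput:
    new_text: str  # incremental text chunk for the SSE stream

class MLLMMultimodalLM:
    def chat(self, messages):
        # Before the fix only a flattened prompt string reached the model,
        # so {"type": "image"} entries were silently dropped.
        parts = [p for m in messages for p in m.get("content", [])]
        images = [p for p in parts if p.get("type") == "image"]
        texts = [p["text"] for p in parts if p.get("type") == "text"]
        return f"saw {len(images)} image(s): {' '.join(texts)}"

    def stream_chat(self, messages):
        # Token-by-token streaming instead of one blocking completion.
        for token in self.chat(messages).split():
            yield GenerationOutput(new_text=token + " ")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "cat.png"},
    {"type": "text", "text": "describe this"},
]}]
chunks = [out.new_text for out in MLLMMultimodalLM().stream_chat(messages)]
print("".join(chunks))
```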

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
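The detection change amounts to a substring check against a pattern list. A minimal sketch follows; the non-Gemma entries in MLLM_PATTERNS are assumptions for illustration, and only "gemma-3"/"gemma3" are the additions this commit describes.

```python
# Illustrative version of the MLLM pattern check in vllm_mlx/api/utils.py.
MLLM_PATTERNS = ["qwen2-vl", "qwen3-vl", "llava", "gemma-3", "gemma3"]

def is_mllm_model(model_id: str) -> bool:
    """Substring match against known multimodal model name patterns."""
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)

print(is_mllm_model("mlx-community/gemma-3-27b-it-4bit"))  # True
```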

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
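A hedged sketch of what an env-var override like the one documented above might look like; GEMMA3_SLIDING_WINDOW is the variable named in the docs, but the default value and function name here are illustrative, not mlx-vlm's actual code.

```python
# Sketch of an env-var override for Gemma 3's sliding window.
import os

def gemma3_sliding_window(default: int = 1024) -> int:
    raw = os.environ.get("GEMMA3_SLIDING_WINDOW")
    return int(raw) if raw else default

os.environ["GEMMA3_SLIDING_WINDOW"] = "4096"
print(gemma3_sliding_window())  # 4096
```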

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
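A hypothetical shape for two of the shims above (all names are assumptions for illustration): unwrap logits from a LanguageModelOutput-style result so the batched engine sees raw arrays, and resolve the nested tokenizer when handed a processor such as Qwen3VLProcessor.

```python
# Sketch of the batching shims; not the actual vllm-mlx classes.
from dataclasses import dataclass

@dataclass
class LanguageModelOutput:
    logits: list

class MLLMModelWrapper:
    """Adapts an MLLM so the batched engine sees raw logits."""
    def __init__(self, model):
        self.model = model

    def __call__(self, *args, **kwargs):
        out = self.model(*args, **kwargs)
        # The engine expects logits, not a structured output object.
        return out.logits if isinstance(out, LanguageModelOutput) else out

def get_tokenizer(tok_or_processor):
    # Processors nest a tokenizer; plain tokenizers are returned as-is.
    return getattr(tok_or_processor, "tokenizer", tok_or_processor)

wrapped = MLLMModelWrapper(lambda tokens: LanguageModelOutput(logits=[0.1, 0.9]))
print(wrapped([1, 2]))  # [0.1, 0.9]
```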

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ching (#4)

Gemma 3's model __call__() requires pixel_values as a positional argument,
unlike Qwen2-VL, which makes it optional. This caused "missing required
positional argument: 'pixel_values'" errors when using continuous batching
with text-only requests.

The MLLMModelWrapper now injects pixel_values=None for text-only requests,
enabling Gemma 3 to work with continuous batching and prefix caching.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
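A minimal sketch of how such a wrapper might inject pixel_values=None for text-only batches (the class name and signature inspection are assumptions for illustration, not the actual implementation):

```python
# Sketch: inject pixel_values=None when the underlying model requires it
# positionally (Gemma 3) and the request carries no images.
import inspect

class MLLMModelWrapper:
    def __init__(self, model):
        self.model = model
        self._needs_pixel_values = (
            "pixel_values" in inspect.signature(model).parameters
        )

    def __call__(self, input_ids, **kwargs):
        if self._needs_pixel_values and "pixel_values" not in kwargs:
            kwargs["pixel_values"] = None  # text-only request
        return self.model(input_ids, **kwargs)

def gemma3_forward(input_ids, pixel_values, cache=None):
    # Stand-in for Gemma 3's __call__, where pixel_values is required.
    return "text-only" if pixel_values is None else "vision"

wrapped = MLLMModelWrapper(gemma3_forward)
print(wrapped([1, 2, 3]))  # text-only
```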
…batch mode (#5)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
#6)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lubauss lubauss closed this Jan 20, 2026
@lubauss lubauss deleted the patch/feat-mllm-prefix-caching-with-3x-speedup branch January 20, 2026 02:32
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
…positions (#13)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 7, 2026
…rrios#13)

* feat: add seed_oss, deepseek_v31, qwen3_coder_xml tool parsers

Port 3 upstream vLLM tool parsers for popular MLX models:
- seed_oss: GPT-OSS-20B XML format (<seed:tool_call> + <seed:think>)
- deepseek_v31: DeepSeek V3.1/R1-0528 unicode special tokens
- qwen3_coder_xml: Qwen3-Coder XML format (<tool_call>/<function=...>)

Includes 72 upstream regression tests and eval config updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: GLM47 streaming test, multi-step streaming tests, path note

- Fix GLM47 test_streaming_no_tool_calls to match current strip_think_tags
  behavior (strips leading whitespace from content deltas)
- Add multi-step streaming tests for seed_oss and qwen3coder that verify
  header + { + params + } are all emitted across multiple calls
- Add note that run_all_models.sh paths are machine-specific

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: review feedback — GLM47 whitespace, streaming tests, path note

- Fix GLM47 streaming: strip_think_tags was eating inter-word spaces on
  normal content deltas; now only strips when </think> is actually present
- Add multi-step streaming tests for seed_oss and qwen3coder that verify
  complete tool call emission (header + { + params + }) with fine-grained
  deltas matching realistic token boundaries
- Add note that run_all_models.sh paths are machine-specific

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: streaming completeness, GLM47 whitespace, coarse-delta resilience

Streaming completeness (seed_oss + qwen3coder):
- When the function body is already complete at header-detection time,
  emit the full tool call (name + arguments) in one chunk instead of
  header-only. This prevents truncated output when coarse deltas or
  max_tokens leave no further parser calls.
- When tool_call_start is detected, fall through to header parsing
  instead of returning None — the header may already be available.

GLM47 streaming:
- Only call strip_think_tags when </think> is actually present in the
  delta, preventing inter-word spaces from being eaten on normal content.

Tests:
- Add coarse-delta streaming tests that verify complete arguments are
  emitted even with a single large chunk (seed_oss + qwen3coder).
- Fix GLM47 streaming test to expect preserved whitespace.

Other:
- Remove misleading MODEL_DIR env var reference from run_all_models.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
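The GLM47 change reduces to a guard before stripping. A simplified sketch, where strip_think_tags is a stand-in for the real helper:

```python
# Only strip think tags when </think> is actually in the delta, so normal
# content deltas keep their inter-word whitespace.
def strip_think_tags(text: str) -> str:
    # Drop everything up to and including the closing tag, plus leading
    # whitespace after it.
    _, _, rest = text.partition("</think>")
    return rest.lstrip()

def clean_delta(delta: str) -> str:
    if "</think>" in delta:
        return strip_think_tags(delta)
    return delta  # leave whitespace intact on ordinary content

print(repr(clean_delta(" world")))        # ' world' (space preserved)
print(repr(clean_delta("x</think> Hi")))  # 'Hi'
```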

* fix: harmony parser regex for GPT-OSS actual template format

The GPT-OSS chat template generates tool calls as:
  <|start|>assistant to=functions.NAME<|channel|>commentary json<|message|>ARGS<|call|>

But the harmony regex expected:
  <|channel|>commentary to=functions.NAME <|message|>ARGS<|call|>

The to=functions.NAME comes before <|channel|>commentary in reality,
not after. This mismatch caused 17% tool calling score.

Fix: support both formats (real + legacy test format) via alternation.
Also accept <|end|> as final channel terminator alongside <|return|>.
Revert GPT-OSS eval config from seed_oss back to harmony.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
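The alternation idea can be sketched as follows. This is a simplified regex, not the parser's actual pattern, covering the real template layout (to=functions.NAME before <|channel|>) and the legacy one (after), with <|call|>, <|end|>, or <|return|> as terminators:

```python
# Simplified harmony tool-call regex supporting both layouts.
import re

TOOL_CALL_RE = re.compile(
    r"(?:to=functions\.(?P<name1>\w+)<\|channel\|>commentary(?:\s+json)?"
    r"|<\|channel\|>commentary\s+to=functions\.(?P<name2>\w+)\s*)"
    r"<\|message\|>(?P<args>.*?)(?:<\|call\|>|<\|end\|>|<\|return\|>)",
    re.DOTALL,
)

def parse_tool_call(text: str):
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    return (m.group("name1") or m.group("name2"), m.group("args"))

real = ('<|start|>assistant to=functions.get_weather<|channel|>commentary '
        'json<|message|>{"city":"SF"}<|call|>')
legacy = ('<|channel|>commentary to=functions.get_weather '
          '<|message|>{"city":"SF"}<|call|>')
print(parse_tool_call(real))
print(parse_tool_call(legacy))
```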

* fix: harmony native tool format, VLM model loading fallback

- Set HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT = True so multi-turn
  tool history uses native harmony tokens instead of plain text conversion
  ("[Calling tool: ...]"), which broke GPT-OSS tool flow understanding.

- Extend load_model_with_fallback to catch "Missing N parameters" errors
  (not just "parameters not in model") for VLM-packaged models like
  Qwen3.5-9B and Mistral-Small-3.2 that need strict=False.

- Update harmony and native format tests accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
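A hedged sketch of the fallback logic described above (load_fn and the exact error strings are stand-ins modeled on the commit text, not the real mlx loader API): retry with strict=False only when the failure looks like a weight-count mismatch.

```python
# Retry weight loading with strict=False on missing-parameter errors.
def load_model_with_fallback(load_fn, path):
    try:
        return load_fn(path, strict=True)
    except ValueError as e:
        msg = str(e)
        # Explicit parentheses, per the review note on or/and precedence.
        if ("parameters not in model" in msg) or (
            "Missing" in msg and "parameters" in msg
        ):
            return load_fn(path, strict=False)  # VLM-packaged checkpoint
        raise

def fake_loader(path, strict):
    # Simulates a VLM-packaged model that only loads non-strictly.
    if strict:
        raise ValueError("Missing 12 parameters")
    return f"loaded {path} (strict=False)"

print(load_model_with_fallback(fake_loader, "some-vlm-model"))
```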

* fix: review round 1 — operator precedence, float type coercion

- Add explicit parentheses in tokenizer.py fallback condition to clarify
  `or`/`and` precedence (behavior was correct but ambiguous to read).

- Fix _convert_param_value() in seed_oss and qwen3coder parsers: when
  schema says "number"/"float", always return float instead of silently
  coercing 3.0 → int(3). Removes lossy `fv - int(fv) != 0` check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
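The coercion fix can be sketched with a simplified stand-in for the parsers' _convert_param_value helper: a "number"/"float" schema type always yields a float, rather than demoting 3.0 to int(3).

```python
# Simplified parameter coercion keyed on the tool schema's declared type.
def convert_param_value(value: str, schema_type: str):
    if schema_type in ("number", "float"):
        return float(value)  # never silently narrow 3.0 to 3
    if schema_type in ("integer", "int"):
        return int(value)
    return value

print(convert_param_value("3.0", "number"))  # 3.0
```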

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
mtomcal added a commit to mtomcal/vllm-mlx that referenced this pull request Apr 4, 2026
Refactor streaming to use tested granular event builders instead of
inline dict construction, fixing the gap where tested code wasn't
production code (waybarrios#13). Fix text omission in completed events (waybarrios#6),
add [DONE] sentinel (waybarrios#8), use typed output models to prevent
cross-type field leakage (waybarrios#4, waybarrios#5), fix content join separator (waybarrios#10),
remove dead code branches (waybarrios#9, waybarrios#11), and warn on unrecognized content
types (waybarrios#7). Add Codex CLI setup guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
