feat: MLLM prefix caching with 3x speedup#13

Closed
lubauss wants to merge 8 commits into waybarrios:main from lubauss:patch/feat-mllm-prefix-caching-with-3x-speedup

Conversation

@lubauss
Contributor

@lubauss lubauss commented Jan 20, 2026

Summary

Synced from local development patches.

Files changed

  • vllm_mlx/engine/batched.py
  • vllm_mlx/api/utils.py
  • vllm_mlx/models/mllm.py

🤖 Generated with Claude Code

lubauss and others added 8 commits January 19, 2026 20:45
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
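A rough sketch of the two fixes above, with illustrative names (GenerationOutput, MLLMMultimodalLM, and the message shape are assumptions for demonstration, not the actual vllm-mlx API): the engine forwards full message objects so image parts survive, and stream_chat() yields one output per token.

```python
# Illustrative sketch of the vision + streaming fixes. Class and field
# names here are assumptions, not the real vllm-mlx definitions.
from dataclasses import dataclass

@dataclass
class GenerationOutput:
    new_text: str  # incremental text chunk for the SSE stream

class MLLMMultimodalLM:
    def chat(self, messages):
        # Before the fix only a flattened prompt string reached the model,
        # so {"type": "image"} entries were silently dropped.
        parts = [p for m in messages for p in m.get("content", [])]
        images = [p for p in parts if p.get("type") == "image"]
        texts = [p["text"] for p in parts if p.get("type") == "text"]
        return f"saw {len(images)} image(s): {' '.join(texts)}"

    def stream_chat(self, messages):
        # Token-by-token streaming instead of one blocking completion.
        for token in self.chat(messages).split():
            yield GenerationOutput(new_text=token + " ")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "cat.png"},
    {"type": "text", "text": "describe this"},
]}]
chunks = [out.new_text for out in MLLMMultimodalLM().stream_chat(messages)]
print("".join(chunks))
```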

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
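The detection change amounts to a substring check against a pattern list. A minimal sketch follows; the non-Gemma entries in MLLM_PATTERNS are assumptions for illustration, and only "gemma-3"/"gemma3" are the additions this commit describes.

```python
# Illustrative version of the MLLM pattern check in vllm_mlx/api/utils.py.
MLLM_PATTERNS = ["qwen2-vl", "qwen3-vl", "llava", "gemma-3", "gemma3"]

def is_mllm_model(model_id: str) -> bool:
    """Substring match against known multimodal model name patterns."""
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)

print(is_mllm_model("mlx-community/gemma-3-27b-it-4bit"))  # True
```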

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
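A hedged sketch of what an env-var override like the one documented above might look like; GEMMA3_SLIDING_WINDOW is the variable named in the docs, but the default value and function name here are illustrative, not mlx-vlm's actual code.

```python
# Sketch of an env-var override for Gemma 3's sliding window.
import os

def gemma3_sliding_window(default: int = 1024) -> int:
    raw = os.environ.get("GEMMA3_SLIDING_WINDOW")
    return int(raw) if raw else default

os.environ["GEMMA3_SLIDING_WINDOW"] = "4096"
print(gemma3_sliding_window())  # 4096
```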

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
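A hypothetical shape for two of the shims above (all names are assumptions for illustration): unwrap logits from a LanguageModelOutput-style result so the batched engine sees raw arrays, and resolve the nested tokenizer when handed a processor such as Qwen3VLProcessor.

```python
# Sketch of the batching shims; not the actual vllm-mlx classes.
from dataclasses import dataclass

@dataclass
class LanguageModelOutput:
    logits: list

class MLLMModelWrapper:
    """Adapts an MLLM so the batched engine sees raw logits."""
    def __init__(self, model):
        self.model = model

    def __call__(self, *args, **kwargs):
        out = self.model(*args, **kwargs)
        # The engine expects logits, not a structured output object.
        return out.logits if isinstance(out, LanguageModelOutput) else out

def get_tokenizer(tok_or_processor):
    # Processors nest a tokenizer; plain tokenizers are returned as-is.
    return getattr(tok_or_processor, "tokenizer", tok_or_processor)

wrapped = MLLMModelWrapper(lambda tokens: LanguageModelOutput(logits=[0.1, 0.9]))
print(wrapped([1, 2]))  # [0.1, 0.9]
```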

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ching (#4)

Gemma 3's model __call__() requires pixel_values as a positional argument,
unlike Qwen2-VL, which makes it optional. This caused "missing required
positional argument: 'pixel_values'" errors when using continuous batching
with text-only requests.

The MLLMModelWrapper now injects pixel_values=None for text-only requests,
enabling Gemma 3 to work with continuous batching and prefix caching.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
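A minimal sketch of how such a wrapper might inject pixel_values=None for text-only batches (the class name and signature inspection are assumptions for illustration, not the actual implementation):

```python
# Sketch: inject pixel_values=None when the underlying model requires it
# positionally (Gemma 3) and the request carries no images.
import inspect

class MLLMModelWrapper:
    def __init__(self, model):
        self.model = model
        self._needs_pixel_values = (
            "pixel_values" in inspect.signature(model).parameters
        )

    def __call__(self, input_ids, **kwargs):
        if self._needs_pixel_values and "pixel_values" not in kwargs:
            kwargs["pixel_values"] = None  # text-only request
        return self.model(input_ids, **kwargs)

def gemma3_forward(input_ids, pixel_values, cache=None):
    # Stand-in for Gemma 3's __call__, where pixel_values is required.
    return "text-only" if pixel_values is None else "vision"

wrapped = MLLMModelWrapper(gemma3_forward)
print(wrapped([1, 2, 3]))  # text-only
```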
…batch mode (#5)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
#6)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lubauss lubauss closed this Jan 20, 2026
@lubauss lubauss deleted the patch/feat-mllm-prefix-caching-with-3x-speedup branch January 20, 2026 02:32
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
…positions (#13)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 7, 2026
…rrios#13)

* feat: add seed_oss, deepseek_v31, qwen3_coder_xml tool parsers

Port 3 upstream vLLM tool parsers for popular MLX models:
- seed_oss: GPT-OSS-20B XML format (<seed:tool_call> + <seed:think>)
- deepseek_v31: DeepSeek V3.1/R1-0528 unicode special tokens
- qwen3_coder_xml: Qwen3-Coder XML format (<tool_call>/<function=...>)

Includes 72 upstream regression tests and eval config updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: GLM47 streaming test, multi-step streaming tests, path note

- Fix GLM47 test_streaming_no_tool_calls to match current strip_think_tags
  behavior (strips leading whitespace from content deltas)
- Add multi-step streaming tests for seed_oss and qwen3coder that verify
  header + { + params + } are all emitted across multiple calls
- Add note that run_all_models.sh paths are machine-specific

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: review feedback — GLM47 whitespace, streaming tests, path note

- Fix GLM47 streaming: strip_think_tags was eating inter-word spaces on
  normal content deltas; now only strips when </think> is actually present
- Add multi-step streaming tests for seed_oss and qwen3coder that verify
  complete tool call emission (header + { + params + }) with fine-grained
  deltas matching realistic token boundaries
- Add note that run_all_models.sh paths are machine-specific

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: streaming completeness, GLM47 whitespace, coarse-delta resilience

Streaming completeness (seed_oss + qwen3coder):
- When the function body is already complete at header-detection time,
  emit the full tool call (name + arguments) in one chunk instead of
  header-only. This prevents truncated output when coarse deltas or
  max_tokens leave no further parser calls.
- When tool_call_start is detected, fall through to header parsing
  instead of returning None — the header may already be available.

GLM47 streaming:
- Only call strip_think_tags when </think> is actually present in the
  delta, preventing inter-word spaces from being eaten on normal content.

Tests:
- Add coarse-delta streaming tests that verify complete arguments are
  emitted even with a single large chunk (seed_oss + qwen3coder).
- Fix GLM47 streaming test to expect preserved whitespace.

Other:
- Remove misleading MODEL_DIR env var reference from run_all_models.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
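The GLM47 change reduces to a guard before stripping. A simplified sketch, where strip_think_tags is a stand-in for the real helper:

```python
# Only strip think tags when </think> is actually in the delta, so normal
# content deltas keep their inter-word whitespace.
def strip_think_tags(text: str) -> str:
    # Drop everything up to and including the closing tag, plus leading
    # whitespace after it.
    _, _, rest = text.partition("</think>")
    return rest.lstrip()

def clean_delta(delta: str) -> str:
    if "</think>" in delta:
        return strip_think_tags(delta)
    return delta  # leave whitespace intact on ordinary content

print(repr(clean_delta(" world")))        # ' world' (space preserved)
print(repr(clean_delta("x</think> Hi")))  # 'Hi'
```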

* fix: harmony parser regex for GPT-OSS actual template format

The GPT-OSS chat template generates tool calls as:
  <|start|>assistant to=functions.NAME<|channel|>commentary json<|message|>ARGS<|call|>

But the harmony regex expected:
  <|channel|>commentary to=functions.NAME <|message|>ARGS<|call|>

The to=functions.NAME comes before <|channel|>commentary in reality,
not after. This mismatch caused 17% tool calling score.

Fix: support both formats (real + legacy test format) via alternation.
Also accept <|end|> as final channel terminator alongside <|return|>.
Revert GPT-OSS eval config from seed_oss back to harmony.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
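The alternation idea can be sketched as follows. This is a simplified regex, not the parser's actual pattern, covering the real template layout (to=functions.NAME before <|channel|>) and the legacy one (after), with <|call|>, <|end|>, or <|return|> as terminators:

```python
# Simplified harmony tool-call regex supporting both layouts.
import re

TOOL_CALL_RE = re.compile(
    r"(?:to=functions\.(?P<name1>\w+)<\|channel\|>commentary(?:\s+json)?"
    r"|<\|channel\|>commentary\s+to=functions\.(?P<name2>\w+)\s*)"
    r"<\|message\|>(?P<args>.*?)(?:<\|call\|>|<\|end\|>|<\|return\|>)",
    re.DOTALL,
)

def parse_tool_call(text: str):
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    return (m.group("name1") or m.group("name2"), m.group("args"))

real = ('<|start|>assistant to=functions.get_weather<|channel|>commentary '
        'json<|message|>{"city":"SF"}<|call|>')
legacy = ('<|channel|>commentary to=functions.get_weather '
          '<|message|>{"city":"SF"}<|call|>')
print(parse_tool_call(real))
print(parse_tool_call(legacy))
```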

* fix: harmony native tool format, VLM model loading fallback

- Set HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT = True so multi-turn
  tool history uses native harmony tokens instead of plain text conversion
  ("[Calling tool: ...]"), which broke GPT-OSS tool flow understanding.

- Extend load_model_with_fallback to catch "Missing N parameters" errors
  (not just "parameters not in model") for VLM-packaged models like
  Qwen3.5-9B and Mistral-Small-3.2 that need strict=False.

- Update harmony and native format tests accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
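A hedged sketch of the fallback logic described above (load_fn and the exact error strings are stand-ins modeled on the commit text, not the real mlx loader API): retry with strict=False only when the failure looks like a weight-count mismatch.

```python
# Retry weight loading with strict=False on missing-parameter errors.
def load_model_with_fallback(load_fn, path):
    try:
        return load_fn(path, strict=True)
    except ValueError as e:
        msg = str(e)
        # Explicit parentheses, per the review note on or/and precedence.
        if ("parameters not in model" in msg) or (
            "Missing" in msg and "parameters" in msg
        ):
            return load_fn(path, strict=False)  # VLM-packaged checkpoint
        raise

def fake_loader(path, strict):
    # Simulates a VLM-packaged model that only loads non-strictly.
    if strict:
        raise ValueError("Missing 12 parameters")
    return f"loaded {path} (strict=False)"

print(load_model_with_fallback(fake_loader, "some-vlm-model"))
```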

* fix: review round 1 — operator precedence, float type coercion

- Add explicit parentheses in tokenizer.py fallback condition to clarify
  `or`/`and` precedence (behavior was correct but ambiguous to read).

- Fix _convert_param_value() in seed_oss and qwen3coder parsers: when
  schema says "number"/"float", always return float instead of silently
  coercing 3.0 → int(3). Removes lossy `fv - int(fv) != 0` check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
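The coercion fix can be sketched with a simplified stand-in for the parsers' _convert_param_value helper: a "number"/"float" schema type always yields a float, rather than demoting 3.0 to int(3).

```python
# Simplified parameter coercion keyed on the tool schema's declared type.
def convert_param_value(value: str, schema_type: str):
    if schema_type in ("number", "float"):
        return float(value)  # never silently narrow 3.0 to 3
    if schema_type in ("integer", "int"):
        return int(value)
    return value

print(convert_param_value("3.0", "number"))  # 3.0
```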

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
mtomcal added a commit to mtomcal/vllm-mlx that referenced this pull request Apr 4, 2026
Refactor streaming to use tested granular event builders instead of
inline dict construction, fixing the gap where tested code wasn't
production code (waybarrios#13). Fix text omission in completed events (waybarrios#6),
add [DONE] sentinel (waybarrios#8), use typed output models to prevent
cross-type field leakage (waybarrios#4, waybarrios#5), fix content join separator (waybarrios#10),
remove dead code branches (waybarrios#9, waybarrios#11), and warn on unrecognized content
types (waybarrios#7). Add Codex CLI setup guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
