
fix(mllm): concatenate text parts instead of overwriting in chat() #12

Closed

lubauss wants to merge 7 commits into waybarrios:main from
lubauss:patch/fixmllm-concatenate-text-parts-instead-of-overwrit

Conversation

@lubauss
Contributor

@lubauss lubauss commented Jan 20, 2026

Summary

Synced from local development patches.

Files changed

  • vllm_mlx/engine/batched.py
  • vllm_mlx/api/utils.py
  • vllm_mlx/models/mllm.py

🤖 Generated with Claude Code
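The PR title describes the underlying fix: when a chat message's content is a list of parts, each text part was overwriting the previous one, so only the last text segment reached the model. A rough sketch of the intended behavior follows; the helper name and structure are illustrative, not the actual code in vllm_mlx/models/mllm.py.

```python
def extract_text_and_images(content):
    """Collect all text parts and image URLs from an OpenAI-style content list."""
    if isinstance(content, str):
        return content, []

    text_parts = []
    images = []
    for part in content:
        if part.get("type") == "text":
            # Before the fix: text = part["text"]  (each part overwrote the last).
            # After the fix: accumulate and join, so no text part is lost.
            text_parts.append(part["text"])
        elif part.get("type") == "image_url":
            images.append(part["image_url"]["url"])
    return "\n".join(text_parts), images
```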

lubauss and others added 7 commits January 19, 2026 20:45
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming
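A minimal sketch of the stream_chat() flow described above. The mlx_vlm.stream_generate call and the new_text field follow these bullets; the exact stream_generate signature and the chunk's attribute names may differ between mlx_vlm versions, so treat this as illustrative rather than the committed code.

```python
from dataclasses import dataclass

import mlx_vlm  # stream_generate is the API the commit message names


@dataclass
class GenerationOutput:
    """Simplified stand-in for the engine's streaming output type."""
    new_text: str
    finished: bool = False


def stream_chat(model, processor, prompt, images=None, max_tokens=512):
    # Each chunk is assumed to expose only the newly decoded text as .text,
    # so it can be forwarded directly as an SSE delta.
    for chunk in mlx_vlm.stream_generate(
        model, processor, prompt, image=images, max_tokens=max_tokens
    ):
        yield GenerationOutput(new_text=chunk.text)
    yield GenerationOutput(new_text="", finished=True)
```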

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.
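For reference, a sketch of the pattern-based detection this commit extends. Only the two Gemma entries come from the commit message; the surrounding patterns and the helper name are illustrative.

```python
MLLM_PATTERNS = [
    "qwen2-vl", "qwen3-vl", "llava", "pixtral",  # illustrative existing entries
    "gemma-3", "gemma3",                          # added so Gemma 3 loads via mlx-vlm
]


def is_mllm(model_id: str) -> bool:
    """Treat a model as multimodal if its id matches any known VLM pattern."""
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)
```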

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
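A hedged sketch of what the documented mlx-vlm patch amounts to: read GEMMA3_SLIDING_WINDOW from the environment and use it in place of the hard-coded sliding-window size. The variable name comes from this commit; the function name and default value below are placeholders, not the actual mlx-vlm code.

```python
import os


def gemma3_sliding_window(default: int = 1024) -> int:
    """Return the sliding-window override, falling back to the model default."""
    value = os.environ.get("GEMMA3_SLIDING_WINDOW")
    return int(value) if value else default
```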

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer
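A rough sketch of the wrapper idea from the first bullet, assuming the batched engine expects a bare logits array while mlx-vlm's language models return a LanguageModelOutput-style object. The attribute names (.language_model, .logits) are assumptions rather than the verified vllm_mlx code.

```python
class MLLMModelWrapper:
    """Present an mlx-vlm multimodal model to the batched engine as a plain LM."""

    def __init__(self, model):
        self.model = model

    def __call__(self, inputs, cache=None):
        out = self.model.language_model(inputs, cache=cache)
        # Unwrap LanguageModelOutput so the scheduler sees raw logits.
        return out.logits if hasattr(out, "logits") else out
```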

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ching (#4)

Gemma 3's model __call__() requires pixel_values as a positional argument, unlike Qwen2-VL, which makes it optional. This caused "missing required positional argument: 'pixel_values'" errors when using continuous batching with text-only requests.

The MLLMModelWrapper now injects pixel_values=None for text-only requests,
enabling Gemma 3 to work with continuous batching and prefix caching.
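A minimal sketch of that injection, assuming the wrapper calls the full multimodal model directly; everything apart from the explicit pixel_values=None keyword is illustrative.

```python
class MLLMModelWrapper:
    def __init__(self, model):
        self.model = model

    def __call__(self, input_ids, cache=None):
        # Text-only continuous-batching requests carry no image tensor, but
        # Gemma 3's __call__ still requires the pixel_values argument, so
        # pass None explicitly instead of omitting it.
        return self.model(input_ids, pixel_values=None, cache=cache)
```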

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…batch mode (#5)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
#6)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lubauss lubauss closed this Jan 20, 2026
@lubauss lubauss deleted the patch/fixmllm-concatenate-text-parts-instead-of-overwrit branch January 20, 2026 01:09
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
…ats (#12)

- Added StreamOptions model with include_usage: bool
- Added stream_options to ChatCompletionRequest
- Added usage field to ChatCompletionChunk (optional, for final chunk)
- Modified stream_chat_completion() to track tokens and send usage in final chunk

This enables clients to request token usage statistics in streaming responses,
matching OpenAI API behavior.
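A hedged sketch of the schema additions listed above, in the Pydantic style of the OpenAI-compatible API layer. The field names follow the commit message; the surrounding fields and the UsageInfo shape are illustrative.

```python
from typing import Optional

from pydantic import BaseModel


class StreamOptions(BaseModel):
    include_usage: bool = False


class UsageInfo(BaseModel):  # assumed usage payload shape
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


class ChatCompletionRequest(BaseModel):
    model: str
    messages: list
    stream: bool = False
    stream_options: Optional[StreamOptions] = None  # new field


class ChatCompletionChunk(BaseModel):
    id: str
    choices: list
    usage: Optional[UsageInfo] = None  # populated only on the final chunk
```

With include_usage set, the server tracks prompt and completion tokens during streaming and attaches a usage object to the final chunk only, matching OpenAI's behavior.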

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 7, 2026
* eval: upgrade coding/reasoning/general suites with standard benchmarks

Replace easy eval questions (score clustering at 90-100%) with harder
standard benchmark problems for better model differentiation:

- Coding: 10 HumanEval+ problems (medium to hard), max_tokens 800→1200
- Reasoning: 10 MATH-500 problems (levels 2-5) with fraction support
- General: 10 MMLU-Pro multiple choice questions with answer_letter check

Delete old result JSONs since scores are incomparable across suite versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cache clearing between eval suites, selective enable_thinking

- Add POST /v1/cache/clear endpoint to server (clears SimpleEngine prompt cache)
- Eval calls cache/clear between suites to prevent KV cache pollution
- Only disable thinking for general suite (MCQ); coding/reasoning keep thinking enabled
- Bump general max_tokens to 2048 for thinking models
- Re-run all 11 models with corrected eval pipeline
- Update scorecard methodology text
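A minimal FastAPI-style sketch of the cache-clear endpoint described in the first bullet. The route path comes from the commit message; the in-memory dict here stands in for the real SimpleEngine prompt cache.

```python
from fastapi import FastAPI

app = FastAPI()
prompt_cache: dict = {}  # stand-in for the SimpleEngine prompt cache


@app.post("/v1/cache/clear")
async def clear_cache():
    # Dropping cached prefixes keeps one eval suite's KV entries from
    # bleeding into the next suite's runs.
    prompt_cache.clear()
    return {"status": "ok"}
```

The eval harness then POSTs to /v1/cache/clear between suites, as the second bullet describes.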

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* eval: re-run all 11 models with enable_thinking=false globally

Disable thinking mode for ALL eval suites (not just general) to get
accurate scores without thinking tokens eating into max_tokens.
Update methodology text and regenerate scorecard.

Key improvements:
- Qwen3.5-122B-8bit: Reasoning 20%→90%, Coding 80%→90%
- Qwen3.5-35B-8bit: Coding 20%→90%, Reasoning 10%→80%
- GLM-4.7-Flash: Coding 10%→100%, Reasoning 40%→90%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add TODOs for anomalous eval scores to investigate

- MiniMax-M2.5: 10% coding despite strong scores elsewhere
- GLM-4.7-Flash: 50% general despite 100% coding / 90% reasoning
- GPT-OSS-20B: 17% tools / 20% reasoning with minimax parser

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>