Conversation
Synced from local patches in .venv-vllm-mlx Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
lubauss
pushed a commit
that referenced
this pull request
Jan 20, 2026
lubauss
added a commit
that referenced
this pull request
Jan 20, 2026
Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
lubauss
added a commit
that referenced
this pull request
Jan 20, 2026
Synced from local patches in .venv-vllm-mlx Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
lubauss
pushed a commit
that referenced
this pull request
Jan 20, 2026
lubauss
added a commit
that referenced
this pull request
Jan 20, 2026
* Fix --api-key argument for serve command (fixes #7) * Document --api-key, --rate-limit and --timeout options in CLI reference * fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2) * fix: Enable vision and streaming for MLLM models This patch fixes two critical issues with multimodal language models (MLLM): ## Vision Fix (server.py, simple.py) - Preserve original messages when calling MLLM models - The engine was passing only the prompt string, losing image data - Now passes full message objects with images to MLLM.chat() ## Streaming Fix (mllm.py, simple.py) - Add stream_chat() method to MLLMMultimodalLM class - Uses mlx_vlm.stream_generate() for true token-by-token streaming - Update engine to call stream_chat() for MLLM models - Properly yields GenerationOutput with new_text for SSE streaming Tested with: - mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit - Text streaming: 5 tokens streamed correctly - Vision streaming: Image analysis works with streaming Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat: Add Gemma 3 to MLLM detection patterns Gemma 3 models are multimodal but weren't being detected as VLMs. This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx correctly loads them with vision support via mlx-vlm. Tested with mlx-community/gemma-3-27b-it-4bit: - Vision: ✅ Working (cat, Kali, Ganesha images) - Streaming: ✅ Working (40 chunks) - Long context: ✅ Up to ~5K tokens Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * docs: Add Gemma 3 support section with long context patch instructions - Document Gemma 3 MLLM detection (already patched in utils.py) - Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var - Include benchmark results showing 5x improvement (10K → 50K tokens) - Explain Metal GPU timeout limitation and workaround --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> * fix: disable skip_prompt_processing for multimodal to prevent garbled output For MLLM with images, skip_prompt_processing cannot be used because: - Vision encoder must run each time to provide visual context - The skip path only calls language_model() which has no vision - Using it produces garbled output like 'TheTheTheThe...' Text-only caching still works with 6x+ speedup. Multimodal correctly gets no speedup but produces coherent output. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Wayner Barrios <waybarrios@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Merging local patches to main