merge: sync upstream + 14-model benchmark (v0.2.7 → v0.3.x)#67
Merged
raullenchai merged 69 commits intomainfrom Apr 6, 2026
Merged
merge: sync upstream + 14-model benchmark (v0.2.7 → v0.3.x)#67raullenchai merged 69 commits intomainfrom
raullenchai merged 69 commits intomainfrom
Conversation
Replace per-token tokenizer.decode([token]) with a streaming detokenizer that buffers partial UTF-8 byte sequences. This fixes corrupted multi-byte characters (e.g. Czech 'ď' → '��') during SSE streaming, caused by byte-level tokens being decoded individually instead of accumulated until a complete UTF-8 character boundary. Uses mlx_lm's NaiveStreamingDetokenizer (or the optimized BPEStreamingDetokenizer when available via tokenizer.detokenizer) with a per-request pool that is cleaned up on request completion. Both LLM scheduler and MLLM scheduler are fixed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow users to serve a model under a different name in API responses, matching vLLM's --served-model-name behavior.
The cache directory was derived from _model_name which could be overridden by --served-model-name, causing cache misses when the served name changed. Use the actual model path instead.
…ol streaming - Add strict=False fallback in tokenizer loader for models with extra weights (e.g., vision tower params), enabling Qwen3.5 to load via mlx-lm as a text-only model - Fix streaming tool call parsing when both --reasoning-parser and --tool-call-parser are enabled (previously mutually exclusive branches) - Make memory pressure threshold dynamic based on system RAM instead of hardcoded 200GB Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes AttributeError when ArraysCache.is_trimmable() returns True but the trim() method doesn't exist. Added hasattr check for trim before calling it in scheduler.py lines 772 and 802. Closes #145
…odels Qwen3.5 uses a hybrid architecture (Attention + Mamba/SSM layers), where `model.make_cache()` returns a mix of `KVCache` and `ArraysCache` objects. `ArraysCache.__init__()` requires a `size` parameter, but `BatchMambaCache` conditionally skipped it when `HAS_MAMBA_CACHE=True`. Since `MambaCache` was removed in mlx-lm >= 0.30.6 and falls back to `ArraysCache`, the `HAS_MAMBA_CACHE` flag is unreliable. This caused `--continuous-batching` mode to crash in an infinite error loop: `ArraysCache.__init__() missing 1 required positional argument: 'size'` The fix unconditionally passes `size` to `super().__init__()`, which is safe for both `ArraysCache` (requires it) and legacy `MambaCache` (accepts it). Without this fix, continuous batching and prefix caching are completely broken for Qwen3.5 models on Apple Silicon. Related upstream issues: - ml-explore/mlx-lm#980 (prefix cache fails for hybrid models) - QwenLM/Qwen3.6#37 (ArraysCache vs KVCache in hybrid arch)
mlx-lm 0.31.0 added prompt_checkpoints support, changing the BatchGenerator.insert() tuple from 6 elements to 7. This causes "ValueError: too many values to unpack (expected 6)" in _chunked_next when processing any request. Changes: - scheduler.py line ~395: unpack 7 values (add _prompt_checkpoints) - scheduler.py line ~406: pass max_kv_size=None to _make_cache() (signature changed in mlx-lm 0.31.0 to require 3 args) Tested on Mac Mini M4 Pro 64GB with: - mlx-lm 0.31.0 - mlx 0.31.1 - Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit - vllm-mlx 0.2.5 (this fork) Fixes the same issue as jundot/omlx#110.
Three bugs fixed:
1. video_url content type silently ignored in MLLM chat() and stream_chat().
The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}}
but only "video" type was handled. Fixes #120.
2. Video frames extracted AFTER chat template built, causing token count
mismatch (template has 0 image tokens but vision encoder produces N*frame
features). Restructured to two-pass approach: extract video frames first,
then build chat template with correct frame counts.
3. server.py has_media always False in MLLM mode because images/videos are
extracted from messages internally (set to []). Added MLLM-specific check
so video_fps/video_max_frames params still reach chat() via chat_kwargs.
For models with video_token_id (Qwen-family), video inputs now flow through mlx-vlm's native video pipeline instead of being treated as individual images. This activates: - 3D conv frame pairing (temporal_patch_size=2) - M-RoPE temporal position IDs (interleaved layout) - Timestamp-frame interleaving in the prompt - Proper video_grid_thw for the vision encoder Falls back to frame-as-images for non-video models. Adds _generate_native_video() and _translate_messages_for_native_video() to MLXMultimodalLM, plus unit tests for video URL parsing, frame count alignment, and message translation.
… (#180) * feat: MLLM+MTP per-request routing for text and vision When both --mllm and --enable-mtp are set, SimpleEngine builds a parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy). Text-only requests route to mlx_lm with MTP speculative decoding; media requests route to the mlx_vlm MLLM path. Key components: - text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights - Per-request routing in stream_chat() via _has_media_content() - _stream_generate_text() for MTP-accelerated text generation - MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit): - Text (MTP): 65.3 tok/s - Vision (MLLM): 63.8 tok/s - Memory: 38.7 GB (zero-copy, same as single model) * feat: system prompt KV caching for SimpleEngine MTP text path Persist backbone KV cache after prefilling system prompt tokens. On subsequent requests with the same system prompt, restore the snapshot and only prefill the suffix (user + history) tokens. For a 10K-token system prompt on the 122B model, this saves ~57s per request by avoiding redundant system prompt prefill. Implementation: - Detect system prefix via ChatML boundary markers - Hash prefix text for cache key validation - On cache miss: prefill system tokens, snapshot backbone KV state - On cache hit: restore snapshot into fresh cache, send suffix only - Token prefix validation ensures correct split at tokenization boundary - Single-entry cache (one system prompt at a time) - Stats exposed via get_stats() → system_kv_cache - Cache cleared on stop(), invalidated on system prompt change * feat: SpecPrefill — attention-based sparse prefill for TTFT reduction Uses a small draft model to identify important prompt tokens via attention scoring, then sparse-prefills the target model with only those tokens while preserving original positional encoding via manual RoPE. Reduces TTFT 2.8-3.1x on 122B and 1.8x on 35B at 20% keep rate. Implementation: - specprefill.py: Core module with score_tokens(), select_chunks(), sparse_prefill(), cleanup_rope() (~640 lines) - SimpleEngine integration: draft model loading, threshold-based activation, composition with system prompt KV cache, graceful fallback on error - Per-request API: specprefill (bool) + specprefill_keep_pct (float) via extra_body for per-request control - CLI: --specprefill, --specprefill-threshold, --specprefill-keep-pct, --specprefill-draft-model, --prefill-step-size Closes #179. Related: #178 (TTFT), #57 (speculative decoding). * feat: multi-architecture support for SpecPrefill scoring and sparse prefill Add support for three model architecture families with auto-detection: - Qwen3.5: gate split + q_norm + RoPE (existing, now refactored) - Nemotron-H: content-based attention (no RoPE), mixer attr, compacted cache - GPT-OSS/Llama: standard q_proj + RoPE (GQA, YarnRoPE compatible) Key changes: - Architecture-specific query extractors (_qwen35, _llama, _nemotron_h) - Auto-detection in score_tokens() via model attributes (q_norm/rope/mixer) - _get_attn_module()/_set_attn_module() abstract self_attn vs mixer access - _find_attention_layers() handles block_type="*" (Nemotron-H attention) - _build_layer_to_cache_map() handles compacted cache indexing - sparse_prefill() skips RoPE patching for architectures without it - cleanup_rope() is no-op for RoPE-less architectures - Remove score_tokens_self() stub (CritiPrefill not viable for MoE) Tested on Qwen3.5 4B (positions + pipeline). Nemotron-H and GPT-OSS code paths ready for empirical validation. * fix: handle GPT-OSS sliding window caches and head attribute naming Two bugs found during cross-architecture testing on GPT-OSS 120B: 1. _llama_extract_queries() used eager evaluation in getattr fallback chain: getattr(attn, "num_attention_heads", attn.num_heads) evaluates attn.num_heads before checking if num_attention_heads exists. Fixed to use safe nested getattr with None default. 2. _compute_importance() concatenated score matrices with different shapes when mixing sliding window (128-token RotatingKVCache) and full attention (unlimited KVCache) layers. Fixed by skipping layers whose cache spans fewer tokens than the full prompt. Validated on GPT-OSS 120B + 20B draft: importance-based selection produces coherent output while uniform selection degrades, confirming scoring signal from 18 full-attention layers is sufficient. * fix: preserve tail tokens for models with RotatingKVCache Models with sliding window attention (e.g., GPT-OSS alternating sliding/full layers) use RotatingKVCache that evicts old entries. When sparse prefill inserts more tokens than the window size, the cache loses context needed for decode. sparse_prefill() now auto-detects RotatingKVCache and augments the selection to include the last max_size positions, ensuring sliding window layers have valid recent context. Validated: GPT-OSS 120B + 20B draft produces coherent output on 2294-token prompts (was garbage before this fix). Qwen3.5 and Nemotron-H unaffected (no RotatingKVCache in their cache). * feat: SpecPrefill support for non-MTP models (standard LLM path) Add _stream_generate_specprefill() method for models that don't use MTP speculative decoding (Nemotron, GPT-OSS, etc). The existing SpecPrefill integration only worked in the MTP text path (_stream_generate_text). Changes: - stream_generate() now pops specprefill/specprefill_keep_pct from kwargs and dispatches to the new method when conditions are met - _stream_generate_specprefill() follows the same pattern as the MTP path: score → select → sparse_prefill → autoregressive generation - Graceful fallback to normal generation on any error - Per-request overrides (specprefill, specprefill_keep_pct) via extra_body - Threshold and upper-bound checks identical to MTP path
…strict=False loader
Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming
…enerate wiring - Forward tools to apply_chat_template in native video path (fixes silent tool-call drop, regression from PR #124) - Pop tools, use_cache, video_fps, video_max_frames from kwargs before native video branch in chat() and stream_chat() to prevent leaking into mlx_vlm.generate() - Extract _collect_video_inputs() to deduplicate video extraction between chat() and stream_chat() - Split _generate_native_video into _prepare_native_video_inputs (preprocessing) + _generate_native_video (generation) wired through mlx_vlm.video_generate for clearer intent and easier adoption of upstream improvements - Add ImportError guard on video_generate import in _generate_native_video to match codebase convention - Document blocking stream_chat native video path — no upstream streaming API, engine wraps in asyncio.to_thread() - Add tests for multi-message videos, multiple videos per message, video_url translation, Pydantic handling, tool forwarding, video_generate import verification
Add --served-model-name CLI parameter
…injection - ensure_mamba_support() now no-op: mlx-lm >= 0.30.6 ArraysCache has native batch support; old patch broke hybrid models (ArraysCache + KVCache) - Add inject_mtp_support(): dynamically create MTP module, load weights, and monkey-patch model class with return_hidden/mtp_forward/make_mtp_cache - Add _try_inject_mtp_post_load: auto-detect and inject MTP weights stripped by sanitize() during mlx_lm.load() - Add strict=False fallback for models with extra MTP parameters - validate_mtp_support: support model.language_model.args hierarchy - Improve engine loop error logging with full traceback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: native Qwen3-VL video support in MLLM mode
…injection fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection
Truncating the string causes similar-but-not-the-same base64 JPGs to return the same hash, causing vllm-mlx to use the same cached image for all of them, resulting in duplicated and incorrect responses.
fix: report prompt_tokens correctly for LLM models in SimpleEngine
…4-image-hash fix: Don’t truncate base64 images before hashing
…mpat fix: compatibility with mlx-lm 0.31.x (prompt_checkpoints tuple)
…batching fix: bump mlx-lm minimum to 0.31.0 for hybrid model batching
fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models
fix: rename platform.py to vllm_platform.py to avoid stdlib shadowing
fix: Use streaming detokenizer for UTF-8-safe incremental decode
Tool call XML (e.g. <minimax:tool_call>, <tool_call>) was leaking into streaming text deltas via the /v1/messages endpoint. The raw markup appeared in the client's conversation context alongside the structured tool_use block, doubling token consumption for every tool call. Add StreamingToolCallFilter that buffers streaming text and suppresses content inside tool call blocks. Handles tags split across multiple deltas, multiple tool calls per response, and preserves <think> blocks. Supports MiniMax (<minimax:tool_call>) and Qwen (<tool_call>) formats. 14 unit tests included. Fixes #129
Add [Calling tool: ...)] to the streaming filter tag list. MiniMax-M2.5 uses this format for some tool calls alongside its native XML format. Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
MiniMax generates multiple tool call formats: - <minimax:tool_call> XML (native) - <tool_call> (Qwen) - [Calling tool: ...] and [Calling tool=...] (bracket variants) - [TOOL_CALL]...[/TOOL_CALL] (block format) Consolidate bracket variants under single [Calling tool prefix with newline as delimiter. Add [TOOL_CALL] block format. Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
Add <function=name>...</function> (Llama-style) to filtered tags. Now covers all formats supported by parse_tool_calls(): - MiniMax XML, Qwen XML, Qwen3 bracket, Llama function, Nemotron (via <tool_call>), and [TOOL_CALL] block. Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
Add StreamingThinkRouter that separates thinking from response text. Models that inject <think> in the generation prompt (MiniMax, Qwen3, DeepSeek-R1) are auto-detected from the chat template. Stream pipeline: raw text → tool call filter → think router → emit Thinking content emits as Anthropic thinking content blocks (thinking_delta) so clients render them distinctly from responses.
uv.lock is not tracked upstream - accidentally included. Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
_stream_anthropic_messages() never read prompt_tokens from the engine, always reporting 0 input_tokens. Now tracks prompt_tokens alongside completion_tokens and includes input_tokens in message_delta usage. Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
…ock emission Addresses PR #232 review feedback from Thump604: 1. StreamingThinkRouter unit tests (18 tests): - start_in_thinking mode and transition to text - Partial tag handling (held back, split across deltas, false alarms) - Multiple think blocks, token-by-token streaming - Flush behavior and state reset 2. Integration tests (12 tests): - Full pipeline: tool_filter → think_router → SSE events - Pure text, thinking→text, start_in_thinking→text - Tool call suppression with accumulated text preserved - Mixed thinking + text + tool calls - Block index increment verification 3. Refactored _emit_content_pieces() helper: - Extracts block transition logic (was repeated 3x in server.py) - Handles block_start/stop/delta emission - Returns updated state (block_type, index) for caller 44 tests passing across filter, router, and integration suites. Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
1. [Calling tool] close tag changed from "\n" to "]\n" to prevent premature close on multi-line JSON args. Added tests for bracket- style and multi-line tool calls. 2. Buffer safety cap (1MB) on unclosed tool call blocks with warning log when exceeded. Prevents unbounded memory growth from pathological input. Added test for cap behavior. 3. accumulated_text now tracks raw delta_text before special token cleaning, ensuring tool call parsing is independent of the SPECIAL_TOKENS_PATTERN. Matches integration test behavior. 47 tests passing (44 existing + 3 new). Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
…ent-leak Approved by janhilgard, fixes streaming tool call XML leaking into content. Merging.
… blocks, detokenizer, v0.2.7 Merge 27 upstream commits (d235c37..b4fa030) into our fork: - feat: route <think> blocks to Anthropic thinking content blocks (#232) - fix: suppress tool call XML from streaming text content (#129) - fix: streaming detokenizer for UTF-8-safe incremental decode - fix: rename platform.py to vllm_platform.py (stdlib shadowing) - fix: don't truncate base64 images before hashing - fix: bump mlx-lm minimum to 0.31.0 for hybrid model batching - fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 - fix: compatibility with mlx-lm 0.31.x (prompt_checkpoints tuple) - fix: report prompt_tokens correctly for LLM models in SimpleEngine - fix: clean up detokenizer pool in abort, reset, error recovery - bump version to 0.2.7 Conflict resolution (8 files, 17 conflicts): - Keep all fork features (DeltaNet snapshots, fast SSE, tool injection, cloud routing, MTP routing, alias registry, Anthropic adapter) - Incorporate upstream's StreamingToolCallFilter, StreamingThinkRouter, NaiveStreamingDetokenizer, platform rename, base64 hash fix Self-reviewed 3 rounds. All 1968 tests pass (17 skipped). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The merge accidentally dropped the `return model, tokenizer` after the successful `load()` call in tokenizer.py. This caused all model loading to return None and crash with "cannot unpack non-iterable NoneType". Also update test_api_models owned_by assertion to match our branding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verified merge integrity with end-to-end benchmarks: - Qwen3.5-35B-A3B 8bit: 83.1 tok/s, 100% tools, 0% leak - MiniMax-M2.5 4bit: 51.7 tok/s, 100% tools, 0% leak - Qwen3.5-4B 4bit: 161.5 tok/s, 100% tools, 0% leak - GLM-4.5-Air 4bit: 100% tools (decode anomaly — model stops early) All results consistent with pre-merge README data. 1968 unit tests + 7 e2e tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmarked 14 models post-merge on Mac Studio M3 Ultra (256GB): Qwen family (100% tools): - Qwen3.5-4B 4bit: 161.5 tok/s, 2.9 GB - Qwen3.5-9B 4bit: 99.8 tok/s, 5.4 GB - Qwen3.5-27B 4bit: 39.0 tok/s, 14.8 GB - Qwen3.5-35B-A3B 8b: 83.1 tok/s, 35.0 GB - Qwen3-Coder-Next 4b: 74.5 tok/s, 42.4 GB - MiniMax-M2.5 4bit: 51.7 tok/s, 120.4 GB Non-Qwen: - Llama-3.2-3B: 226.5 tok/s (fastest, no tools) - Hermes-3-8B: 123.4 tok/s (no tools) - Phi-4-mini: 174.0 tok/s (no tools, 100% leak) - Gemma-3-12B: 48.4 tok/s (no tools) - Mistral-Small: pending (see json) - Devstral-24B: 29.6 tok/s (no tools) - GPT-OSS-20B: 58.5 tok/s (no tools) - GLM-4.5-Air: ~49 tok/s (100% tools) Agent integration verified: - LangChain: basic chat + tool calling ✓ - OpenAI SDK: streaming + tool calling ✓ - Aider-style: code editing + multi-turn ✓ - OpenCode-style: streaming tool calls ✓ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Merge 43 upstream commits from waybarrios/vllm-mlx (22dcbf8..b4fa030) plus bug fixes and benchmarks.
Upstream features merged:
rapid-mlx run <agent>one-command startup #150)<think>→ thinking content blocks (#232)Our fixes:
returnin load_model_with_fallback success pathVerification:
🤖 Generated with Claude Code