Sync upstream: SpecPrefill, native video, MTP injection by raullenchai · Pull Request #58 · raullenchai/Rapid-MLX

raullenchai · 2026-03-26T20:58:26Z

Summary

Merge 16 upstream commits (22dcbf8..d235c37) from waybarrios/vllm-mlx:

- SpecPrefill (waybarrios/vllm-mlx#180): attention-based sparse prefill for TTFT reduction
- Native Qwen3-VL video (waybarrios/vllm-mlx#150): temporal 3D conv + M-RoPE pipeline
- MTP auto-injection (waybarrios/vllm-mlx#97): also disables the MambaCache monkey-patch for hybrid models
- --served-model-name (waybarrios/vllm-mlx#125): custom model name for API responses
- Qwen3.5 text-only loading (waybarrios/vllm-mlx#127): also makes the memory pressure threshold dynamic
- Adaptive cache clearing (waybarrios/vllm-mlx#157): periodic cache clearing in the MLLM scheduler
- Metal resource leak fix (waybarrios/vllm-mlx#92): fixes a Metal resource leak under high concurrency

Conflict Resolution

9 files had conflicts (34 conflict points). Strategy:

- Keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, logprobs streaming)
- Incorporate upstream's new params (mtp, specprefill_*, served_model_name, _model_path)
- Keep our _model_name or request.model fallback pattern
- Keep our fast SSE streaming path over upstream's Pydantic-based streaming

New files from upstream

- vllm_mlx/specprefill.py: SpecPrefill implementation
- vllm_mlx/text_model_from_vlm.py: text-only loading from VLM weights
- tests/test_video.py, tests/test_mllm_mtp_routing.py, tests/test_text_model_from_vlm.py

Test plan

- ruff check --select E9,F63,F7,F82: no syntax or undefined-name errors
- 56/58 tests pass (the 2 event_loop tests need a running server, so their failure is expected)
- Manual test with a Qwen3.5 model to verify that SpecPrefill and DeltaNet snapshots coexist
- Verify that the --served-model-name flag works

🤖 Generated with Claude Code


…ol streaming

- Add strict=False fallback in tokenizer loader for models with extra weights (e.g., vision tower params), enabling Qwen3.5 to load via mlx-lm as a text-only model
- Fix streaming tool call parsing when both --reasoning-parser and --tool-call-parser are enabled (previously mutually exclusive branches)
- Make memory pressure threshold dynamic based on system RAM instead of hardcoded 200GB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
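A minimal sketch of what a RAM-relative threshold could look like, in place of the hardcoded 200GB value mentioned above. The function name and the 0.85 default fraction are assumptions made for illustration; the commit does not state how the threshold is actually computed.

```python
def memory_pressure_threshold(total_ram_bytes: int, fraction: float = 0.85) -> int:
    """Return a memory-pressure threshold in bytes, scaled to the machine's RAM.

    `fraction` is a hypothetical tuning knob, not a value taken from the project.
    """
    if total_ram_bytes <= 0:
        raise ValueError("total_ram_bytes must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_ram_bytes * fraction)


# Example: a 64 GiB machine with a 50% threshold.
GIB = 1024**3
threshold = memory_pressure_threshold(64 * GIB, fraction=0.5)
```

The caller is expected to supply the total RAM figure from whatever system query it already uses, so the function itself stays easy to test.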


Three bugs fixed:

1. video_url content type silently ignored in MLLM chat() and stream_chat(). The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}} but only "video" type was handled. Fixes waybarrios#120.
2. Video frames extracted AFTER chat template built, causing token count mismatch (template has 0 image tokens but vision encoder produces N*frame features). Restructured to two-pass approach: extract video frames first, then build chat template with correct frame counts.
3. server.py has_media always False in MLLM mode because images/videos are extracted from messages internally (set to []). Added MLLM-specific check so video_fps/video_max_frames params still reach chat() via chat_kwargs.
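The first bug can be illustrated with a small content-part walker that accepts both part types. This is a sketch only: collect_video_urls() is a hypothetical helper, and the shape assumed for the older "video" part ({"type": "video", "video": <path or URL>}) is a guess, since the commit only spells out the video_url format.

```python
from typing import Any


def collect_video_urls(messages: list[dict[str, Any]]) -> list[str]:
    """Return video sources from OpenAI-style chat messages, in message order."""
    urls: list[str] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no media parts
        for part in content:
            if not isinstance(part, dict):
                continue
            kind = part.get("type")
            if kind == "video_url":
                # Format quoted in the commit: {"type": "video_url", "video_url": {"url": ...}}
                url = (part.get("video_url") or {}).get("url")
            elif kind == "video":
                # Assumed shape for the previously handled type.
                url = part.get("video")
            else:
                url = None
            if url:
                urls.append(url)
    return urls
```

Running this pass before the chat template is built also matches the two-pass ordering described in the second fix: the frame sources are known before any media tokens are counted.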

For models with video_token_id (Qwen-family), video inputs now flow through mlx-vlm's native video pipeline instead of being treated as individual images. This activates:

- 3D conv frame pairing (temporal_patch_size=2)
- M-RoPE temporal position IDs (interleaved layout)
- Timestamp-frame interleaving in the prompt
- Proper video_grid_thw for the vision encoder

Falls back to frame-as-images for non-video models. Adds _generate_native_video() and _translate_messages_for_native_video() to MLXMultimodalLM, plus unit tests for video URL parsing, frame count alignment, and message translation.
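To make the frame pairing concrete, here is a toy sketch of grouping frames into temporal patches of size 2. It assumes an odd frame count is padded by repeating the last frame; that padding rule and the pair_frames() name are illustrative assumptions, not something the commit states about mlx-vlm.

```python
def pair_frames(frames: list, temporal_patch_size: int = 2) -> list[tuple]:
    """Group frames into consecutive temporal patches, padding with the last frame."""
    if not frames:
        return []
    padded = list(frames)
    remainder = len(padded) % temporal_patch_size
    if remainder:
        # Assumed padding rule: repeat the final frame until the count divides evenly.
        padded.extend([padded[-1]] * (temporal_patch_size - remainder))
    return [
        tuple(padded[i : i + temporal_patch_size])
        for i in range(0, len(padded), temporal_patch_size)
    ]


patches = pair_frames(["f0", "f1", "f2"])
temporal_grid = len(patches)  # the temporal entry a video_grid_thw-style triple would carry
```

Under these assumptions the number of patches, not the raw frame count, is what the prompt's video tokens have to line up with.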


…waybarrios#180)

* feat: MLLM+MTP per-request routing for text and vision

When both --mllm and --enable-mtp are set, SimpleEngine builds a parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy). Text-only requests route to mlx_lm with MTP speculative decoding; media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

* feat: system prompt KV caching for SimpleEngine MTP text path

Persist backbone KV cache after prefilling system prompt tokens. On subsequent requests with the same system prompt, restore the snapshot and only prefill the suffix (user + history) tokens. For a 10K-token system prompt on the 122B model, this saves ~57s per request by avoiding redundant system prompt prefill.

Implementation:
- Detect system prefix via ChatML boundary markers
- Hash prefix text for cache key validation
- On cache miss: prefill system tokens, snapshot backbone KV state
- On cache hit: restore snapshot into fresh cache, send suffix only
- Token prefix validation ensures correct split at tokenization boundary
- Single-entry cache (one system prompt at a time)
- Stats exposed via get_stats() → system_kv_cache
- Cache cleared on stop(), invalidated on system prompt change

* feat: SpecPrefill - attention-based sparse prefill for TTFT reduction

Uses a small draft model to identify important prompt tokens via attention scoring, then sparse-prefills the target model with only those tokens while preserving original positional encoding via manual RoPE. Reduces TTFT 2.8-3.1x on 122B and 1.8x on 35B at 20% keep rate.

Implementation:
- specprefill.py: Core module with score_tokens(), select_chunks(), sparse_prefill(), cleanup_rope() (~640 lines)
- SimpleEngine integration: draft model loading, threshold-based activation, composition with system prompt KV cache, graceful fallback on error
- Per-request API: specprefill (bool) + specprefill_keep_pct (float) via extra_body for per-request control
- CLI: --specprefill, --specprefill-threshold, --specprefill-keep-pct, --specprefill-draft-model, --prefill-step-size

Closes waybarrios#179. Related: waybarrios#178 (TTFT), #57 (speculative decoding).

* feat: multi-architecture support for SpecPrefill scoring and sparse prefill

Add support for three model architecture families with auto-detection:
- Qwen3.5: gate split + q_norm + RoPE (existing, now refactored)
- Nemotron-H: content-based attention (no RoPE), mixer attr, compacted cache
- GPT-OSS/Llama: standard q_proj + RoPE (GQA, YarnRoPE compatible)

Key changes:
- Architecture-specific query extractors (_qwen35, _llama, _nemotron_h)
- Auto-detection in score_tokens() via model attributes (q_norm/rope/mixer)
- _get_attn_module()/_set_attn_module() abstract self_attn vs mixer access
- _find_attention_layers() handles block_type="*" (Nemotron-H attention)
- _build_layer_to_cache_map() handles compacted cache indexing
- sparse_prefill() skips RoPE patching for architectures without it
- cleanup_rope() is no-op for RoPE-less architectures
- Remove score_tokens_self() stub (CritiPrefill not viable for MoE)

Tested on Qwen3.5 4B (positions + pipeline). Nemotron-H and GPT-OSS code paths ready for empirical validation.

* fix: handle GPT-OSS sliding window caches and head attribute naming

Two bugs found during cross-architecture testing on GPT-OSS 120B:

1. _llama_extract_queries() used eager evaluation in getattr fallback chain: getattr(attn, "num_attention_heads", attn.num_heads) evaluates attn.num_heads before checking if num_attention_heads exists. Fixed to use safe nested getattr with None default.
2. _compute_importance() concatenated score matrices with different shapes when mixing sliding window (128-token RotatingKVCache) and full attention (unlimited KVCache) layers. Fixed by skipping layers whose cache spans fewer tokens than the full prompt.

Validated on GPT-OSS 120B + 20B draft: importance-based selection produces coherent output while uniform selection degrades, confirming scoring signal from 18 full-attention layers is sufficient.

* fix: preserve tail tokens for models with RotatingKVCache

Models with sliding window attention (e.g., GPT-OSS alternating sliding/full layers) use RotatingKVCache that evicts old entries. When sparse prefill inserts more tokens than the window size, the cache loses context needed for decode. sparse_prefill() now auto-detects RotatingKVCache and augments the selection to include the last max_size positions, ensuring sliding window layers have valid recent context.

Validated: GPT-OSS 120B + 20B draft produces coherent output on 2294-token prompts (was garbage before this fix). Qwen3.5 and Nemotron-H unaffected (no RotatingKVCache in their cache).

* feat: SpecPrefill support for non-MTP models (standard LLM path)

Add _stream_generate_specprefill() method for models that don't use MTP speculative decoding (Nemotron, GPT-OSS, etc). The existing SpecPrefill integration only worked in the MTP text path (_stream_generate_text).

Changes:
- stream_generate() now pops specprefill/specprefill_keep_pct from kwargs and dispatches to the new method when conditions are met
- _stream_generate_specprefill() follows the same pattern as the MTP path: score → select → sparse_prefill → autoregressive generation
- Graceful fallback to normal generation on any error
- Per-request overrides (specprefill, specprefill_keep_pct) via extra_body
- Threshold and upper-bound checks identical to MTP path
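The select step and the tail-preservation fix described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in specprefill.py: select_positions() is a hypothetical name, the chunk size and mean-score ranking are guesses, and real importance scores would come from the draft model's attention rather than a hand-written list.

```python
def select_positions(scores, keep_pct=0.2, chunk_size=4, tail=0):
    """Pick prompt positions to prefill, returned as ascending original indices."""
    n = len(scores)
    if n == 0:
        return []
    # Score each fixed-size chunk by the mean importance of its tokens.
    chunks = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        chunks.append((sum(scores[start:end]) / (end - start), start, end))
    # Keep the highest-scoring chunks until the token budget is met.
    budget = max(1, int(n * keep_pct))
    kept = set()
    for _, start, end in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(kept) >= budget:
            break
        kept.update(range(start, end))
    # Always keep the last `tail` positions so sliding-window layers see recent context.
    kept.update(range(max(0, n - tail), n))
    return sorted(kept)


scores = [0.1] * 4 + [0.9] * 4 + [0.2] * 4 + [0.1] * 4
positions = select_positions(scores, keep_pct=0.25, chunk_size=4, tail=2)
```

Because the function returns original indices rather than renumbering the kept tokens, a caller can still apply positional encoding at the true prompt positions, which is the property the SpecPrefill commit attributes to its manual RoPE handling.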


…y, video_generate wiring

- Forward tools to apply_chat_template in native video path (fixes silent tool-call drop, regression from PR waybarrios#124)
- Pop tools, use_cache, video_fps, video_max_frames from kwargs before native video branch in chat() and stream_chat() to prevent leaking into mlx_vlm.generate()
- Extract _collect_video_inputs() to deduplicate video extraction between chat() and stream_chat()
- Split _generate_native_video into _prepare_native_video_inputs (preprocessing) + _generate_native_video (generation) wired through mlx_vlm.video_generate for clearer intent and easier adoption of upstream improvements
- Add ImportError guard on video_generate import in _generate_native_video to match codebase convention
- Document blocking stream_chat native video path: no upstream streaming API, engine wraps in asyncio.to_thread()
- Add tests for multi-message videos, multiple videos per message, video_url translation, Pydantic handling, tool forwarding, video_generate import verification


…injection

- ensure_mamba_support() now no-op: mlx-lm >= 0.30.6 ArraysCache has native batch support; old patch broke hybrid models (ArraysCache + KVCache)
- Add inject_mtp_support(): dynamically create MTP module, load weights, and monkey-patch model class with return_hidden/mtp_forward/make_mtp_cache
- Add _try_inject_mtp_post_load: auto-detect and inject MTP weights stripped by sanitize() during mlx_lm.load()
- Add strict=False fallback for models with extra MTP parameters
- validate_mtp_support: support model.language_model.args hierarchy
- Improve engine loop error logging with full traceback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill - attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…, use_cache double-pop

- P1: _validate_model_name() now accepts _model_alias and _model_path so alias-based requests don't 404 before legacy check runs.
- P1: build_text_model() resolves Hub repo IDs via snapshot_download (no-op if cached) so MLLM+MTP routing works for non-local models.
- P2: Non-streaming chat() now routes text-only requests through _stream_generate_text() matching stream_chat() behavior.
- P2: Remove duplicate kwargs.pop("use_cache", True) in mllm.py that overwrote the caller's value after the first pop consumed the key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
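The use_cache double-pop is easy to reproduce in isolation. The two functions below are stand-ins written for this illustration, not the code in mllm.py: they only show why a second pop of the same key silently returns the default.

```python
def buggy_chat(**kwargs):
    use_cache = kwargs.pop("use_cache", True)  # consumes the caller's value
    # ... unrelated code in between ...
    use_cache = kwargs.pop("use_cache", True)  # key is gone, so the default wins
    return use_cache


def fixed_chat(**kwargs):
    # Pop exactly once and reuse the local variable afterwards.
    use_cache = kwargs.pop("use_cache", True)
    return use_cache


buggy_result = buggy_chat(use_cache=False)
fixed_result = fixed_chat(use_cache=False)
```

With the duplicate pop, a caller passing use_cache=False still ends up with caching enabled, which is the overwrite the P2 item describes.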

- P1: Move text-only MTP routing before _generation_lock acquisition. _stream_generate_text() acquires the lock internally, so calling it inside the lock caused an asyncio.Lock deadlock (not re-entrant).
- P2: Use last chunk's accumulated text instead of concatenating deltas. _stream_generate_text() yields full accumulated text on each chunk, not deltas. Also use chunk.completion_tokens directly instead of len(tokens) which was always 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
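The deadlock in the P1 item follows from asyncio.Lock not being re-entrant. The sketch below demonstrates that with a short timeout, then shows the routing order the fix describes. generate_text() and chat() are simplified stand-ins, not the engine's real methods.

```python
import asyncio


async def reacquire_blocks() -> bool:
    """Return True if re-acquiring a held asyncio.Lock from the same task blocks."""
    lock = asyncio.Lock()
    async with lock:
        try:
            # asyncio.Lock does not track an owner, so this second acquire would
            # wait forever; the timeout only keeps the demonstration finite.
            await asyncio.wait_for(lock.acquire(), timeout=0.05)
        except asyncio.TimeoutError:
            return True
        lock.release()  # only reached if the second acquire unexpectedly succeeded
        return False


async def generate_text(lock: asyncio.Lock) -> str:
    # Stand-in for a helper that takes the generation lock itself.
    async with lock:
        return "text"


async def chat(lock: asyncio.Lock, has_media: bool) -> str:
    if not has_media:
        # Route text-only requests before taking the lock, as in the fix.
        return await generate_text(lock)
    async with lock:
        return "media"


async def demo() -> str:
    return await chat(asyncio.Lock(), has_media=False)
```

Calling generate_text() from inside chat()'s own `async with lock` block would hang in the same way reacquire_blocks() does without its timeout.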

_stream_generate_text() now accepts a stop parameter and checks accumulated text against stop sequences in the yield loop, matching the behavior of MLXLanguageModel.stream_generate(). Both callers (chat() and stream_chat()) now pass stop through. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


otarkhan and others added 27 commits February 28, 2026 20:09

Add --served-model-name CLI parameter

85bae64

Allow users to serve a model under a different name in API responses, matching vLLM's --served-model-name behavior.
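A minimal argparse sketch of the flag's effect on the name reported in responses. Only --served-model-name comes from this PR; the --model flag, the response_model_name() helper, and the exact fallback order are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical flag for the model path or Hub repo ID.
    parser.add_argument("--model", required=True)
    parser.add_argument("--served-model-name", default=None)
    return parser


def response_model_name(args: argparse.Namespace, requested=None) -> str:
    """Name to report in API responses: served alias, else request value, else path."""
    return args.served_model_name or requested or args.model


aliased = build_parser().parse_args(["--model", "org/model-4bit", "--served-model-name", "my-model"])
plain = build_parser().parse_args(["--model", "org/model-4bit"])
```

Keeping the real path on `args.model` untouched is what lets other components, such as cache location logic, ignore the alias entirely.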

Fix prefix cache dir using served name instead of model path

41b4e76

The cache directory was derived from _model_name which could be overridden by --served-model-name, causing cache misses when the served name changed. Use the actual model path instead.
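The idea behind this fix can be sketched as deriving the directory from the model path only, so changing the served name never moves the cache. The hashing scheme and the prefix_cache_dir() name are hypothetical; the commit only says that the actual model path is used.

```python
import hashlib
from pathlib import Path


def prefix_cache_dir(cache_root: str, model_path: str) -> Path:
    """Derive a stable cache directory from the model path, never the served alias."""
    digest = hashlib.sha256(model_path.encode("utf-8")).hexdigest()[:16]
    return Path(cache_root) / digest


# The served name is deliberately not an input, so renaming cannot cause a miss.
first_run = prefix_cache_dir("/tmp/prefix-cache", "org/model-4bit")
after_rename = prefix_cache_dir("/tmp/prefix-cache", "org/model-4bit")
other_model = prefix_cache_dir("/tmp/prefix-cache", "org/other-model")
```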

fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)

80c6849

style: ruff format + lint fixes for new code

eb56c7d

Fix video native init, import guard, empty source and has_media detection

92b3556

remove streaming tool fix (covered by waybarrios#148) and fix eos_token_ids in strict=False loader

d90486e

Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)

90eac21

Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming

fix lint CI to use python 3.13 for black compatibility

913bfd0

format engine_core.py long line

0b07872

resolve merge conflicts with main

6e413f6

Merge pull request waybarrios#125 from otarkhan/feature/served-model-name

c609b59

Add --served-model-name CLI parameter

resolve merge conflicts with main

35c77ec

format test_video.py

ede4e30

Merge pull request waybarrios#150 from patanet7/feat/native-video-support

2a79216

feat: native Qwen3-VL video support in MLLM mode

remove dead code in _load_strict_false

74c2f02

Merge pull request waybarrios#97 from janhilgard/fix/hybrid-model-batching-mtp-injection

d235c37

fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection

fix: truncate new_text on stop hit so SSE streams omit stop sequence

4ce9f23

When a stop sequence is found in accumulated_text, also trim new_text by the same overshoot so streaming clients never see the stop string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
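The overshoot trimming described in this commit can be sketched as a pure function. It assumes new_text is the most recent suffix of accumulated_text; apply_stop() and its return shape are hypothetical, not the signature used in _stream_generate_text().

```python
def apply_stop(accumulated_text: str, new_text: str, stop: list[str]):
    """Trim the earliest stop sequence, and everything after it, from both strings.

    Returns (accumulated_text, new_text, stopped).
    """
    hits = [accumulated_text.find(s) for s in stop if s]
    hits = [i for i in hits if i != -1]
    if not hits:
        return accumulated_text, new_text, False
    cut = min(hits)
    # Number of trailing characters that must not reach the client.
    overshoot = len(accumulated_text) - cut
    trimmed_new = new_text[: max(0, len(new_text) - overshoot)]
    return accumulated_text[:cut], trimmed_new, True


result = apply_stop("Hello world STOP", "ld STOP", ["STOP"])
```

The max(0, ...) guard covers the case where the stop string began in an earlier chunk, so the whole of the latest delta has to be dropped.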

raullenchai wants to merge 27 commits into main from feat/upstream-sync-march26


7 participants
