feat: backport production features — MTP, tool parsers, sampling, prefill#278
Conversation
Force-pushed 462b8a8 to c7cb3aa
Thump604
left a comment
I pulled this branch locally and do not think it is sweep-merge ready as one omnibus backport. Three concrete blockers stood out:
- `vllm_mlx/reasoning/qwen3_parser.py` changes no-tag output from pure content to pure reasoning. That codifies the exact Qwen failure mode we just debugged locally: the model can return an answer without channel tags, and this change makes that answer invisible in non-streaming unless another layer recovers it. I would not merge that behavior change unconditionally. Either gate it on request-level policy (`enable_thinking` / an explicit recovery flag) or leave the parser's no-tag case as content and handle recovery in the server (see the sketch after this list).
- The new `frequency_penalty` / `presence_penalty` API is not implemented with the semantics the field names promise. In `server.py`, both are collapsed onto a single `repetition_penalty`, and if both are set one of them is silently ignored. That is not an upstream-safe API expansion under OpenAI-compatible names. Either implement the real semantics or only expose `repetition_penalty` in this PR.
- `vllm_mlx/mllm_scheduler.py` now has two adaptive Metal cache-clear blocks back-to-back and increments `_step_count` twice per step. That looks like a bad merge and accidentally changes the clear cadence.
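For reference, a minimal sketch of the gating option for the first blocker. The function name, the `enable_thinking` parameter, and the return convention are illustrative assumptions, not the parser's actual API:

```python
def split_reasoning(model_output: str, enable_thinking: bool = False):
    """Illustrative sketch: split raw model text into (reasoning, content).

    The no-tag case stays content by default; treating untagged text as
    reasoning is opted into per request instead of hardcoded in the parser.
    """
    if "<think>" not in model_output and "</think>" not in model_output:
        if enable_thinking:
            # Model was asked to think but omitted the tags; the caller
            # decides how to recover a user-visible answer from this text.
            return model_output, None
        # Upstream default: untagged output is plain content.
        return None, model_output
    reasoning, _, content = model_output.partition("</think>")
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning or None, content.strip() or None
```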
I'm happy to re-review after those are fixed or split out. The rest of the bundle is large enough that I would strongly prefer it land as smaller PRs unless there is a hard dependency forcing the aggregation.
Force-pushed c7cb3aa to e46bdb8
janhilgard
left a comment
Thanks for the thorough review! All three blockers fixed:
- `qwen3_parser.py` — Reverted to upstream behavior: no-tag output = pure content. The reasoning-as-default change was too aggressive without gating on `enable_thinking`.
- `frequency_penalty` / `presence_penalty` — Removed both fields from the API models entirely. Only `repetition_penalty` (mlx-lm native) is now exposed. No misleading OpenAI-compatible names without proper semantics.
- `mllm_scheduler.py` duplicate cache-clear — Removed the second adaptive block that caused the double `_step_count` increment. A single cache-clear block remains with the original interval logic (a minimal sketch follows).
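For context, a minimal sketch of what the single remaining block boils down to. The attribute names and interval are assumptions; only the MLX cache-clear call is real, and its exact spelling depends on the MLX version:

```python
import mlx.core as mx

class _CacheClearSketch:
    """Sketch of one adaptive Metal cache-clear block per scheduler step."""

    def __init__(self, clear_interval: int = 32):
        self._step_count = 0          # incremented exactly once per step
        self._clear_interval = clear_interval

    def after_step(self) -> None:
        self._step_count += 1
        if self._step_count % self._clear_interval == 0:
            # Newer MLX exposes mx.clear_cache(); older releases use
            # mx.metal.clear_cache() instead.
            mx.clear_cache()
```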
Thump604
left a comment
Re-reviewed the follow-up delta from c7cb3aa to e46bdb8.
The three blockers from my earlier review are addressed:
- `qwen3_parser.py` no longer changes the no-tag case
- the misleading `frequency_penalty` / `presence_penalty` API expansion is gone; this is now just `repetition_penalty`
- the duplicate adaptive cache-clear / double `_step_count` issue in `mllm_scheduler.py` is gone
I do not see a new blocker in this follow-up range.
Force-pushed d56426f to c00f4ce
…fill

New files:
- patches/qwen3_5_mllm.py: BatchKVCache offset fix for Qwen3.5
- patches/qwen3_5_mtp.py: Runtime MTP injection for Qwen3.5
- tool_parsers/minimax_tool_parser.py: MiniMax-M2 tool parser
- scripts/add_mtp_weights_qwen35.py: Extract MTP weights from BF16

Key changes:
- mllm_batch_generator: chunked prefill, mid-batch extend, MTP hooks, patch registration, repetition penalty, prefill abort, think-suffix stripping for prefix cache
- mllm_scheduler: request status, cache config, prefill abort
- server: enable_thinking, tool_choice=none, tool argument coercion
- engines: MTP injection, enable_thinking, gpu_memory_utilization
- memory_cache: block LCP for hybrid models (SSM can't be rewound)

Prefix cache fix: enable_thinking=True adds <think>\n to the generation prompt, breaking PREFIX match across conversation turns. Strip these tokens from cache keys in both store and fetch paths so stored entries match as clean prefixes.

Tested: 3.12s → 0.39s (8x) for 1400-token prompts on Qwen3.5-122B hybrid model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed c00f4ce to f61d34e
Hey @janhilgard — thanks for backporting these features to upstream! Great to see them landing here. Just a note: these features (MTP injection, MiniMax tool parser, prefix cache think-suffix stripping, gpu_memory_utilization, DeltaNet/SSM prefix cache fixes) originated from our fork Rapid-MLX, where they've been developed and battle-tested over the past few months. Would appreciate a mention in the PR description for attribution — something like "backported from Rapid-MLX" would be great. Open source works best when we credit each other's work. 🙏 Happy to collaborate more directly going forward!
Hey @raullenchai! Thanks for flagging this, totally fair. Rapid-MLX is already in our README acknowledgments. Always down to collaborate more, feel free to reach out anytime.
Hey @raullenchai — thanks for the note! I want to respectfully clarify the timeline here. Our fork (…)

For DeltaNet/SSM prefix cache fixes specifically — yes, Rapid-MLX had those earlier (March 15). Our think-suffix stripping for Qwen3.5 hybrid models was developed independently in April. These features have been running in production on our Apple Silicon setup (Qwen3.5-122B, Gemma 4, etc.) since February/March. PR #278 is a backport of that production-tested code to upstream. Both forks clearly arrived at similar solutions independently — convergent engineering from working on the same codebase. Happy to collaborate going forward!
Hi both, appreciate the transparency on the timeline. That said, I don't think there's a need to get into who did what first. Both forks have been doing great work independently, and that's exactly what open source is about. @janhilgard, you're right that many of these features came through in earlier PRs that I hadn't had the bandwidth to review until now. That's on me. @raullenchai, Rapid-MLX has been doing solid work and the acknowledgment in the README is well deserved. Worth noting that this repo is the upstream parent of Rapid-MLX, so there's naturally a lot of shared DNA between the two. At the end of the day, we're all building on the same codebase and pushing it forward. No competition here, just collaboration. Let's keep it that way.
Summary
Backport battle-tested features from the production fork to upstream. These features have been running in production for days with multiple models (Qwen3.5-122B-A10B, Gemma 4 26B-A4B).
New files (4)
- `patches/qwen3_5_mllm.py` — BatchKVCache offset fix for Qwen3.5 attention (same `mx.array.__iadd__` bug class as the Gemma 4 fix in "fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache" #256)
- `patches/qwen3_5_mtp.py` — Runtime MTP (Multi-Token Prediction) injection for Qwen3.5 models
- `tool_parsers/minimax_tool_parser.py` — MiniMax-M2 tool call parser
- `scripts/add_mtp_weights_qwen35.py` — Extract MTP weights from BF16 Qwen3.5 source models

Modified files (17)
- MLLM batch generator — Chunked prefill, mid-batch extend, MTP injection hooks, patch registration, repetition penalty, prefill abort on disconnect, per-request logits processors (see the repetition-penalty sketch after this list), think-suffix stripping for prefix cache, `last_query_index` template normalization for prefix-stable assistant turns
- MLLM scheduler — Request status tracking, cache config forwarding, prefill abort support, fixed duplicate cache-clear block
- Server — `enable_thinking` per-request, `tool_choice=none`, tool argument type coercion, `repetition_penalty`
- Engines — MTP injection in batched engine, `enable_thinking` in simple engine, `gpu_memory_utilization` config
- CLI — MiniMax parser choice, `--gpu-memory-utilization` flag
- API models — `enable_thinking`, `repetition_penalty`
- Prefix cache — Block LCP for hybrid models (SSM can't be rewound), think-suffix stripping for clean PREFIX matches, chat template normalization for Qwen3.5
- Other — Qwen3.5/MiniMax MLLM patterns, TextModelArgs from VLM config dict, MTP validator in scheduler
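As an illustration of the per-request logits-processor and repetition-penalty items above, a minimal sketch following the mlx-lm logits-processor convention of `(generated_tokens, logits) -> logits`; the wiring into the batch generator is assumed, not shown:

```python
import mlx.core as mx

def make_repetition_penalty(penalty: float, context_size: int = 20):
    """Build a per-request logits processor that damps recently used tokens.

    CTRL-style rule: divide positive logits and multiply negative logits for
    tokens generated in the recent window, so repeats become less likely.
    """
    def processor(tokens: mx.array, logits: mx.array) -> mx.array:
        if penalty == 1.0 or len(tokens) == 0:
            return logits
        recent = tokens[-context_size:]
        selected = logits[:, recent]
        logits[:, recent] = mx.where(
            selected < 0, selected * penalty, selected / penalty
        )
        return logits

    return processor
```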
MTP speculative decoding results
Critical: MTP weights must be native BF16 extracted from source model — dequantized 4-bit weights give 0% acceptance.
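For illustration, a sketch of the extraction idea behind `scripts/add_mtp_weights_qwen35.py`. The `"mtp"` name filter and the output layout are assumptions; the real script may differ:

```python
from pathlib import Path
import mlx.core as mx

def extract_mtp_weights(bf16_model_dir: str, out_file: str, marker: str = "mtp") -> list[str]:
    """Copy MTP tensors out of a BF16 source checkpoint, keeping them unquantized."""
    collected: dict[str, mx.array] = {}
    for shard in sorted(Path(bf16_model_dir).glob("*.safetensors")):
        for name, tensor in mx.load(str(shard)).items():
            if marker in name:
                collected[name] = tensor  # native BF16; never dequantized 4-bit
    if not collected:
        raise ValueError(f"no '{marker}' tensors found under {bf16_model_dir}")
    mx.save_safetensors(out_file, collected)
    return sorted(collected)
```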
Prefix cache: three fixes for Qwen3.5
Qwen3.5 is a hybrid model (~75% GatedDeltaNet SSM layers, ~25% attention layers). Three bugs prevented prefix cache from working:
Fix 1: Think-suffix stripping
`enable_thinking=True` adds `<think>\n` (2 tokens) to the generation prompt. The stored key ends with `<think>\n`, but the next request has the actual response at that position — tokens diverge, no PREFIX match.

Fix: Strip the think suffix from cache keys in both store and fetch. `_compute_think_suffix_len()` detects the suffix at init. Suffix tokens are re-appended to `remaining_ids` after fetch so the model still sees the full generation prompt.
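A minimal sketch of the idea — the helper names mirror the description above, but the tokenizer call and key layout are simplified assumptions:

```python
def compute_think_suffix_len(tokenizer) -> int:
    """How many tokens enable_thinking=True appends to the generation prompt."""
    msgs = [{"role": "user", "content": "hi"}]
    with_think = tokenizer.apply_chat_template(
        msgs, add_generation_prompt=True, enable_thinking=True
    )
    without_think = tokenizer.apply_chat_template(
        msgs, add_generation_prompt=True, enable_thinking=False
    )
    return max(len(with_think) - len(without_think), 0)


def cache_key(prompt_ids: list[int], think_suffix_len: int) -> tuple[int, ...]:
    """Strip the trailing think-suffix tokens so store and fetch keys line up."""
    if 0 < think_suffix_len < len(prompt_ids):
        return tuple(prompt_ids[:-think_suffix_len])
    return tuple(prompt_ids)
```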
Fix 2: `last_query_index` template normalization

Qwen3.5's chat template computes `last_query_index` (position of the last non-tool-response user message) and conditionally wraps assistant turns after that index in `<think>...\n</think>\n\n`. When a user text message is appended after tool results, `last_query_index` jumps forward, retroactively removing `<think>` blocks from earlier assistant turns — shifting tokens mid-sequence.
Fix: `_normalize_chat_template_for_prefix_cache()` regex-replaces the conditional with the plain (ELSE) branch. All historical assistant messages use `<|im_start|>assistant\ncontent` without injected `<think>` blocks. The generation prompt still adds `<think>\n` at the end.

Critical: VLM processors (e.g. `Qwen3VLProcessor`) keep a separate copy of `chat_template` from their tokenizer. `processor.apply_chat_template()` reads from `processor.chat_template`, NOT `processor.tokenizer.chat_template`. Both must be patched.
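A sketch of the patching shape — the regex pattern and replacement stand in for the real rewrite of the `last_query_index` conditional; the point shown is that both template copies must be updated:

```python
import re

def normalize_chat_template_for_prefix_cache(processor, pattern: str, replacement: str) -> None:
    """Rewrite the conditional to its plain ELSE branch on every template copy.

    Qwen3VLProcessor-style processors keep chat_template separately from their
    tokenizer, and processor.apply_chat_template() reads the processor's copy,
    so patching only tokenizer.chat_template has no visible effect.
    """
    for owner in (processor, getattr(processor, "tokenizer", None)):
        if owner is None:
            continue
        template = getattr(owner, "chat_template", None)
        if not template:
            continue
        owner.chat_template = re.sub(pattern, replacement, template, flags=re.DOTALL)
```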
Fix 3: LCP blocked for hybrid models

SSM state can't be rewound — setting layers to `None` crashes on `merge()`. LCP match is blocked; the other two fixes enable clean PREFIX matches instead.
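A sketch of the resulting match policy — the `has_ssm_layers` flag and enum are assumptions standing in for however the cache detects a hybrid model:

```python
from enum import Enum

class MatchKind(Enum):
    PREFIX = "prefix"  # stored key is a prefix of the new prompt; always safe
    LCP = "lcp"        # partial overlap that would require rewinding cached state

def match_allowed(kind: MatchKind, has_ssm_layers: bool) -> bool:
    """GatedDeltaNet/SSM state has no per-token history to rewind, so hybrid
    models only ever reuse clean PREFIX matches from the cache."""
    return not (kind is MatchKind.LCP and has_ssm_layers)
```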
Results
Changes from v1 (addressing review feedback)
- `frequency_penalty` / `presence_penalty` API fields removed — only `repetition_penalty` is exposed, avoiding misleading OpenAI-compatible names without proper semantics
- Removed the duplicate adaptive cache-clear block / double `_step_count` increment

Changes from v2 (prefix cache fix)
- Dropped the LCP rewind attempt for SSM layers, which crashes on `merge()`. New approach strips `<think>\n` tokens from cache keys so PREFIX match works cleanly without touching SSM state.

Changes from v3 (template normalization fix)
- `last_query_index` template normalization — Qwen3.5 chat template retroactively changes assistant turn formatting based on `last_query_index`, breaking prefix cache for tool_result→user_text transitions
- `processor.chat_template` is a separate object from `processor.tokenizer.chat_template`; both must be patched for normalization to take effect

Test plan
- `uvx black --check vllm_mlx/` passes
- `python -m py_compile` passes for all modified files
- `<think>` suffix length = 2 tokens
- `--enable-mtp` flag
- `--tool-call-parser minimax`
- `enable_thinking` parameter respected per-request

🤖 Generated with Claude Code