
feat: backport production features — MTP, tool parsers, sampling, prefill #278

Merged
Thump604 merged 1 commit into waybarrios:main from janhilgard:feat/production-backport
Apr 11, 2026

Conversation

@janhilgard
Collaborator

@janhilgard janhilgard commented Apr 10, 2026

Summary

Backport battle-tested features from the production fork to upstream. These features have been running in production for days with multiple models (Qwen3.5-122B-A10B, Gemma 4 26B-A4B).

New files (4)

  • patches/qwen3_5_mllm.py — BatchKVCache offset fix for Qwen3.5 attention (same mx.array.__iadd__ bug class as the Gemma 4 fix in #256, "fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache")
  • patches/qwen3_5_mtp.py — Runtime MTP (Multi-Token Prediction) injection for Qwen3.5 models
  • tool_parsers/minimax_tool_parser.py — MiniMax-M2 tool call parser
  • scripts/add_mtp_weights_qwen35.py — Extract MTP weights from BF16 Qwen3.5 source models

Modified files (17)

MLLM batch generator — Chunked prefill, mid-batch extend, MTP injection hooks, patch registration, repetition penalty, prefill abort on disconnect, per-request logits processors, think-suffix stripping for prefix cache, last_query_index template normalization for prefix-stable assistant turns

MLLM scheduler — Request status tracking, cache config forwarding, prefill abort support, fixed duplicate cache-clear block

Server — enable_thinking per-request, tool_choice=none, tool argument type coercion, repetition_penalty

Engines — MTP injection in batched engine, enable_thinking in simple engine, gpu_memory_utilization config

CLI — MiniMax parser choice, --gpu-memory-utilization flag

API models — enable_thinking, repetition_penalty

Prefix cache — Block LCP for hybrid models (SSM can't be rewound), think-suffix stripping for clean PREFIX matches, chat template normalization for Qwen3.5

Other — Qwen3.5/MiniMax MLLM patterns, TextModelArgs from VLM config dict, MTP validator in scheduler

MTP speculative decoding results

Model                       Acceptance rate   Notes
Qwen3.5-122B-A10B (MoE)     77-78%            5.0 GB BF16 MTP weights
Qwen3.5-35B-A3B (MoE)       79-85%            1.7 GB BF16 MTP weights
Qwen3.5-27B (Dense)         78-80%            849 MB BF16 MTP weights

Critical: MTP weights must be native BF16 extracted from source model — dequantized 4-bit weights give 0% acceptance.
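
For illustration, a minimal sketch of what the extraction step can look like. The "model.mtp." key prefix and the MLX-based I/O are assumptions for the sketch; the actual key layout and conversion are handled by scripts/add_mtp_weights_qwen35.py.

```python
# Hypothetical sketch only: the "model.mtp." key prefix is an assumption,
# not the actual layout handled by scripts/add_mtp_weights_qwen35.py.
from pathlib import Path
import mlx.core as mx

def extract_mtp_weights(src_dir: str, dst_file: str, prefix: str = "model.mtp.") -> int:
    """Copy MTP tensors out of a BF16 source checkpoint, unquantized."""
    mtp = {}
    for shard in sorted(Path(src_dir).glob("*.safetensors")):
        weights = mx.load(str(shard))
        mtp.update({k: v for k, v in weights.items() if k.startswith(prefix)})
    if not mtp:
        raise ValueError(f"no keys under {prefix!r}; wrong checkpoint or prefix")
    # Keep the native BF16 tensors: dequantized 4-bit weights give 0% acceptance.
    mx.save_safetensors(dst_file, mtp)
    return len(mtp)
```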

Prefix cache: three fixes for Qwen3.5

Qwen3.5 is a hybrid model (~75% GatedDeltaNet SSM layers, ~25% attention layers). Three bugs prevented prefix cache from working:

Fix 1: Think-suffix stripping

enable_thinking=True adds <think>\n (2 tokens) to the generation prompt. Stored key ends with <think>\n, but next request has actual response at that position — tokens diverge, no PREFIX match.

Fix: Strip think suffix from cache keys in both store and fetch. _compute_think_suffix_len() detects suffix at init. Suffix tokens re-appended to remaining_ids after fetch so model still sees full generation prompt.
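
A rough sketch of the mechanism, assuming a Qwen3-style chat template whose generation prompt ends in the literal string "<think>\n"; helper names mirror the description above, not the exact implementation.

```python
# Illustrative sketch only; helper names mirror the PR description,
# not the exact implementation in the MLLM batch generator.

def _compute_think_suffix_len(tokenizer) -> int:
    """Count the trailing generation-prompt tokens (e.g. '<think>' plus newline)
    that should be excluded from prefix-cache keys."""
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": "x"}],
        tokenize=False, add_generation_prompt=True, enable_thinking=True,
    )
    if not rendered.endswith("<think>\n"):
        return 0  # template does not append a think prefix
    return len(tokenizer.encode("<think>\n", add_special_tokens=False))

def cache_key(prompt_ids, suffix_len):
    """Apply the same stripping on store and fetch so consecutive turns
    still match as a clean PREFIX."""
    return tuple(prompt_ids[:-suffix_len]) if suffix_len else tuple(prompt_ids)

def remaining_after_fetch(prompt_ids, cached_len):
    """The stripped suffix sits in this uncached tail, so the model still
    prefills the full generation prompt."""
    return list(prompt_ids[cached_len:])
```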

Fix 2: last_query_index template normalization

Qwen3.5 chat template computes last_query_index (position of last non-tool-response user message) and conditionally wraps assistant turns after that index in <think>...\n</think>\n\n. When a user text message is appended after tool results, last_query_index jumps forward, retroactively removing <think> blocks from earlier assistant turns — shifting tokens mid-sequence.

Fix: _normalize_chat_template_for_prefix_cache() regex-replaces the conditional with the plain (ELSE) branch. All historical assistant messages use <|im_start|>assistant\ncontent without injected <think> blocks. Generation prompt still adds <think>\n at the end.

Critical: VLM processors (e.g. Qwen3VLProcessor) keep a separate copy of chat_template from their tokenizer. processor.apply_chat_template() reads from processor.chat_template, NOT processor.tokenizer.chat_template, so both copies must be patched.
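
A minimal sketch of the dual patch (illustrative; patch_chat_templates is not a function from this PR, and the normalizer is passed in as the helper described above):

```python
# Sketch of the dual-template patch.  processor.apply_chat_template() reads
# processor.chat_template, NOT processor.tokenizer.chat_template, so the
# normalized template has to be written to both copies.

def patch_chat_templates(processor, normalize) -> None:
    """normalize is the _normalize_chat_template_for_prefix_cache() helper."""
    tok = processor.tokenizer
    tok.chat_template = normalize(tok.chat_template)
    # VLM processors (e.g. Qwen3VLProcessor) keep an independent copy.
    if getattr(processor, "chat_template", None) is not None:
        processor.chat_template = normalize(processor.chat_template)
```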

Fix 3: LCP blocked for hybrid models

SSM state can't be rewound — setting layers to None crashes on merge(). LCP match is blocked; the other two fixes enable clean PREFIX matches instead.
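
A hedged sketch of the resulting match policy (function and flag names are illustrative):

```python
# Illustrative guard: hybrid models only take exact PREFIX hits, because
# partially reusing a cache entry would require rewinding GatedDeltaNet SSM
# state, and dropping those layers (setting them to None) crashes on merge().

def classify_match(entry_tokens, prompt_tokens, is_hybrid: bool):
    limit = min(len(entry_tokens), len(prompt_tokens))
    common = 0
    while common < limit and entry_tokens[common] == prompt_tokens[common]:
        common += 1
    if common > 0 and common == len(entry_tokens):
        return "PREFIX", common       # whole stored entry reused as-is
    if is_hybrid:
        return "MISS", 0              # cannot rewind SSM state to `common`
    return "LCP", common              # attention-only caches can be trimmed
```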

Results

Scenario                         Before         After
Real Claude Code (87K tokens)    256s prefill   4.1s (PREFIX HIT, cached=87467)
tool_result → tool_result        PREFIX HIT     PREFIX HIT
tool_result → user_text          MISS           PREFIX HIT (cached=441, remaining=177)
user_text → user_text            PREFIX HIT     PREFIX HIT

Changes from v1 (addressing review feedback)

  1. Reverted qwen3_parser.py no-tag behavior — no-tag output stays as pure content (not reasoning), matching upstream semantics
  2. Removed frequency_penalty/presence_penalty API fields — only repetition_penalty is exposed, avoiding misleading OpenAI-compatible names without proper semantics
  3. Fixed duplicate cache-clear block in mllm_scheduler.py — removed second adaptive Metal cache-clear that caused double _step_count increment

Changes from v2 (prefix cache fix)

  1. Replaced LCP hybrid match with think-suffix stripping — LCP hybrid match (setting SSM layers to None) crashed on merge(). New approach strips <think>\n tokens from cache keys so PREFIX match works cleanly without touching SSM state.

Changes from v3 (template normalization fix)

  1. Added last_query_index template normalization — Qwen3.5 chat template retroactively changes assistant turn formatting based on last_query_index, breaking prefix cache for tool_result→user_text transitions
  2. Fixed VLM processor dual-template bug — processor.chat_template is a separate object from processor.tokenizer.chat_template; both must be patched for normalization to take effect

Test plan

  • uvx black --check vllm_mlx/ passes
  • python -m py_compile passes for all modified files
  • Prefix cache: 62x speedup on Qwen3.5-122B (256s → 4.1s for 87K tokens)
  • Prefix cache: tool_result → user_text transition hits (cached=441 remaining=177)
  • Prefix cache: tool_result → tool_result transition hits (cached=87467 remaining=478)
  • Billing header strip works with prefix cache
  • Think-suffix stripping detects <think>\n = 2 tokens
  • Template normalization patches both processor and tokenizer copies
  • Qwen3.5 model loads with --enable-mtp flag
  • MiniMax tool parser works with --tool-call-parser minimax
  • enable_thinking parameter respected per-request
  • Repetition penalty works correctly
  • Prefill abort fires on client disconnect

🤖 Generated with Claude Code

@janhilgard janhilgard force-pushed the feat/production-backport branch 2 times, most recently from 462b8a8 to c7cb3aa on April 10, 2026 at 22:31
@janhilgard janhilgard requested a review from Thump604 April 10, 2026 22:33
Collaborator

@Thump604 Thump604 left a comment

I pulled this branch locally and do not think it is sweep-merge ready as one omnibus backport. Three concrete blockers stood out:

  1. vllm_mlx/reasoning/qwen3_parser.py changes no-tag output from pure content to pure reasoning. That codifies the exact Qwen failure mode we just debugged locally: the model can return an answer without channel tags, and this change makes that answer invisible in non-streaming unless another layer recovers it. I would not merge that behavior change unconditionally. Either gate it on request-level policy (enable_thinking / explicit recovery flag) or leave the parser's no-tag case as content and handle recovery in the server.

  2. The new frequency_penalty / presence_penalty API is not implemented with the semantics the field names promise. In server.py, both are collapsed onto a single repetition_penalty, and if both are set one of them is silently ignored. That is not an upstream-safe API expansion under OpenAI-compatible names. Either implement the real semantics (the sketch after this comment spells out the difference) or only expose repetition_penalty in this PR.

  3. vllm_mlx/mllm_scheduler.py now has two adaptive Metal cache-clear blocks back-to-back and increments _step_count twice per step. That looks like a bad merge and changes the clear cadence accidentally.

I'm happy to re-review after those are fixed or split out. The rest of the bundle is large enough that I would strongly prefer it land as smaller PRs unless there is a hard dependency forcing the aggregation.
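
To make the distinction in point 2 concrete, a reference sketch of the two semantics (illustrative code, not from this PR):

```python
# Reference sketch, not code from this PR: the two penalty families are not
# interchangeable, so mapping frequency/presence onto repetition_penalty is lossy.
from collections import Counter

def openai_style(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """OpenAI semantics: additive offsets, scaled by how often a token appeared
    (frequency) or merely whether it appeared at all (presence)."""
    for tok, count in Counter(generated_ids).items():
        logits[tok] -= frequency_penalty * count + presence_penalty
    return logits

def repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style repetition penalty (the form mlx-lm uses): multiplicative and
    sign-dependent, with no notion of occurrence counts."""
    for tok in set(generated_ids):
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    return logits
```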

@janhilgard janhilgard force-pushed the feat/production-backport branch from c7cb3aa to e46bdb8 on April 10, 2026 at 22:49
Collaborator Author

@janhilgard janhilgard left a comment

Thanks for the thorough review! All three blockers fixed:

  1. qwen3_parser.py — Reverted to upstream behavior: no-tag output = pure content. The reasoning-as-default change was too aggressive without gating on enable_thinking.

  2. frequency_penalty / presence_penalty — Removed both fields from the API models entirely. Only repetition_penalty (mlx-lm native) is now exposed. No misleading OpenAI-compatible names without proper semantics.

  3. mllm_scheduler.py duplicate cache-clear — Removed the second adaptive block that caused double _step_count increment. Single cache-clear block remains with the original interval logic.

Collaborator

@Thump604 Thump604 left a comment

Re-reviewed the follow-up delta from c7cb3aa to e46bdb8.

The three blockers from my earlier review are addressed:

  • qwen3_parser.py no longer changes the no-tag case
  • the misleading frequency_penalty / presence_penalty API expansion is gone; this is now just repetition_penalty
  • the duplicate adaptive cache-clear / double _step_count issue in mllm_scheduler.py is gone

I do not see a new blocker in this follow-up range.

@janhilgard janhilgard force-pushed the feat/production-backport branch 4 times, most recently from d56426f to c00f4ce on April 11, 2026 at 08:00
feat: backport production features — MTP, tool parsers, sampling, prefill

New files:
- patches/qwen3_5_mllm.py: BatchKVCache offset fix for Qwen3.5
- patches/qwen3_5_mtp.py: Runtime MTP injection for Qwen3.5
- tool_parsers/minimax_tool_parser.py: MiniMax-M2 tool parser
- scripts/add_mtp_weights_qwen35.py: Extract MTP weights from BF16

Key changes:
- mllm_batch_generator: chunked prefill, mid-batch extend, MTP hooks,
  patch registration, repetition penalty, prefill abort, think-suffix
  stripping for prefix cache
- mllm_scheduler: request status, cache config, prefill abort
- server: enable_thinking, tool_choice=none, tool argument coercion
- engines: MTP injection, enable_thinking, gpu_memory_utilization
- memory_cache: block LCP for hybrid models (SSM can't be rewound)

Prefix cache fix: enable_thinking=True adds <think>\n to generation
prompt, breaking PREFIX match across conversation turns.  Strip these
tokens from cache keys in both store and fetch paths so stored entries
match as clean prefixes.  Tested: 3.12s → 0.39s (8x) for 1400-token
prompts on Qwen3.5-122B hybrid model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai

Hey @janhilgard — thanks for backporting these features to upstream! Great to see them landing here.

Just a note: these features (MTP injection, MiniMax tool parser, prefix cache think-suffix stripping, gpu_memory_utilization, DeltaNet/SSM prefix cache fixes) originated from our fork Rapid-MLX, where they've been developed and battle-tested over the past few months.

Would appreciate a mention in the PR description for attribution — something like "backported from Rapid-MLX" would be great. Open source works best when we credit each other's work. 🙏

Happy to collaborate more directly going forward!

@waybarrios
Owner

waybarrios commented Apr 11, 2026

Hey @raullenchai! Thanks for flagging this, totally fair. Rapid-MLX is already in our README acknowledgments. Always down to collaborate more, so feel free to reach out anytime.

@janhilgard
Collaborator Author

Hey @raullenchai — thanks for the note!

I want to respectfully clarify the timeline here. Our fork (janhilgard/vllm-mlx) was created on January 20, 2026 — 36 days before Rapid-MLX was forked (February 25, 2026). The features listed in your comment were developed in our fork first:

Feature                     janhilgard/vllm-mlx   Rapid-MLX      Delta
MTP speculative decoding    Feb 14, 2026          Feb 25, 2026   11 days earlier
MiniMax-M2 tool parser      Feb 18, 2026          Feb 19, 2026   1 day earlier
gpu_memory_utilization      Feb 23, 2026          Feb 25, 2026   2 days earlier
Gemma 4 tool parser         Apr 5, 2026           Apr 6, 2026    1 day earlier

For DeltaNet/SSM prefix cache fixes specifically — yes, Rapid-MLX had those earlier (March 15). Our think-suffix stripping for Qwen3.5 hybrid models was developed independently in April.

These features have been running in production on our Apple Silicon setup (Qwen3.5-122B, Gemma 4, etc.) since February/March. PR #278 is a backport of that production-tested code to upstream.

Both forks clearly arrived at similar solutions independently — convergent engineering from working on the same codebase. Happy to collaborate going forward!

@waybarrios
Owner

Hi both, appreciate the transparency on the timeline. That said, I don't think there's a need to get into who did what first. Both forks have been doing great work independently, and that's exactly what open source is about.

@janhilgard, you're right that many of these features came through in earlier PRs that I hadn't had the bandwidth to review until now. That's on me. @raullenchai, Rapid-MLX has been doing solid work and the acknowledgment in the README is well deserved. Worth noting that this repo is the upstream parent of Rapid-MLX, so there's naturally a lot of shared DNA between the two.

At the end of the day, we're all building on the same codebase and pushing it forward. No competition here, just collaboration. Let's keep it that way.
