
feat: backport production features — MTP, tool parsers, sampling, prefill #278

Merged
Thump604 merged 1 commit into waybarrios:main from janhilgard:feat/production-backport
Apr 11, 2026

Conversation

@janhilgard
Collaborator

@janhilgard janhilgard commented Apr 10, 2026

Summary

Backport battle-tested features from the production fork to upstream. These features have been running in production for days with multiple models (Qwen3.5-122B-A10B, Gemma 4 26B-A4B).

New files (4)

  • patches/qwen3_5_mllm.py — BatchKVCache offset fix for Qwen3.5 attention (same mx.array.__iadd__ bug class as the Gemma 4 fix in #256, "fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache")
  • patches/qwen3_5_mtp.py — Runtime MTP (Multi-Token Prediction) injection for Qwen3.5 models
  • tool_parsers/minimax_tool_parser.py — MiniMax-M2 tool call parser
  • scripts/add_mtp_weights_qwen35.py — Extract MTP weights from BF16 Qwen3.5 source models

Modified files (17)

MLLM batch generator — Chunked prefill, mid-batch extend, MTP injection hooks, patch registration, repetition penalty, prefill abort on disconnect, per-request logits processors, think-suffix stripping for prefix cache, last_query_index template normalization for prefix-stable assistant turns

MLLM scheduler — Request status tracking, cache config forwarding, prefill abort support, fixed duplicate cache-clear block

Server — enable_thinking per-request, tool_choice=none, tool argument type coercion, repetition_penalty

Engines — MTP injection in batched engine, enable_thinking in simple engine, gpu_memory_utilization config

CLI — MiniMax parser choice, --gpu-memory-utilization flag

API models — enable_thinking, repetition_penalty

Prefix cache — Block LCP for hybrid models (SSM can't be rewound), think-suffix stripping for clean PREFIX matches, chat template normalization for Qwen3.5

Other — Qwen3.5/MiniMax MLLM patterns, TextModelArgs from VLM config dict, MTP validator in scheduler

MTP speculative decoding results

Model                       Acceptance rate   Notes
Qwen3.5-122B-A10B (MoE)     77-78%            5.0 GB BF16 MTP weights
Qwen3.5-35B-A3B (MoE)       79-85%            1.7 GB BF16 MTP weights
Qwen3.5-27B (Dense)         78-80%            849 MB BF16 MTP weights

Critical: MTP weights must be native BF16 extracted from source model — dequantized 4-bit weights give 0% acceptance.
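
For illustration, a minimal sketch of what the extraction step can look like. The "model.mtp." key prefix and the MLX-based I/O are assumptions for the sketch; the actual key layout and conversion are handled by scripts/add_mtp_weights_qwen35.py.

```python
# Hypothetical sketch only: the "model.mtp." key prefix is an assumption,
# not the actual layout handled by scripts/add_mtp_weights_qwen35.py.
from pathlib import Path
import mlx.core as mx

def extract_mtp_weights(src_dir: str, dst_file: str, prefix: str = "model.mtp.") -> int:
    """Copy MTP tensors out of a BF16 source checkpoint, unquantized."""
    mtp = {}
    for shard in sorted(Path(src_dir).glob("*.safetensors")):
        weights = mx.load(str(shard))
        mtp.update({k: v for k, v in weights.items() if k.startswith(prefix)})
    if not mtp:
        raise ValueError(f"no keys under {prefix!r}; wrong checkpoint or prefix")
    # Keep the native BF16 tensors: dequantized 4-bit weights give 0% acceptance.
    mx.save_safetensors(dst_file, mtp)
    return len(mtp)
```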

Prefix cache: three fixes for Qwen3.5

Qwen3.5 is a hybrid model (~75% GatedDeltaNet SSM layers, ~25% attention layers). Three bugs prevented prefix cache from working:

Fix 1: Think-suffix stripping

enable_thinking=True adds <think>\n (2 tokens) to the generation prompt. Stored key ends with <think>\n, but next request has actual response at that position — tokens diverge, no PREFIX match.

Fix: Strip think suffix from cache keys in both store and fetch. _compute_think_suffix_len() detects suffix at init. Suffix tokens re-appended to remaining_ids after fetch so model still sees full generation prompt.
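
A rough sketch of the mechanism, assuming a Qwen3-style chat template whose generation prompt ends in the literal string "<think>\n"; helper names mirror the description above, not the exact implementation.

```python
# Illustrative sketch only; helper names mirror the PR description,
# not the exact implementation in the MLLM batch generator.

def _compute_think_suffix_len(tokenizer) -> int:
    """Count the trailing generation-prompt tokens (e.g. '<think>' plus newline)
    that should be excluded from prefix-cache keys."""
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": "x"}],
        tokenize=False, add_generation_prompt=True, enable_thinking=True,
    )
    if not rendered.endswith("<think>\n"):
        return 0  # template does not append a think prefix
    return len(tokenizer.encode("<think>\n", add_special_tokens=False))

def cache_key(prompt_ids, suffix_len):
    """Apply the same stripping on store and fetch so consecutive turns
    still match as a clean PREFIX."""
    return tuple(prompt_ids[:-suffix_len]) if suffix_len else tuple(prompt_ids)

def remaining_after_fetch(prompt_ids, cached_len):
    """The stripped suffix sits in this uncached tail, so the model still
    prefills the full generation prompt."""
    return list(prompt_ids[cached_len:])
```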

Fix 2: last_query_index template normalization

Qwen3.5 chat template computes last_query_index (position of last non-tool-response user message) and conditionally wraps assistant turns after that index in <think>...\n</think>\n\n. When a user text message is appended after tool results, last_query_index jumps forward, retroactively removing <think> blocks from earlier assistant turns — shifting tokens mid-sequence.

Fix: _normalize_chat_template_for_prefix_cache() regex-replaces the conditional with the plain (ELSE) branch. All historical assistant messages use <|im_start|>assistant\ncontent without injected <think> blocks. Generation prompt still adds <think>\n at the end.

Critical: VLM processors (e.g. Qwen3VLProcessor) keep a separate copy of chat_template from their tokenizer. processor.apply_chat_template() reads from processor.chat_template, NOT processor.tokenizer.chat_template, so both copies must be patched.
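
A minimal sketch of the dual patch (illustrative; patch_chat_templates is not a function from this PR, and the normalizer is passed in as the helper described above):

```python
# Sketch of the dual-template patch.  processor.apply_chat_template() reads
# processor.chat_template, NOT processor.tokenizer.chat_template, so the
# normalized template has to be written to both copies.

def patch_chat_templates(processor, normalize) -> None:
    """normalize is the _normalize_chat_template_for_prefix_cache() helper."""
    tok = processor.tokenizer
    tok.chat_template = normalize(tok.chat_template)
    # VLM processors (e.g. Qwen3VLProcessor) keep an independent copy.
    if getattr(processor, "chat_template", None) is not None:
        processor.chat_template = normalize(processor.chat_template)
```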

Fix 3: LCP blocked for hybrid models

SSM state can't be rewound — setting layers to None crashes on merge(). LCP match is blocked; the other two fixes enable clean PREFIX matches instead.
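
A hedged sketch of the resulting match policy (function and flag names are illustrative):

```python
# Illustrative guard: hybrid models only take exact PREFIX hits, because
# partially reusing a cache entry would require rewinding GatedDeltaNet SSM
# state, and dropping those layers (setting them to None) crashes on merge().

def classify_match(entry_tokens, prompt_tokens, is_hybrid: bool):
    limit = min(len(entry_tokens), len(prompt_tokens))
    common = 0
    while common < limit and entry_tokens[common] == prompt_tokens[common]:
        common += 1
    if common > 0 and common == len(entry_tokens):
        return "PREFIX", common       # whole stored entry reused as-is
    if is_hybrid:
        return "MISS", 0              # cannot rewind SSM state to `common`
    return "LCP", common              # attention-only caches can be trimmed
```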

Results

Scenario                         Before         After
Real Claude Code (87K tokens)    256s prefill   4.1s (PREFIX HIT, cached=87467)
tool_result → tool_result        PREFIX HIT     PREFIX HIT
tool_result → user_text          MISS           PREFIX HIT (cached=441, remaining=177)
user_text → user_text            PREFIX HIT     PREFIX HIT

Changes from v1 (addressing review feedback)

  1. Reverted qwen3_parser.py no-tag behavior — no-tag output stays as pure content (not reasoning), matching upstream semantics
  2. Removed frequency_penalty/presence_penalty API fields — only repetition_penalty is exposed, avoiding misleading OpenAI-compatible names without proper semantics
  3. Fixed duplicate cache-clear block in mllm_scheduler.py — removed second adaptive Metal cache-clear that caused double _step_count increment

Changes from v2 (prefix cache fix)

  1. Replaced LCP hybrid match with think-suffix stripping — LCP hybrid match (setting SSM layers to None) crashed on merge(). New approach strips <think>\n tokens from cache keys so PREFIX match works cleanly without touching SSM state.

Changes from v3 (template normalization fix)

  1. Added last_query_index template normalization — Qwen3.5 chat template retroactively changes assistant turn formatting based on last_query_index, breaking prefix cache for tool_result→user_text transitions
  2. Fixed VLM processor dual-template bug — processor.chat_template is a separate object from processor.tokenizer.chat_template; both must be patched for normalization to take effect

Test plan

  • uvx black --check vllm_mlx/ passes
  • python -m py_compile passes for all modified files
  • Prefix cache: 62x speedup on Qwen3.5-122B (256s → 4.1s for 87K tokens)
  • Prefix cache: tool_result → user_text transition hits (cached=441 remaining=177)
  • Prefix cache: tool_result → tool_result transition hits (cached=87467 remaining=478)
  • Billing header strip works with prefix cache
  • Think-suffix stripping detects <think>\n = 2 tokens
  • Template normalization patches both processor and tokenizer copies
  • Qwen3.5 model loads with --enable-mtp flag
  • MiniMax tool parser works with --tool-call-parser minimax
  • enable_thinking parameter respected per-request
  • Repetition penalty works correctly
  • Prefill abort fires on client disconnect

🤖 Generated with Claude Code

@janhilgard janhilgard force-pushed the feat/production-backport branch 2 times, most recently from 462b8a8 to c7cb3aa on April 10, 2026 at 22:31
@janhilgard janhilgard requested a review from Thump604 April 10, 2026 22:33
Collaborator

@Thump604 Thump604 left a comment

I pulled this branch locally and do not think it is sweep-merge ready as one omnibus backport. Three concrete blockers stood out:

  1. vllm_mlx/reasoning/qwen3_parser.py changes no-tag output from pure content to pure reasoning. That codifies the exact Qwen failure mode we just debugged locally: the model can return an answer without channel tags, and this change makes that answer invisible in non-streaming unless another layer recovers it. I would not merge that behavior change unconditionally. Either gate it on request-level policy (enable_thinking / explicit recovery flag) or leave the parser's no-tag case as content and handle recovery in the server.

  2. The new frequency_penalty / presence_penalty API is not implemented with the semantics the field names promise. In server.py, both are collapsed onto a single repetition_penalty, and if both are set one of them is silently ignored. That is not an upstream-safe API expansion under OpenAI-compatible names. Either implement the real semantics (the sketch after this comment spells out the difference) or only expose repetition_penalty in this PR.

  3. vllm_mlx/mllm_scheduler.py now has two adaptive Metal cache-clear blocks back-to-back and increments _step_count twice per step. That looks like a bad merge and changes the clear cadence accidentally.

I'm happy to re-review after those are fixed or split out. The rest of the bundle is large enough that I would strongly prefer it land as smaller PRs unless there is a hard dependency forcing the aggregation.
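
To make the distinction in point 2 concrete, a reference sketch of the two semantics (illustrative code, not from this PR):

```python
# Reference sketch, not code from this PR: the two penalty families are not
# interchangeable, so mapping frequency/presence onto repetition_penalty is lossy.
from collections import Counter

def openai_style(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """OpenAI semantics: additive offsets, scaled by how often a token appeared
    (frequency) or merely whether it appeared at all (presence)."""
    for tok, count in Counter(generated_ids).items():
        logits[tok] -= frequency_penalty * count + presence_penalty
    return logits

def repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style repetition penalty (the form mlx-lm uses): multiplicative and
    sign-dependent, with no notion of occurrence counts."""
    for tok in set(generated_ids):
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    return logits
```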

@janhilgard janhilgard force-pushed the feat/production-backport branch from c7cb3aa to e46bdb8 on April 10, 2026 at 22:49
Collaborator Author

@janhilgard janhilgard left a comment

Thanks for the thorough review! All three blockers fixed:

  1. qwen3_parser.py — Reverted to upstream behavior: no-tag output = pure content. The reasoning-as-default change was too aggressive without gating on enable_thinking.

  2. frequency_penalty / presence_penalty — Removed both fields from the API models entirely. Only repetition_penalty (mlx-lm native) is now exposed. No misleading OpenAI-compatible names without proper semantics.

  3. mllm_scheduler.py duplicate cache-clear — Removed the second adaptive block that caused double _step_count increment. Single cache-clear block remains with the original interval logic.

Collaborator

@Thump604 Thump604 left a comment

Re-reviewed the follow-up delta from c7cb3aa to e46bdb8.

The three blockers from my earlier review are addressed:

  • qwen3_parser.py no longer changes the no-tag case
  • the misleading frequency_penalty / presence_penalty API expansion is gone; this is now just repetition_penalty
  • the duplicate adaptive cache-clear / double _step_count issue in mllm_scheduler.py is gone

I do not see a new blocker in this follow-up range.

@janhilgard janhilgard force-pushed the feat/production-backport branch 4 times, most recently from d56426f to c00f4ce on April 11, 2026 at 08:00
feat: backport production features — MTP, tool parsers, sampling, prefill

New files:
- patches/qwen3_5_mllm.py: BatchKVCache offset fix for Qwen3.5
- patches/qwen3_5_mtp.py: Runtime MTP injection for Qwen3.5
- tool_parsers/minimax_tool_parser.py: MiniMax-M2 tool parser
- scripts/add_mtp_weights_qwen35.py: Extract MTP weights from BF16

Key changes:
- mllm_batch_generator: chunked prefill, mid-batch extend, MTP hooks,
  patch registration, repetition penalty, prefill abort, think-suffix
  stripping for prefix cache
- mllm_scheduler: request status, cache config, prefill abort
- server: enable_thinking, tool_choice=none, tool argument coercion
- engines: MTP injection, enable_thinking, gpu_memory_utilization
- memory_cache: block LCP for hybrid models (SSM can't be rewound)

Prefix cache fix: enable_thinking=True adds <think>\n to generation
prompt, breaking PREFIX match across conversation turns.  Strip these
tokens from cache keys in both store and fetch paths so stored entries
match as clean prefixes.  Tested: 3.12s → 0.39s (8x) for 1400-token
prompts on Qwen3.5-122B hybrid model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai

Hey @janhilgard — thanks for backporting these features to upstream! Great to see them landing here.

Just a note: these features (MTP injection, MiniMax tool parser, prefix cache think-suffix stripping, gpu_memory_utilization, DeltaNet/SSM prefix cache fixes) originated from our fork Rapid-MLX, where they've been developed and battle-tested over the past few months.

Would appreciate a mention in the PR description for attribution — something like "backported from Rapid-MLX" would be great. Open source works best when we credit each other's work. 🙏

Happy to collaborate more directly going forward!

@waybarrios
Owner

waybarrios commented Apr 11, 2026

Hey @raullenchai! Thanks for flagging this, totally fair. Rapid-MLX is already in our README acknowledgments. Always down to collaborate more, so feel free to reach out anytime.

@janhilgard
Collaborator Author

Hey @raullenchai — thanks for the note!

I want to respectfully clarify the timeline here. Our fork (janhilgard/vllm-mlx) was created on January 20, 2026 — 36 days before Rapid-MLX was forked (February 25, 2026). The features listed in your comment were developed in our fork first:

Feature                     janhilgard/vllm-mlx   Rapid-MLX      Delta
MTP speculative decoding    Feb 14, 2026          Feb 25, 2026   11 days earlier
MiniMax-M2 tool parser      Feb 18, 2026          Feb 19, 2026   1 day earlier
gpu_memory_utilization      Feb 23, 2026          Feb 25, 2026   2 days earlier
Gemma 4 tool parser         Apr 5, 2026           Apr 6, 2026    1 day earlier

For DeltaNet/SSM prefix cache fixes specifically — yes, Rapid-MLX had those earlier (March 15). Our think-suffix stripping for Qwen3.5 hybrid models was developed independently in April.

These features have been running in production on our Apple Silicon setup (Qwen3.5-122B, Gemma 4, etc.) since February/March. PR #278 is a backport of that production-tested code to upstream.

Both forks clearly arrived at similar solutions independently — convergent engineering from working on the same codebase. Happy to collaborate going forward!

@waybarrios
Owner

Hi both, appreciate the transparency on the timeline. That said, I don't think there's a need to get into who did what first. Both forks have been doing great work independently, and that's exactly what open source is about.

@janhilgard, you're right that many of these features came through in earlier PRs that I hadn't had the bandwidth to review until now. That's on me. @raullenchai, Rapid-MLX has been doing solid work and the acknowledgment in the README is well deserved. Worth noting that this repo is the upstream parent of Rapid-MLX, so there's naturally a lot of shared DNA between the two.

At the end of the day, we're all building on the same codebase and pushing it forward. No competition here, just collaboration. Let's keep it that way.
