fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection #97

Merged
waybarrios merged 2 commits into waybarrios:main from janhilgard:fix/hybrid-model-batching-mtp-injection
Mar 22, 2026

Conversation

@janhilgard
Collaborator

Summary

  • ensure_mamba_support() is now a no-op: in mlx-lm >= 0.30.6, ArraysCache natively supports all batch operations (extract, merge, filter, prepare). The old monkey-patch replaced ArraysCache with BatchMambaCache in _make_cache, which broke hybrid models like Qwen3.5-397B that mix ArraysCache and KVCache layers.
  • Add inject_mtp_support(): Dynamically creates MTP module, quantizes it to match the base model, loads weights from model-mtp.safetensors, and monkey-patches the model class with return_hidden, mtp_forward, and make_mtp_cache.
  • Add MTP auto-injection on load: _try_inject_mtp_post_load() detects when sanitize() strips MTP weights during mlx_lm.load() and re-injects them. Also adds strict=False fallback for models with extra MTP parameters.
  • validate_mtp_support(): Support model.language_model.args hierarchy for models with nested configs (e.g., multimodal models).
  • Engine error logging: Add full traceback to engine loop error handler for easier debugging.
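The injection described above follows a class-level monkey-patching pattern: methods are attached to the model's class so the loaded instance gains them. A hedged sketch of the mechanism only — DummyModel and every method body here are stand-ins, not the real mlx-lm code, and the real inject_mtp_support() additionally quantizes the MTP module and loads model-mtp.safetensors:

```python
class DummyModel:
    """Stand-in for an mlx-lm model class (hypothetical)."""
    def __call__(self, tokens):
        return [t * 2 for t in tokens]  # fake logits

def inject_mtp_support(model):
    """Attach MTP-related methods to the model's *class*, so the
    instance (and any future instances) gains them."""
    cls = type(model)

    def return_hidden(self, tokens):
        # Real code would run the trunk and return hidden states.
        return [t + 1 for t in tokens]

    def mtp_forward(self, hidden):
        # Real code would run the MTP head over the hidden states.
        return [h * 10 for h in hidden]

    def make_mtp_cache(self):
        # Real code would build per-layer caches for the MTP module.
        return {}

    cls.return_hidden = return_hidden
    cls.mtp_forward = mtp_forward
    cls.make_mtp_cache = make_mtp_cache
    return model

model = inject_mtp_support(DummyModel())
print(model.mtp_forward(model.return_hidden([1, 2])))  # [20, 30]
```

Patching the class rather than the instance means every code path that constructs the same model type sees the MTP methods, which is what lets the post-load re-injection work without touching mlx-lm's loader.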

Motivation

When testing Qwen3.5-397B (hybrid GatedDeltaNet + full attention architecture), two issues surfaced:

  1. The _make_cache monkey-patch assumed all layers use the same cache type. Hybrid models have a mix of ArraysCache (for linear attention/SSM layers) and KVCache (for full attention layers). The patch converted ALL ArraysCache to BatchMambaCache, breaking the hybrid cache structure. Since mlx-lm 0.30.6, ArraysCache has native batch support, making the patch unnecessary.

  2. mlx_lm.load() calls sanitize() which strips unrecognized weights (including MTP). Models with num_nextn_predict_layers > 0 in config but no MTP module definition in mlx-lm's model code would silently lose MTP capability. The auto-injection detects this and recovers MTP support.
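The first issue can be illustrated with a minimal sketch. The class names mirror mlx-lm's cache types, but these are empty stand-ins, and layer_types is a made-up hybrid layout; the point is only how a blanket conversion desynchronizes the per-layer cache list:

```python
class ArraysCache:      # used by SSM / linear-attention layers
    pass

class KVCache:          # used by full-attention layers
    pass

class BatchMambaCache:  # what the old monkey-patch substituted in
    pass

# A hybrid model interleaves the two layer types (layout hypothetical):
layer_types = ["linear_attention", "full_attention", "linear_attention"]

def make_cache(layer_types):
    # One cache object per layer, chosen by layer type.
    return [ArraysCache() if t == "linear_attention" else KVCache()
            for t in layer_types]

def old_patch(cache):
    # The removed monkey-patch converted *every* ArraysCache,
    # handing SSM layers a cache type their forward pass doesn't expect.
    return [BatchMambaCache() if isinstance(c, ArraysCache) else c
            for c in cache]

patched = old_patch(make_cache(layer_types))
print([type(c).__name__ for c in patched])
# ['BatchMambaCache', 'KVCache', 'BatchMambaCache']
```

Since ArraysCache now handles batch extract/merge/filter/prepare itself, leaving the per-layer cache list untouched is both simpler and correct for hybrids.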

Test plan

  • Qwen3-Next-80B with MTP enabled — ~78 tok/s, MTP correctly injected
  • GPT-OSS-20B — continuous batching works without regression
  • Qwen3-VL-8B — multimodal model loads correctly
  • Verify ensure_mamba_support() logs skip message instead of patching
  • Test strict=False fallback path with a model that has extra MTP weights

🤖 Generated with Claude Code

@janhilgard janhilgard force-pushed the fix/hybrid-model-batching-mtp-injection branch from 432089b to 6293297 on February 18, 2026 at 20:33
@Thump604
Collaborator

Tested the mamba_cache.py change (disabling the monkey-patch) in isolation on Mac Studio M2 Ultra (128GB) with Nemotron-3-Super-120B-A12B at 4.5-bit via vllm-mlx serving.

Results: All tests pass — chat completions, tool calling, and streaming work correctly with the disabled _make_cache patch. The Nemotron tokenizer fallback loads correctly and the model serves without issues.

Setup: Combined with ml-explore/mlx-lm PRs #988 (SSM precision) and #992 (LatentMoE), which are required for Nemotron Super to produce coherent output.

One concern about the broader PR: The server.py change removes exclude_none=True from the MLLM message dict conversion. That exclude_none was added specifically to prevent a Qwen3-VL crash where image_url: null on text parts triggers an error in the chat template. Reverting it may reintroduce that regression for multimodal models. Could this be split out or preserved?

The mamba_cache.py and scheduler.py changes look clean and correct.

@Thump604
Collaborator

Follow-up: Applied the full PR (all 9 files) and tested both Nemotron 3 Super 120B (text-only LLM path) and Qwen 3.5 35B (MLLM path).

Nemotron (LLM path): Chat, tool calling, and streaming all pass. No regressions.

Qwen 3.5 35B (MLLM path): Chat works. Tool calling doesn't emit tool_calls — but this is a pre-existing issue, not a PR #97 regression. Confirmed by reverting just server.py to the original and retesting: same behavior. The exclude_none concern I raised earlier is a non-issue.

Full PR is safe to merge. All changes tested on Mac Studio M2 Ultra 128GB.

…injection

- ensure_mamba_support() now no-op: mlx-lm >= 0.30.6 ArraysCache has
  native batch support; old patch broke hybrid models (ArraysCache + KVCache)
- Add inject_mtp_support(): dynamically create MTP module, load weights,
  and monkey-patch model class with return_hidden/mtp_forward/make_mtp_cache
- Add _try_inject_mtp_post_load: auto-detect and inject MTP weights
  stripped by sanitize() during mlx_lm.load()
- Add strict=False fallback for models with extra MTP parameters
- validate_mtp_support: support model.language_model.args hierarchy
- Improve engine loop error logging with full traceback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard force-pushed the fix/hybrid-model-batching-mtp-injection branch from 6293297 to c70b80b on March 21, 2026 at 22:22
@janhilgard
Collaborator Author

@Thump604 Thanks for the thorough testing on both Nemotron 120B and Qwen 3.5 35B — really appreciate the follow-up confirming exclude_none is a non-issue.

Good news: PR has been freshly rebased onto main and is mergeable. Note that the current diff is now 4 files (engine_core.py, patches/qwen3_next_mtp.py, utils/mamba_cache.py, utils/tokenizer.py) — the server.py change you tested is no longer part of this PR since it was already addressed upstream.

@waybarrios This one has been validated by an independent tester on two different architectures (Nemotron hybrid SSM+attention, Qwen 3.5 MoE). Ready for review when you get a chance.

@waybarrios
Owner

reviewed the full diff. the mamba_cache no-op, MTP injection, and engine traceback logging all look good. tested by @Thump604 on two architectures (Nemotron 120B and Qwen 3.5 35B), CI green, LGTM from @janhilgard

found dead code in tokenizer.py inside _load_strict_false where _try_inject_mtp_post_load and a second return were unreachable after the first return. pushed a fix removing those 3 lines since _try_inject_mtp above the return already handles MTP injection for the strict=False path

# before (dead code after return)
_try_inject_mtp(model, model_path, config)
return model, tokenizer

_try_inject_mtp_post_load(model, model_name)  # never reached
return model, tokenizer                        # never reached

# after
_try_inject_mtp(model, model_path, config)
return model, tokenizer

merging now

@waybarrios waybarrios merged commit d235c37 into waybarrios:main Mar 22, 2026
6 checks passed
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 22, 2026
Port SimpleEngine features to BatchedEngine for continuous batching:

- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system,
  merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180, waybarrios#97).
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

3 participants