fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection #97

Merged
waybarrios merged 2 commits into waybarrios:main from janhilgard:fix/hybrid-model-batching-mtp-injection
Mar 22, 2026

Conversation

@janhilgard
Collaborator

Summary

  • ensure_mamba_support() is now a no-op: in mlx-lm >= 0.30.6, ArraysCache natively supports all batch operations (extract, merge, filter, prepare). The old monkey-patch replaced ArraysCache with BatchMambaCache in _make_cache, which broke hybrid models like Qwen3.5-397B that mix ArraysCache and KVCache layers.
  • Add inject_mtp_support(): Dynamically creates MTP module, quantizes it to match the base model, loads weights from model-mtp.safetensors, and monkey-patches the model class with return_hidden, mtp_forward, and make_mtp_cache.
  • Add MTP auto-injection on load: _try_inject_mtp_post_load() detects when sanitize() strips MTP weights during mlx_lm.load() and re-injects them. Also adds strict=False fallback for models with extra MTP parameters.
  • validate_mtp_support(): Support model.language_model.args hierarchy for models with nested configs (e.g., multimodal models).
  • Engine error logging: Add full traceback to engine loop error handler for easier debugging.
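The injection described above follows a class-level monkey-patching pattern: methods are attached to the model's class so the loaded instance gains them. A hedged sketch of the mechanism only — DummyModel and every method body here are stand-ins, not the real mlx-lm code, and the real inject_mtp_support() additionally quantizes the MTP module and loads model-mtp.safetensors:

```python
class DummyModel:
    """Stand-in for an mlx-lm model class (hypothetical)."""
    def __call__(self, tokens):
        return [t * 2 for t in tokens]  # fake logits

def inject_mtp_support(model):
    """Attach MTP-related methods to the model's *class*, so the
    instance (and any future instances) gains them."""
    cls = type(model)

    def return_hidden(self, tokens):
        # Real code would run the trunk and return hidden states.
        return [t + 1 for t in tokens]

    def mtp_forward(self, hidden):
        # Real code would run the MTP head over the hidden states.
        return [h * 10 for h in hidden]

    def make_mtp_cache(self):
        # Real code would build per-layer caches for the MTP module.
        return {}

    cls.return_hidden = return_hidden
    cls.mtp_forward = mtp_forward
    cls.make_mtp_cache = make_mtp_cache
    return model

model = inject_mtp_support(DummyModel())
print(model.mtp_forward(model.return_hidden([1, 2])))  # [20, 30]
```

Patching the class rather than the instance means every code path that constructs the same model type sees the MTP methods, which is what lets the post-load re-injection work without touching mlx-lm's loader.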

Motivation

When testing Qwen3.5-397B (hybrid GatedDeltaNet + full attention architecture), two issues surfaced:

  1. The _make_cache monkey-patch assumed all layers use the same cache type. Hybrid models have a mix of ArraysCache (for linear attention/SSM layers) and KVCache (for full attention layers). The patch converted ALL ArraysCache to BatchMambaCache, breaking the hybrid cache structure. Since mlx-lm 0.30.6, ArraysCache has native batch support, making the patch unnecessary.

  2. mlx_lm.load() calls sanitize() which strips unrecognized weights (including MTP). Models with num_nextn_predict_layers > 0 in config but no MTP module definition in mlx-lm's model code would silently lose MTP capability. The auto-injection detects this and recovers MTP support.
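The first issue can be illustrated with a minimal sketch. The class names mirror mlx-lm's cache types, but these are empty stand-ins, and layer_types is a made-up hybrid layout; the point is only how a blanket conversion desynchronizes the per-layer cache list:

```python
class ArraysCache:      # used by SSM / linear-attention layers
    pass

class KVCache:          # used by full-attention layers
    pass

class BatchMambaCache:  # what the old monkey-patch substituted in
    pass

# A hybrid model interleaves the two layer types (layout hypothetical):
layer_types = ["linear_attention", "full_attention", "linear_attention"]

def make_cache(layer_types):
    # One cache object per layer, chosen by layer type.
    return [ArraysCache() if t == "linear_attention" else KVCache()
            for t in layer_types]

def old_patch(cache):
    # The removed monkey-patch converted *every* ArraysCache,
    # handing SSM layers a cache type their forward pass doesn't expect.
    return [BatchMambaCache() if isinstance(c, ArraysCache) else c
            for c in cache]

patched = old_patch(make_cache(layer_types))
print([type(c).__name__ for c in patched])
# ['BatchMambaCache', 'KVCache', 'BatchMambaCache']
```

Since ArraysCache now handles batch extract/merge/filter/prepare itself, leaving the per-layer cache list untouched is both simpler and correct for hybrids.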

Test plan

  • Qwen3-Next-80B with MTP enabled — ~78 tok/s, MTP correctly injected
  • GPT-OSS-20B — continuous batching works without regression
  • Qwen3-VL-8B — multimodal model loads correctly
  • Verify ensure_mamba_support() logs skip message instead of patching
  • Test strict=False fallback path with a model that has extra MTP weights

🤖 Generated with Claude Code

@janhilgard janhilgard force-pushed the fix/hybrid-model-batching-mtp-injection branch from 432089b to 6293297 on February 18, 2026 at 20:33
@Thump604
Collaborator

Tested the mamba_cache.py change (disabling the monkey-patch) in isolation on Mac Studio M2 Ultra (128GB) with Nemotron-3-Super-120B-A12B at 4.5-bit via vllm-mlx serving.

Results: All tests pass — chat completions, tool calling, and streaming work correctly with the disabled _make_cache patch. The Nemotron tokenizer fallback loads correctly and the model serves without issues.

Setup: Combined with ml-explore/mlx-lm PRs #988 (SSM precision) and #992 (LatentMoE), which are required for Nemotron Super to produce coherent output.

One concern about the broader PR: The server.py change removes exclude_none=True from the MLLM message dict conversion. That exclude_none was added specifically to prevent a Qwen3-VL crash where image_url: null on text parts triggers an error in the chat template. Reverting it may reintroduce that regression for multimodal models. Could this be split out or preserved?

The mamba_cache.py and scheduler.py changes look clean and correct.

@Thump604
Collaborator

Follow-up: Applied the full PR (all 9 files) and tested both Nemotron 3 Super 120B (text-only LLM path) and Qwen 3.5 35B (MLLM path).

Nemotron (LLM path): Chat, tool calling, and streaming all pass. No regressions.

Qwen 3.5 35B (MLLM path): Chat works. Tool calling doesn't emit tool_calls — but this is a pre-existing issue, not a PR #97 regression. Confirmed by reverting just server.py to the original and retesting: same behavior. The exclude_none concern I raised earlier is a non-issue.

Full PR is safe to merge. All changes tested on Mac Studio M2 Ultra 128GB.

…injection

- ensure_mamba_support() now no-op: mlx-lm >= 0.30.6 ArraysCache has
  native batch support; old patch broke hybrid models (ArraysCache + KVCache)
- Add inject_mtp_support(): dynamically create MTP module, load weights,
  and monkey-patch model class with return_hidden/mtp_forward/make_mtp_cache
- Add _try_inject_mtp_post_load: auto-detect and inject MTP weights
  stripped by sanitize() during mlx_lm.load()
- Add strict=False fallback for models with extra MTP parameters
- validate_mtp_support: support model.language_model.args hierarchy
- Improve engine loop error logging with full traceback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard force-pushed the fix/hybrid-model-batching-mtp-injection branch from 6293297 to c70b80b on March 21, 2026 at 22:22
@janhilgard
Collaborator Author

@Thump604 Thanks for the thorough testing on both Nemotron 120B and Qwen 3.5 35B — really appreciate the follow-up confirming exclude_none is a non-issue.

Good news: PR has been freshly rebased onto main and is mergeable. Note that the current diff is now 4 files (engine_core.py, patches/qwen3_next_mtp.py, utils/mamba_cache.py, utils/tokenizer.py) — the server.py change you tested is no longer part of this PR since it was already addressed upstream.

@waybarrios This one has been validated by an independent tester on two different architectures (Nemotron hybrid SSM+attention, Qwen 3.5 MoE). Ready for review when you get a chance.

@waybarrios
Owner

reviewed the full diff. the mamba_cache no-op, MTP injection, and engine traceback logging all look good. tested by @Thump604 on two architectures (Nemotron 120B and Qwen 3.5 35B), CI green, LGTM from @janhilgard

found dead code in tokenizer.py inside _load_strict_false where _try_inject_mtp_post_load and a second return were unreachable after the first return. pushed a fix removing those 3 lines since _try_inject_mtp above the return already handles MTP injection for the strict=False path

# before (dead code after return)
_try_inject_mtp(model, model_path, config)
return model, tokenizer

_try_inject_mtp_post_load(model, model_name)  # never reached
return model, tokenizer                        # never reached

# after
_try_inject_mtp(model, model_path, config)
return model, tokenizer

merging now

@waybarrios waybarrios merged commit d235c37 into waybarrios:main Mar 22, 2026
6 checks passed
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 22, 2026
Port SimpleEngine features to BatchedEngine for continuous batching:

- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system,
  merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180, waybarrios#97).
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

3 participants