fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (#97)
Conversation
Force-pushed from 432089b to 6293297
Tested. Results: All tests pass — chat completions, tool calling, and streaming work correctly with the `ensure_mamba_support` patch disabled.

Setup: Combined with ml-explore/mlx-lm PRs #988 (SSM precision) and #992 (LatentMoE), which are required for Nemotron Super to produce coherent output.

One concern about the broader PR: The The
Follow-up: Applied the full PR (all 9 files) and tested both Nemotron 3 Super 120B (text-only LLM path) and Qwen 3.5 35B (MLLM path).

- Nemotron (LLM path): Chat, tool calling, and streaming all pass. No regressions.
- Qwen 3.5 35B (MLLM path): Chat works. Tool calling doesn't emit tool_calls — but this is a pre-existing issue, not a PR #97 regression. Confirmed by reverting just this change.

Full PR is safe to merge. All changes tested on a Mac Studio M2 Ultra 128GB.
fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection

- ensure_mamba_support() is now a no-op: mlx-lm >= 0.30.6 ArraysCache has native batch support; the old patch broke hybrid models (ArraysCache + KVCache)
- Add inject_mtp_support(): dynamically create the MTP module, load weights, and monkey-patch the model class with return_hidden/mtp_forward/make_mtp_cache
- Add _try_inject_mtp_post_load: auto-detect and inject MTP weights stripped by sanitize() during mlx_lm.load()
- Add strict=False fallback for models with extra MTP parameters
- validate_mtp_support: support the model.language_model.args hierarchy
- Improve engine loop error logging with full traceback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
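The version gate behind the no-op can be sketched as follows. `ensure_mamba_support` is the function named in the commit, but the version-string argument and string return values here are illustrative assumptions — the real function presumably inspects the installed mlx-lm directly:

```python
def _parse(ver: str) -> tuple:
    """Parse 'major.minor.patch' into a comparable tuple."""
    return tuple(int(p) for p in ver.split(".")[:3])

def has_native_batch_cache(mlx_lm_version: str) -> bool:
    """mlx-lm >= 0.30.6 ships ArraysCache with native batch operations
    (extract/merge/filter/prepare), so no monkey-patch is needed."""
    return _parse(mlx_lm_version) >= (0, 30, 6)

def ensure_mamba_support(mlx_lm_version: str) -> str:
    if has_native_batch_cache(mlx_lm_version):
        # log a skip message and leave _make_cache untouched; the old
        # blanket BatchMambaCache substitution broke hybrid models
        return "skip"
    return "patch"  # legacy patch path for older mlx-lm
```

For example, `ensure_mamba_support("0.30.6")` returns `"skip"`, while `"0.30.5"` would still take the legacy patch path.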
Force-pushed from 6293297 to c70b80b
@Thump604 Thanks for the thorough testing on both Nemotron 120B and Qwen 3.5 35B — really appreciate the follow-up confirming the tool-calling issue is pre-existing. Good news: the PR has been freshly rebased onto main and is mergeable. Note that the current diff is now 4 files.

@waybarrios This one has been validated by an independent tester on two different architectures (Nemotron hybrid SSM+attention, Qwen 3.5 MoE). Ready for review when you get a chance.
Reviewed the full diff. The mamba_cache no-op, MTP injection, and engine traceback logging all look good. Tested by @Thump604 on two architectures (Nemotron 120B and Qwen 3.5 35B), CI green, LGTM from @janhilgard.

Found dead code in tokenizer.py inside `_load_strict_false`, where `_try_inject_mtp_post_load` and a second return were unreachable after the first return. Pushed a fix removing those 3 lines, since `_try_inject_mtp` above the return already handles MTP injection for the strict=False path:

```python
# before (dead code after return)
_try_inject_mtp(model, model_path, config)
return model, tokenizer
_try_inject_mtp_post_load(model, model_name)  # never reached
return model, tokenizer  # never reached

# after
_try_inject_mtp(model, model_path, config)
return model, tokenizer
```

Merging now.
Port SimpleEngine features to BatchedEngine for continuous batching:

- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system, merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180, waybarrios#97).
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `ensure_mamba_support()` → no-op: mlx-lm >= 0.30.6 `ArraysCache` natively supports all batch operations (`extract`, `merge`, `filter`, `prepare`). The old monkey-patch replaced `ArraysCache` with `BatchMambaCache` in `_make_cache`, which broke hybrid models like Qwen3.5-397B that mix `ArraysCache` + `KVCache` layers.
- `inject_mtp_support()`: Dynamically creates the MTP module, quantizes it to match the base model, loads weights from `model-mtp.safetensors`, and monkey-patches the model class with `return_hidden`, `mtp_forward`, and `make_mtp_cache`.
- `_try_inject_mtp_post_load()` detects when `sanitize()` strips MTP weights during `mlx_lm.load()` and re-injects them. Also adds a `strict=False` fallback for models with extra MTP parameters.
- `validate_mtp_support()`: Supports the `model.language_model.args` hierarchy for models with nested configs (e.g., multimodal models).

Motivation
When testing Qwen3.5-397B (hybrid GatedDeltaNet + full attention architecture), two issues surfaced:
1. The `_make_cache` monkey-patch assumed all layers use the same cache type. Hybrid models have a mix of `ArraysCache` (for linear attention/SSM layers) and `KVCache` (for full attention layers). The patch converted ALL `ArraysCache` to `BatchMambaCache`, breaking the hybrid cache structure. Since mlx-lm 0.30.6, `ArraysCache` has native batch support, making the patch unnecessary.
2. `mlx_lm.load()` calls `sanitize()`, which strips unrecognized weights (including MTP). Models with `num_nextn_predict_layers > 0` in config but no MTP module definition in mlx-lm's model code would silently lose MTP capability. The auto-injection detects this and recovers MTP support.

Test plan
- `ensure_mamba_support()` logs a skip message instead of patching
- `strict=False` fallback path tested with a model that has extra MTP weights

🤖 Generated with Claude Code
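To make the hybrid-cache failure from the Motivation section concrete, here is a toy sketch; the class names stand in for the real mlx-lm types and carry no real behavior:

```python
class KVCache: pass          # stand-in: full-attention layers
class ArraysCache: pass      # stand-in: SSM / linear-attention layers
class BatchMambaCache: pass  # stand-in: what the old patch substituted

def make_cache(layer_kinds):
    """Per-layer caches as a hybrid model expects them."""
    return [KVCache() if k == "attention" else ArraysCache() for k in layer_kinds]

def old_patch(caches):
    """The removed monkey-patch: blanket-converts every ArraysCache,
    destroying the ArraysCache/KVCache mix that hybrid layers rely on."""
    return [BatchMambaCache() if isinstance(c, ArraysCache) else c for c in caches]

hybrid = make_cache(["ssm", "attention", "ssm"])
patched = old_patch(hybrid)
# the two SSM layers now receive a cache type their forward pass never expected
```

A homogeneous Mamba model never noticed the swap; only a mixed layer stack exposes it, which is why the patch survived until hybrid models like Qwen3.5-397B arrived.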