Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming #127
…ol streaming

- Add strict=False fallback in tokenizer loader for models with extra weights (e.g., vision tower params), enabling Qwen3.5 to load via mlx-lm as a text-only model
- Fix streaming tool call parsing when both --reasoning-parser and --tool-call-parser are enabled (previously mutually exclusive branches)
- Make memory pressure threshold dynamic based on system RAM instead of hardcoded 200GB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…reaming fix, dynamic memory

- strict=False fallback for models with extra vision tower weights
- Fix streaming when both --reasoning-parser and --tool-call-parser enabled
- Dynamic memory pressure threshold based on Metal max_recommended_working_set_size

Tested with Qwen3.5-397B-A17B-4bit on 512GB unified memory.

Cherry-picked from: waybarrios#127
Original author: otarkhan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ios#127)

- strict=False loading fallback (conflict resolved: kept PR waybarrios#140's version with memory cleanup)
- Dynamic memory pressure threshold (was hardcoded 200GB, now 85% of Metal max)
- Fix reasoning+tool streaming coexistence
…en_ids in strict=False loader
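The dynamic memory-pressure threshold described in these commit messages can be sketched as follows. This is an illustrative helper, not the PR's actual code: in the engine the working-set size would come from `mlx.core.metal.device_info()["max_recommended_working_set_size"]`, but it is a plain argument here so the arithmetic is portable.

```python
# Sketch of a dynamic memory-pressure threshold (hypothetical helper name).
# 85% of Metal's recommended working set, per the PR description, replacing
# the old hardcoded 200GB cutoff.

def memory_pressure_threshold(max_recommended_working_set_size: int,
                              fraction: float = 0.85) -> int:
    """Bytes of active memory above which the engine treats RAM as under pressure."""
    return int(max_recommended_working_set_size * fraction)

# On a 512GB unified-memory machine the recommended working set is close to
# physical RAM, so the threshold scales to roughly 435GB instead of 200GB.
threshold = memory_pressure_threshold(512 * 1024**3)
```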
Thanks for this — the tokenizer `strict=False` fallback and the dynamic memory threshold are solid changes. I made two small adjustments and pushed them to your branch:
```python
model, config = load_model(model_path, strict=False)
tokenizer = load_tokenizer(
    model_path,
    tokenizer_config or {},
    eos_token_ids=config.get("eos_token_id", None),
)
```

Also removed the redundant …; the rest of the PR (tokenizer fallback + dynamic memory pressure) looks good to merge. Let me know if you have any questions.
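For reference, the fallback path that produces a `strict=False` call like the one above might be shaped like this. A sketch, not the PR's actual `_load_strict_false()`: the stub `load_model` and the exact error text are illustrative stand-ins.

```python
# Sketch of a strict -> strict=False loading fallback. The stub below stands
# in for mlx-lm's load_model; the "parameters not in model" message mirrors
# the error this PR describes, but the exact wording is illustrative.

def _stub_load_model(model_path, strict=True):
    if strict:
        # Simulate mlx-lm rejecting a checkpoint that carries vision tower weights.
        raise ValueError("Received parameters not in model: vision_tower.patch_embed ...")
    return {"path": model_path, "text_only": True}

def load_text_only(model_path, load_model=_stub_load_model):
    try:
        return load_model(model_path)  # normal strict load first
    except ValueError as exc:
        if "parameters not in model" not in str(exc):
            raise  # unrelated failure, don't mask it
        # Extra weights (e.g. the vision tower): drop them, load text-only.
        return load_model(model_path, strict=False)

model = load_text_only("mlx-community/Qwen3.5-397B-A17B-4bit")
```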
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Add a `strict=False` fallback in the tokenizer loader for models whose weight files contain extra parameters (e.g., vision tower weights). When mlx-lm rejects a model with "parameters not in model", the loader retries with `strict=False` to discard the unrecognized weights and load the model as text-only.
- Fix streaming tool call parsing: when `--reasoning-parser` and `--tool-call-parser` were enabled simultaneously, tool calls were never parsed during streaming; the raw XML markup (e.g., `<tool_call><function=bash>...</function></tool_call>`) leaked through as content. Now the reasoning parser's content output is piped through the tool parser before being emitted.
- Make the memory pressure threshold dynamic: the engine now queries Metal's `max_recommended_working_set_size` and sets the threshold to 85% of that value, scaling correctly with system RAM.

Why text-only for Qwen3.5?
Qwen3.5 models (e.g., Qwen3.5-397B-A17B) are vision-text models that include a vision tower (SigLIP encoder). Loading them through the MLLM path (mlx-vlm) fails because:

- mlx-vlm does not support the `qwen3_5_moe` architecture yet
- the architecture uses `ArraysCache` instead of `KVCache`, which is incompatible with the MLLM batch generator's continuous batching (slicing, merging, and cache management all assume `KVCache`)

Loading via mlx-lm with `strict=False` discards the ~333 vision tower parameters and serves the model as text-only. This works well with continuous batching, prefix caching, reasoning, and tool calling.

Files changed
- `vllm_mlx/utils/tokenizer.py`: add `_load_strict_false()` and a fallback for the "parameters not in model" error
- `vllm_mlx/server.py`: integrate tool call parsing into the reasoning parser streaming branch
- `vllm_mlx/engine_core.py`: dynamic memory pressure threshold based on Metal device info

Testing
Manually tested with:
```shell
vllm-mlx serve mlx-community/Qwen3.5-397B-A17B-4bit \
  --host 0.0.0.0 \
  --port 8081 \
  --max-num-seqs 5 \
  --cache-memory-percent 0.2 \
  --enable-prefix-cache \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --continuous-batching \
  --max-tokens 262144 \
  --served-model-name Qwen3.5-397B-A17B
```

Verified:
- Reasoning (`<think>` blocks) extracted correctly
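The streaming coexistence fix can be pictured roughly like this. It is a sketch with callable stand-ins for the parsers — the project's real parser interfaces are stateful and differ — but it shows the key change: the reasoning parser's content half is fed through the tool parser instead of being emitted raw.

```python
# Sketch: during streaming, the reasoning parser splits each delta into
# reasoning vs. visible content, and the content half is then run through
# the tool-call parser. The two callables are illustrative stand-ins.

def stream_deltas(deltas, split_reasoning, extract_tool_calls):
    for delta in deltas:
        reasoning, content = split_reasoning(delta)
        if reasoning:
            yield {"reasoning_content": reasoning}
        if content:
            text, tool_calls = extract_tool_calls(content)
            if text:
                yield {"content": text}
            for call in tool_calls:
                yield {"tool_call": call}

# Toy parsers: treat a whole "<think>...</think>" delta as reasoning, and a
# whole "<tool_call>...</tool_call>" span as one tool call (real parsers
# handle partial tags across deltas).
def toy_split(delta):
    if delta.startswith("<think>") and delta.endswith("</think>"):
        return delta[len("<think>"):-len("</think>")], ""
    return "", delta

def toy_tools(content):
    if content.startswith("<tool_call>") and content.endswith("</tool_call>"):
        return "", [content[len("<tool_call>"):-len("</tool_call>")]]
    return content, []

events = list(stream_deltas(
    ["<think>plan</think>", "hello ", "<tool_call>bash(ls)</tool_call>"],
    toy_split, toy_tools))
# events: a reasoning chunk, a plain content chunk, then a parsed tool call
```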