
Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming #127

Merged
waybarrios merged 2 commits into waybarrios:main from otarkhan:feature/qwen3.5-support
Mar 21, 2026
Conversation

@otarkhan
Contributor

Summary

  • Qwen3.5 text-only loading: Adds a strict=False fallback in the tokenizer loader for models whose weight files contain extra parameters (e.g., vision tower weights). When mlx-lm rejects a model with "parameters not in model", the loader retries with strict=False to discard the unrecognized weights and load the model as text-only.
  • Fix streaming tool calls with reasoning parser: The streaming code path had mutually exclusive branches for reasoning parsing and tool call parsing. When both --reasoning-parser and --tool-call-parser were enabled simultaneously, tool calls were never parsed during streaming — the raw XML markup (e.g., <tool_call><function=bash>...</function></tool_call>) leaked through as content. Now the reasoning parser's content output is piped through the tool parser before being emitted.
  • Dynamic memory pressure threshold: The emergency memory pressure threshold was hardcoded at 200GB, which caused constant forced cache clears on high-memory systems running large models. It now dynamically queries Metal's max_recommended_working_set_size and sets the threshold to 85% of that value, scaling correctly with system RAM.
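The streaming fix in the second bullet can be sketched as follows. These toy parsers are illustrative stand-ins, not the actual vllm-mlx parser classes; they only show the pipeline order the fix establishes: the reasoning parser runs first, and its content output is piped through the tool parser instead of being emitted directly.

```python
# Toy sketch of the streaming pipeline order after the fix. Assumes each
# <think>...</think> and <tool_call>...</tool_call> pair arrives within a
# single delta, which real streaming parsers cannot assume.

def split_reasoning(delta: str) -> tuple[str, str]:
    """Separate <think>...</think> text from the remaining content."""
    if "<think>" in delta and "</think>" in delta:
        pre, rest = delta.split("<think>", 1)
        thought, post = rest.split("</think>", 1)
        return thought, pre + post
    return "", delta

def strip_tool_calls(content: str, tool_calls: list) -> str:
    """Extract <tool_call>...</tool_call> markup so it never leaks as content."""
    while "<tool_call>" in content and "</tool_call>" in content:
        pre, rest = content.split("<tool_call>", 1)
        call, post = rest.split("</tool_call>", 1)
        tool_calls.append(call)
        content = pre + post
    return content

def stream_chunk(delta: str, tool_calls: list) -> dict:
    reasoning, content = split_reasoning(delta)
    # the fix: content passes through the tool parser before being emitted,
    # instead of the two parsers living in mutually exclusive branches
    content = strip_tool_calls(content, tool_calls)
    return {"reasoning": reasoning, "content": content}
```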

Why text-only for Qwen3.5?

Qwen3.5 models (e.g., Qwen3.5-397B-A17B) are vision-text models that include a vision tower (SigLIP encoder). Loading them through the MLLM path (mlx-vlm) fails because:

  1. mlx-vlm does not support the qwen3_5_moe architecture yet
  2. Even if it did, the MoE layers produce ArraysCache instead of KVCache, which is incompatible with the MLLM batch generator's continuous batching (slicing, merging, and cache management all assume KVCache)

Loading via mlx-lm with strict=False discards the ~333 vision tower parameters and serves the model as text-only. This works well with continuous batching, prefix caching, reasoning, and tool calling.
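The retry logic described above can be sketched as follows. The real change lives in vllm_mlx/utils/tokenizer.py and calls mlx-lm's loader; here the loader is stubbed so the fallback itself is runnable. The error-message match ("parameters not in model") is the signal the PR keys on.

```python
# Self-contained sketch of the strict=False fallback. load_model here is a
# stub standing in for mlx-lm's loader, which rejects extra weights (such as
# a vision tower) when loading strictly.

def load_model(model_path: str, strict: bool = True):
    if strict:
        raise ValueError("Received parameters not in model: vision_tower...")
    return {"name": model_path, "text_only": True}

def load_with_fallback(model_path: str):
    try:
        return load_model(model_path)  # try a normal strict load first
    except ValueError as e:
        if "parameters not in model" not in str(e):
            raise  # unrelated error: propagate unchanged
        # retry, discarding the unrecognized (e.g. vision tower) weights
        return load_model(model_path, strict=False)
```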

Files changed

  • vllm_mlx/utils/tokenizer.py — Add _load_strict_false() and fallback for "parameters not in model" error
  • vllm_mlx/server.py — Integrate tool call parsing into the reasoning parser streaming branch
  • vllm_mlx/engine_core.py — Dynamic memory pressure threshold based on Metal device info

Testing

Manually tested with:

vllm-mlx serve mlx-community/Qwen3.5-397B-A17B-4bit \
    --host 0.0.0.0 \
    --port 8081 \
    --max-num-seqs 5 \
    --cache-memory-percent 0.2 \
    --enable-prefix-cache \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --continuous-batching \
    --max-tokens 262144 \
    --served-model-name Qwen3.5-397B-A17B

Verified:

  • Model loads successfully with vision tower weights discarded
  • Reasoning (<think> blocks) extracted correctly
  • Tool calls parsed correctly during streaming (multi-turn tool use works)
  • No memory pressure warnings on 512GB unified memory system
  • Prefix cache functional with continuous batching

…ol streaming

- Add strict=False fallback in tokenizer loader for models with extra
  weights (e.g., vision tower params), enabling Qwen3.5 to load via
  mlx-lm as a text-only model
- Fix streaming tool call parsing when both --reasoning-parser and
  --tool-call-parser are enabled (previously mutually exclusive branches)
- Make memory pressure threshold dynamic based on system RAM instead
  of hardcoded 200GB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@otarkhan otarkhan mentioned this pull request Feb 28, 2026
jackzampolin added a commit to jackzampolin/vllm-mlx that referenced this pull request Mar 11, 2026
…reaming fix, dynamic memory

- strict=False fallback for models with extra vision tower weights
- Fix streaming when both --reasoning-parser and --tool-call-parser enabled
- Dynamic memory pressure threshold based on Metal max_recommended_working_set_size

Tested with Qwen3.5-397B-A17B-4bit on 512GB unified memory.

Cherry-picked from: waybarrios#127
Original author: otarkhan

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
seanpianka added a commit to seanpianka/vllm-mlx that referenced this pull request Mar 14, 2026
…ios#127)

- strict=False loading fallback (conflict resolved: kept PR waybarrios#140's version with memory cleanup)
- Dynamic memory pressure threshold (was hardcoded 200GB, now 85% of Metal max)
- Fix reasoning+tool streaming coexistence
@waybarrios
Owner

thanks for this, the tokenizer strict=False fallback and the dynamic memory threshold are solid changes

i made two small adjustments and pushed them to your branch

  1. removed the server.py streaming changes since PR #148 (fix: integrate tool call parsing with reasoning parser in streaming mode) already covers the reasoning+tool parsing fix with a slightly different approach that handles more edge cases (tool calls inside reasoning blocks, transition chunks). keeping both would conflict so i dropped yours in favor of #148

  2. fixed a bug in _load_strict_false where the model config was being discarded. load_model returns (model, config) and the config has the eos_token_id that needs to be passed to load_tokenizer, otherwise the model might not stop generating at the right token. the fix looks like this

model, config = load_model(model_path, strict=False)
tokenizer = load_tokenizer(
    model_path,
    tokenizer_config or {},
    eos_token_ids=config.get("eos_token_id", None),
)

also removed the redundant from pathlib import Path since it's already imported at module level

the rest of the PR (tokenizer fallback + dynamic memory pressure) looks good to merge. let me know if you have any questions

@waybarrios waybarrios merged commit 90eac21 into waybarrios:main Mar 21, 2026
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>