
Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming #127

Merged
waybarrios merged 2 commits into waybarrios:main from otarkhan:feature/qwen3.5-support
Mar 21, 2026
Conversation

@otarkhan
Contributor

Summary

  • Qwen3.5 text-only loading: Adds a strict=False fallback in the tokenizer loader for models whose weight files contain extra parameters (e.g., vision tower weights). When mlx-lm rejects a model with "parameters not in model", the loader retries with strict=False to discard the unrecognized weights and load the model as text-only.
  • Fix streaming tool calls with reasoning parser: The streaming code path had mutually exclusive branches for reasoning parsing and tool call parsing. When both --reasoning-parser and --tool-call-parser were enabled simultaneously, tool calls were never parsed during streaming — the raw XML markup (e.g., <tool_call><function=bash>...</function></tool_call>) leaked through as content. Now the reasoning parser's content output is piped through the tool parser before being emitted.
  • Dynamic memory pressure threshold: The emergency memory pressure threshold was hardcoded at 200GB, which caused constant forced cache clears on high-memory systems running large models. It now dynamically queries Metal's max_recommended_working_set_size and sets the threshold to 85% of that value, scaling correctly with system RAM.
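The streaming fix in the second bullet can be sketched as follows. These toy parsers are illustrative stand-ins, not the actual vllm-mlx parser classes; they only show the pipeline order the fix establishes: the reasoning parser runs first, and its content output is piped through the tool parser instead of being emitted directly.

```python
# Toy sketch of the streaming pipeline order after the fix. Assumes each
# <think>...</think> and <tool_call>...</tool_call> pair arrives within a
# single delta, which real streaming parsers cannot assume.

def split_reasoning(delta: str) -> tuple[str, str]:
    """Separate <think>...</think> text from the remaining content."""
    if "<think>" in delta and "</think>" in delta:
        pre, rest = delta.split("<think>", 1)
        thought, post = rest.split("</think>", 1)
        return thought, pre + post
    return "", delta

def strip_tool_calls(content: str, tool_calls: list) -> str:
    """Extract <tool_call>...</tool_call> markup so it never leaks as content."""
    while "<tool_call>" in content and "</tool_call>" in content:
        pre, rest = content.split("<tool_call>", 1)
        call, post = rest.split("</tool_call>", 1)
        tool_calls.append(call)
        content = pre + post
    return content

def stream_chunk(delta: str, tool_calls: list) -> dict:
    reasoning, content = split_reasoning(delta)
    # the fix: content passes through the tool parser before being emitted,
    # instead of the two parsers living in mutually exclusive branches
    content = strip_tool_calls(content, tool_calls)
    return {"reasoning": reasoning, "content": content}
```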

Why text-only for Qwen3.5?

Qwen3.5 models (e.g., Qwen3.5-397B-A17B) are vision-text models that include a vision tower (SigLIP encoder). Loading them through the MLLM path (mlx-vlm) fails because:

  1. mlx-vlm does not support the qwen3_5_moe architecture yet
  2. Even if it did, the MoE layers produce ArraysCache instead of KVCache, which is incompatible with the MLLM batch generator's continuous batching (slicing, merging, and cache management all assume KVCache)

Loading via mlx-lm with strict=False discards the ~333 vision tower parameters and serves the model as text-only. This works well with continuous batching, prefix caching, reasoning, and tool calling.
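The retry logic described above can be sketched as follows. The real change lives in vllm_mlx/utils/tokenizer.py and calls mlx-lm's loader; here the loader is stubbed so the fallback itself is runnable. The error-message match ("parameters not in model") is the signal the PR keys on.

```python
# Self-contained sketch of the strict=False fallback. load_model here is a
# stub standing in for mlx-lm's loader, which rejects extra weights (such as
# a vision tower) when loading strictly.

def load_model(model_path: str, strict: bool = True):
    if strict:
        raise ValueError("Received parameters not in model: vision_tower...")
    return {"name": model_path, "text_only": True}

def load_with_fallback(model_path: str):
    try:
        return load_model(model_path)  # try a normal strict load first
    except ValueError as e:
        if "parameters not in model" not in str(e):
            raise  # unrelated error: propagate unchanged
        # retry, discarding the unrecognized (e.g. vision tower) weights
        return load_model(model_path, strict=False)
```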

Files changed

  • vllm_mlx/utils/tokenizer.py — Add _load_strict_false() and fallback for "parameters not in model" error
  • vllm_mlx/server.py — Integrate tool call parsing into the reasoning parser streaming branch
  • vllm_mlx/engine_core.py — Dynamic memory pressure threshold based on Metal device info

Testing

Manually tested with:

vllm-mlx serve mlx-community/Qwen3.5-397B-A17B-4bit \
    --host 0.0.0.0 \
    --port 8081 \
    --max-num-seqs 5 \
    --cache-memory-percent 0.2 \
    --enable-prefix-cache \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --continuous-batching \
    --max-tokens 262144 \
    --served-model-name Qwen3.5-397B-A17B

Verified:

  • Model loads successfully with vision tower weights discarded
  • Reasoning (<think> blocks) extracted correctly
  • Tool calls parsed correctly during streaming (multi-turn tool use works)
  • No memory pressure warnings on 512GB unified memory system
  • Prefix cache functional with continuous batching

…ol streaming

- Add strict=False fallback in tokenizer loader for models with extra
  weights (e.g., vision tower params), enabling Qwen3.5 to load via
  mlx-lm as a text-only model
- Fix streaming tool call parsing when both --reasoning-parser and
  --tool-call-parser are enabled (previously mutually exclusive branches)
- Make memory pressure threshold dynamic based on system RAM instead
  of hardcoded 200GB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@otarkhan otarkhan mentioned this pull request Feb 28, 2026
jackzampolin added a commit to jackzampolin/vllm-mlx that referenced this pull request Mar 11, 2026
…reaming fix, dynamic memory

- strict=False fallback for models with extra vision tower weights
- Fix streaming when both --reasoning-parser and --tool-call-parser enabled
- Dynamic memory pressure threshold based on Metal max_recommended_working_set_size

Tested with Qwen3.5-397B-A17B-4bit on 512GB unified memory.

Cherry-picked from: waybarrios#127
Original author: otarkhan

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
seanpianka added a commit to seanpianka/vllm-mlx that referenced this pull request Mar 14, 2026
…ios#127)

- strict=False loading fallback (conflict resolved: kept PR waybarrios#140's version with memory cleanup)
- Dynamic memory pressure threshold (was hardcoded 200GB, now 85% of Metal max)
- Fix reasoning+tool streaming coexistence
@waybarrios
Owner

thanks for this, the tokenizer strict=False fallback and the dynamic memory threshold are solid changes

i made two small adjustments and pushed them to your branch

  1. removed the server.py streaming changes since PR #148 (fix: integrate tool call parsing with reasoning parser in streaming mode) already covers the reasoning+tool parsing fix with a slightly different approach that handles more edge cases (tool calls inside reasoning blocks, transition chunks). keeping both would conflict so i dropped yours in favor of #148

  2. fixed a bug in _load_strict_false where the model config was being discarded. load_model returns (model, config) and the config has the eos_token_id that needs to be passed to load_tokenizer, otherwise the model might not stop generating at the right token. the fix looks like this

model, config = load_model(model_path, strict=False)
tokenizer = load_tokenizer(
    model_path,
    tokenizer_config or {},
    eos_token_ids=config.get("eos_token_id", None),
)

also removed the redundant from pathlib import Path since it's already imported at module level

the rest of the PR (tokenizer fallback + dynamic memory pressure) looks good to merge. let me know if you have any questions

@waybarrios waybarrios merged commit 90eac21 into waybarrios:main Mar 21, 2026
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>