fix: pass tools to chat template in MLLM path #139
kargarisaac wants to merge 1 commit into waybarrios:main
Conversation
The MLLM code path in SimpleEngine.chat() and stream_chat() did not pass tool definitions to MLXMultimodalLM.chat()/stream_chat(), while the LLM path did. Additionally, MLXMultimodalLM.chat()/stream_chat() did not extract tools from **kwargs to pass to get_chat_template(). This meant any model loaded with --mllm (e.g. Qwen3.5, which requires --mllm due to vision_tower weights) silently dropped tool definitions from the prompt, making --enable-auto-tool-choice --tool-call-parser ineffective.

Fix: pass template_tools in the MLLM branch of SimpleEngine, and extract/forward tools in MLXMultimodalLM's chat template calls.

Tested with Qwen3.5-4B-8bit and Qwen3.5-9B-MLX-8bit; both now produce correct OpenAI-compatible tool_calls with the qwen3_coder parser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
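To make the engine-side change concrete, here is a minimal sketch of the SimpleEngine branch described above, using the names mentioned in this thread (template_tools, mllm_kwargs); the actual structure in vllm_mlx/engine/simple.py may differ.

```python
# Sketch only: illustrates the fix as described in this PR, not the exact code.
class SimpleEngine:
    def __init__(self, model, is_mllm: bool):
        self.model = model
        self.is_mllm = is_mllm

    def chat(self, messages, template_tools=None, **kwargs):
        if self.is_mllm:
            mllm_kwargs = dict(kwargs)         # shallow copy: don't mutate caller's kwargs
            if template_tools:                 # conditional injection: never pass tools=None
                mllm_kwargs["tools"] = template_tools
            return self.model.chat(messages, **mllm_kwargs)
        # LLM branch: already forwarded tools before this fix
        return self.model.chat(messages, tools=template_tools, **kwargs)
```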
Thump604
left a comment
Solid fix. I can see exactly what was broken and why this solves it.
The bug: MLLM path in SimpleEngine.chat()/stream_chat() never passed tools to the model, even though the LLM branch did. MLXMultimodalLM wasn't extracting tools from kwargs to pass to get_chat_template(). Result: silently dropped tool definitions, breaking auto-tool-choice for Qwen3.5 VLM and any other MLLM that supports function calling.
The fix:
- SimpleEngine now builds mllm_kwargs/mllm_stream_kwargs dict and adds tools before calling model.chat()/stream_chat()
- MLXMultimodalLM extracts tools from kwargs and passes them through template_kwargs to get_chat_template()
- Same pattern as the LLM path — consistent (sketched just below)
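A minimal sketch of that model-side extraction, assuming the helper named in this thread (get_chat_template); self.processor and the downstream _generate call are hypothetical stand-ins, and the exact signatures in vllm_mlx/models/mllm.py may differ.

```python
# Sketch only: pop tools out of **kwargs and forward them to the chat
# template, mirroring what the LLM path already did.
def chat(self, messages, **kwargs):
    tools = kwargs.pop("tools", None)
    template_kwargs = {}
    if tools:                                  # avoid forwarding tools=None
        template_kwargs["tools"] = tools
    prompt = get_chat_template(self.processor, messages, **template_kwargs)
    return self._generate(prompt, **kwargs)    # hypothetical downstream call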
Code quality: Clean implementation. Shallow copy of kwargs avoids mutation. Conditional injection (if template_tools/if tools) avoids passing None. Removed noisy debug logging (the before/after message previews) without losing observability.
Test coverage: Verified on two model sizes (4B, 9B), checked prompt_tokens delta, confirmed tool_calls in output and finish_reason. Parser translation checked.
One consideration: this pattern (passing tools through the kwargs chain) now exists in four places: SimpleEngine's chat()/stream_chat() and MLXMultimodalLM's chat()/stream_chat(). If the tools extraction logic needs to change later (e.g., to filter or validate tools), it'll need updates in two files. Not a blocker, just noting for future refactors (a possible shape is sketched below).
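Purely illustrative, not part of this PR: a shared helper like the following could centralize the copy-and-inject step so both files stay in sync.

```python
# Hypothetical helper for a future refactor; name and location are made up.
def inject_tools(kwargs: dict, tools=None) -> dict:
    out = dict(kwargs)        # shallow copy, as this PR does
    if tools:                 # skip None/empty so tools=None never propagates
        out["tools"] = tools
    return out
```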
This unblocks tool calling for MLLM models. Ship it.
@waybarrios, @kargarisaac: status note plus coordination. This PR addresses the gap where MLLM-loaded models (e.g. Qwen3.5, which requires --mllm) silently dropped tool definitions. The PR currently shows CONFLICTING merge status and likely needs a rebase on current main, since the SimpleEngine and MLXMultimodalLM code paths have had other changes since this branch was created. Coordination note: PR #116 (swaylenhayes, "Enable tool calling for MLLM/VLM chat paths") targets the same problem with a broader scope. Last activity Mar 31.
Hey @kargarisaac — good catch on the root cause! The MLLM path silently dropping tools was a real issue. This is now fixed in main. Same fix as PR #116 (which also addressed this); both are superseded by the production backport. This PR has merge conflicts with current main.
Closing as superseded by the current MLLM tool-calling path in main. Jan's read is the right one here: the underlying tools-propagation bug is already fixed via the later mainline work, so I don't think reviving this conflicting branch is the right use of review bandwidth.
Summary
- Any model loaded with --mllm (e.g., Qwen3.5, which requires it due to vision_tower weights) silently dropped tool definitions from the prompt, making --enable-auto-tool-choice --tool-call-parser ineffective for MLLM models
- SimpleEngine.chat()/stream_chat() MLLM branches did not pass template_tools to the model (only the LLM branch did), and MLXMultimodalLM.chat()/stream_chat() did not extract tools from **kwargs to forward to get_chat_template()
- Fix: pass template_tools in the MLLM branch of SimpleEngine, and extract/forward tools in MLXMultimodalLM's chat template calls
Files Changed
- vllm_mlx/engine/simple.py — Pass template_tools to MLLM model in chat() and stream_chat()
- vllm_mlx/models/mllm.py — Extract tools from **kwargs and pass to get_chat_template() in chat() and stream_chat()
Test plan
- Qwen3.5-4B-8bit — single tool call works, multi-tool parallel calls work
- Qwen3.5-9B-MLX-8bit — single and multi-tool calls work
- prompt_tokens increases when tools are provided (284 with tools vs 29 without)
- finish_reason: "tool_calls" and proper OpenAI wire format output
- qwen3_coder parser correctly translates <function=name><parameter=key>value</parameter></function> to OpenAI tool_calls JSON
Reproduction
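An illustrative reproduction sketch (not from the PR): it assumes a server already running with the flags named in this thread (--mllm --enable-auto-tool-choice --tool-call-parser qwen3_coder) and serving an OpenAI-compatible API on localhost:8000; the base URL, port, served model name, and the weather tool schema are assumptions for the example.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Standard OpenAI tool schema; any tool definition works the same way.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen3.5-4B-8bit",  # served model id may differ on your setup
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# Before this fix, MLLM-loaded models never saw the tool definitions, so the
# reply was plain text; after it, the parser should yield structured calls:
print(resp.choices[0].finish_reason)            # expected: "tool_calls"
print(resp.choices[0].message.tool_calls)
```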