fix: enable tool call parsing in streaming when reasoning parser is active by Thump604 · Pull Request #163 · waybarrios/vllm-mlx

Thump604 · 2026-03-16T16:16:20Z

Problem

When both --reasoning-parser and --tool-call-parser are configured, streaming tool calls are completely broken. The reasoning parser branch in stream_chat_completion() yields chunks directly without ever passing content through the tool parser.

Symptoms:

tool_calls is null in every SSE chunk
Tool call XML (<tool_call>, <function=...>) appears as raw text in content
finish_reason is "stop" instead of "tool_calls"
Non-streaming works correctly
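
A quick way to confirm these symptoms from the client side is to scan the decoded SSE chunks. The sketch below is illustrative only: `diagnose` is a hypothetical helper, and it assumes each chunk has already been parsed into a dict with the OpenAI-style `choices[].delta` layout.

```python
def diagnose(chunks):
    """Summarise the symptoms listed above for a sequence of parsed SSE chunks.

    Assumption: each chunk is a dict shaped like an OpenAI
    chat.completion.chunk, i.e. {"choices": [{"delta": {...}, "finish_reason": ...}]}.
    """
    content = ""
    saw_tool_calls = False
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # Concatenate text content so tags split across chunks are still found.
            content += delta.get("content") or ""
            if delta.get("tool_calls"):
                saw_tool_calls = True
            # Keep the last non-null finish_reason.
            finish_reason = choice.get("finish_reason") or finish_reason
    return {
        "raw_tool_xml_in_content": "<tool_call>" in content,
        "tool_calls_missing": not saw_tool_calls,
        "finish_reason": finish_reason,
    }
```

A stream showing the bug would report raw XML in content, no tool_calls, and a finish_reason of "stop".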

Impact: Every CLI agent using streaming (Claude Code, OpenCode, Roo, etc.) cannot execute tool calls when a reasoning parser is active. This affects all models that use both reasoning and tool calling: Qwen3, Qwen3.5, DeepSeek-R1 with tools, etc.

Root Cause

In stream_chat_completion(), lines 1926-1952:

if _reasoning_parser and delta_text:
    # ... reasoning extraction ...
    yield chunk  # <-- emits directly, NEVER runs tool parser
else:
    # ... tool parsing only runs HERE ...

The reasoning parser branch and the tool parser branch sit on opposite sides of an if/else, so they are mutually exclusive. When _reasoning_parser is set and delta_text is non-empty, the else branch (which contains all of the tool parsing logic) is never reached.

Fix

After reasoning extraction produces delta_msg.content, pass that content through the tool parser before emitting the chunk. This creates a sequential pipeline matching upstream vLLM's architecture:

Phase 1: Reasoning extraction (separate thinking from content)
Phase 2: Tool call parsing on the content portion
Phase 3: Emit chunk with reasoning + structured tool_calls

The new code handles three possible results from the tool parser:

None (inside XML markup): suppress content, still emit reasoning
tool_calls detected: emit structured tool_calls with reasoning
Normal content: pass through as before

The else branch (no reasoning parser) is unchanged.
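
The sequential pipeline and the three parser states can be sketched as follows. This is a simplified illustration, not the code in server.py: `ToyReasoningParser`, `ToyToolParser`, `process_delta`, and `run` are hypothetical names, the toy parsers assume that tags arrive as whole deltas, and the tool call body is JSON rather than the qwen3_xml format, purely to keep the example short.

```python
import json


class ToyReasoningParser:
    """Hypothetical stand-in: separates <think>...</think> text from content."""

    def __init__(self):
        self.in_think = False

    def extract(self, delta_text):
        # Returns (reasoning, content). Assumes tags arrive as whole deltas.
        if delta_text == "<think>":
            self.in_think = True
            return None, None
        if delta_text == "</think>":
            self.in_think = False
            return None, None
        if self.in_think:
            return delta_text, None
        return None, delta_text


class ToyToolParser:
    """Hypothetical stand-in exposing the three states described above."""

    def __init__(self):
        self.buffer = None  # None while outside <tool_call> markup

    def extract(self, content):
        if content == "<tool_call>":
            self.buffer = ""
            return None  # inside markup: suppress
        if self.buffer is not None:
            if content == "</tool_call>":
                call = json.loads(self.buffer)
                self.buffer = None
                return {"tool_calls": [call]}  # complete call detected
            self.buffer += content
            return None  # still inside markup: suppress
        return {"content": content}  # normal content: pass through


def process_delta(delta_text, reasoning_parser, tool_parser):
    # Phase 1: reasoning extraction.
    reasoning, content = reasoning_parser.extract(delta_text)
    chunk = {"reasoning": reasoning, "content": None, "tool_calls": None}
    # Phase 2: tool call parsing on the content portion only.
    if content is not None:
        result = tool_parser.extract(content)
        if result is None:
            pass  # content suppressed; any reasoning is still emitted
        elif "tool_calls" in result:
            chunk["tool_calls"] = result["tool_calls"]
        else:
            chunk["content"] = result["content"]
    # Phase 3: emit the combined chunk.
    return chunk


def run(deltas):
    rp, tp = ToyReasoningParser(), ToyToolParser()
    chunks = [process_delta(d, rp, tp) for d in deltas]
    saw_tool = any(c["tool_calls"] for c in chunks)
    return chunks, ("tool_calls" if saw_tool else "stop")
```

The key design point is that the tool parser only ever sees the content half of the reasoning parser's output, so reasoning text can never be mistaken for tool markup.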

Testing

Tested on Apple Silicon (M2 Ultra, 128GB) with:

Model: Qwen3.5-122B-A10B-5bit
Parsers: --tool-call-parser qwen3_xml --reasoning-parser qwen3
Flags: --enable-auto-tool-choice --mllm

Before (broken)

tool_calls: null (all chunks)
content: "\n\n<tool_call>\n<function=write_file>..."
finish_reason: "stop"

After (fixed)

tool_calls: [{"name": "write_file", "arguments": "{\"path\": \"hello.txt\", ...}"}]
content: null (tool XML consumed by parser)
finish_reason: "tool_calls"
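
For context, a streaming client typically reassembles tool calls from successive deltas before executing them, which is why structured tool_calls matter here. The sketch below is an illustration under assumptions: `accumulate_tool_calls` is a hypothetical helper, and the delta layout (`index`, `function.name`, `function.arguments` fragments) follows the OpenAI streaming convention rather than anything shown in this PR.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed tool call fragments into complete calls.

    Assumption: each delta is a dict whose optional "tool_calls" list holds
    entries like {"index": 0, "function": {"name": ..., "arguments": ...}},
    with "arguments" arriving as JSON string fragments.
    """
    calls = {}
    for delta in deltas:
        for tc in delta.get("tool_calls") or []:
            slot = calls.setdefault(tc.get("index", 0), {"name": "", "arguments": ""})
            fn = tc.get("function") or {}
            slot["name"] += fn.get("name") or ""
            slot["arguments"] += fn.get("arguments") or ""
    # Decode the argument strings only once every fragment has been joined.
    return [
        {"name": c["name"], "arguments": json.loads(c["arguments"])}
        for _, c in sorted(calls.items())
    ]
```

With the broken behaviour described above, this function would return an empty list because no delta ever carries tool_calls.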

Non-streaming regression test: still works correctly.

Fixes #107 (Streaming mode returns tool calls as raw content text instead of structured tool_calls delta)
Related: #93 (Fix streaming tool calls when reasoning parser is active), #118 (feat(server): implement sequential reasoning and tool-call parsing bridge), #127 (Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming), #148 (fix: integrate tool call parsing with reasoning parser in streaming mode). All of these attempt to fix the same issue.
Upstream vLLM uses the same sequential pipeline approach (see its reasoning + tool calls docs).

…ctive

When both --reasoning-parser and --tool-call-parser are configured, streaming tool calls are completely broken. The reasoning parser branch in stream_chat_completion() yields chunks directly without ever passing content through the tool parser. Tool call XML appears as raw text in the content field, tool_calls is null in all chunks, and finish_reason is "stop" instead of "tool_calls".

Root cause: The reasoning parser branch (if _reasoning_parser) and the tool parser branch sit on opposite sides of an if/else, so they are mutually exclusive. Every model that uses both reasoning and tool calling (Qwen3, Qwen3.5, DeepSeek-R1 with tools, etc.) is affected in streaming mode.

Fix: After reasoning extraction produces delta_msg.content, pass that content through the tool parser before emitting the chunk. This creates a sequential pipeline matching upstream vLLM's architecture:

Phase 1: Reasoning extraction (separate thinking from content)
Phase 2: Tool call parsing on content portion
Phase 3: Emit chunk with reasoning + structured tool_calls

Handles three tool parser states:
- None (inside XML markup): suppress content, still emit reasoning
- tool_calls detected: emit structured tool_calls with reasoning
- Normal content: pass through as before

Non-streaming path is unaffected (already works correctly).

Tested with Qwen3.5-122B-A10B + qwen3_xml parser + qwen3 reasoning parser on Apple Silicon (M2 Ultra). Streaming now returns structured tool_calls, finish_reason: "tool_calls", with no raw XML in content.

Fixes waybarrios#107

Thump604 · 2026-03-17T22:14:36Z

Closing — this fix was incorporated upstream in v0.2.6. Tool call parsing in the streaming reasoning path is now working in main.

Thump604 added 2 commits March 16, 2026 11:15

style: black formatting fix for server.py

5fc3de0

Line-length wrap on finish_reason ternary expression.

Thump604 closed this Mar 17, 2026

Thump604 mentioned this pull request Mar 30, 2026

feat: full sampling parameter support (top_k, min_p, presence_penalty, repetition_penalty) #213 (Open, 5 tasks)

Thump604 wants to merge 2 commits into waybarrios:main from Thump604:fix/streaming-reasoning-tool-coexistence
