Add streaming tool call parsing support#43

Closed
janhilgard wants to merge 4 commits into waybarrios:main from janhilgard:feature/streaming-tool-call-parsing

Conversation

@janhilgard
Collaborator

Summary

  • Implement streaming tool call detection in stream_chat_completion()
  • Use tool parser's extract_tool_calls_streaming() method when enabled
  • Buffer content during tool_call generation, emit tool_calls chunk on completion
  • Add --enable-auto-tool-choice and --tool-call-parser CLI args to server.py
  • Add reasoning field to ChatCompletionChunkDelta for streaming reasoning content

Problem

When using stream: true, the server was sending raw <tool_call> tags as content instead of proper OpenAI-compatible tool_calls chunks. This broke Cline and other streaming clients expecting structured tool call responses.
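For illustration, the difference on the wire looks roughly like the following. This is a hedged sketch of the two delta payloads; the field values (`call_0`, `get_weather`, the arguments) are hypothetical, not captured server output.

```python
# Before the fix: the raw GLM tag leaks into the `content` field of the
# streamed delta, which structured clients like Cline cannot parse.
broken_delta = {
    "content": "<tool_call>get_weather\n<arg_key>city</arg_key>..."
}

# After the fix: a structured, OpenAI-compatible tool_calls delta.
fixed_delta = {
    "tool_calls": [
        {
            "index": 0,
            "id": "call_0",  # hypothetical id
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"city": "Paris"}',
            },
        }
    ]
}
```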

Solution

The streaming function now:

  1. Detects tool call patterns using the configured parser's extract_tool_calls_streaming() method
  2. Buffers content when inside a tool_call block (returns None during buffering)
  3. Emits proper tool_calls chunk when the tool call is complete
  4. Sets finish_reason: "tool_calls" appropriately
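The four steps above can be sketched as a generator. This is a minimal illustration, not vllm-mlx's actual implementation: the only assumption taken from this PR is that the parser's `extract_tool_calls_streaming()` returns `None` while buffering inside a tool_call block; the function name `stream_deltas`, its signature, and the delta attributes are illustrative.

```python
# Hedged sketch of the streaming loop described above.
def stream_deltas(token_stream, parser):
    """Yield (content, tool_calls, finish_reason) tuples, one per SSE chunk."""
    previous_text = ""
    emitted_tool_calls = False
    for token in token_stream:
        current_text = previous_text + token
        # Parser returns None while buffering inside a <tool_call> block,
        # or a delta object once it has something to emit.
        delta = parser.extract_tool_calls_streaming(
            previous_text, current_text, token
        )
        previous_text = current_text
        if delta is None:
            continue  # still inside a tool_call block: keep buffering
        if delta.tool_calls:
            emitted_tool_calls = True
            yield None, delta.tool_calls, None
        elif delta.content:
            yield delta.content, None, None
    # Close the stream with the appropriate finish_reason.
    yield None, None, "tool_calls" if emitted_tool_calls else "stop"
```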

Test plan

  • Test streaming with tools - tool_calls chunk emitted correctly
  • Test non-streaming with tools - still works
  • Test with GLM-4.7-Flash model and glm47 parser

🤖 Generated with Claude Code

janhilgard and others added 4 commits February 5, 2026 00:27
Adds support for GLM-4.7 and GLM-4.7-Flash tool calling format:
<tool_call>function_name
<arg_key>param</arg_key><arg_value>value</arg_value>
</tool_call>

The parser:
- Extracts function name and arguments from GLM47 XML format
- Removes <think>...</think> tags from output
- Supports streaming tool call detection
- Registered as "glm47" and "glm4" parser names

Usage:
  vllm-mlx serve model --enable-auto-tool-choice --tool-call-parser glm47

Based on vLLM's glm47_moe_tool_parser.py implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
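A non-streaming parse of the GLM-4.7 format shown in the commit message above can be sketched with two regexes. This is an illustrative reconstruction, not the PR's parser code; the regex names and the `parse_glm47_tool_calls` helper are assumptions.

```python
import json
import re

# Matches <tool_call>name\n...</tool_call> blocks in GLM-4.7 output.
TOOL_CALL_RE = re.compile(r"<tool_call>(\w+)\n(.*?)</tool_call>", re.DOTALL)
# Matches the <arg_key>/<arg_value> pairs inside a block.
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", re.DOTALL
)

def parse_glm47_tool_calls(text):
    """Return OpenAI-style tool_call dicts extracted from GLM47 XML output."""
    # Strip <think>...</think> reasoning blocks first, as the parser does.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        args = dict(ARG_RE.findall(body))
        calls.append({
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls
```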
- Implement streaming tool call detection in stream_chat_completion()
- Use tool parser's extract_tool_calls_streaming() method when enabled
- Buffer content during tool_call generation, emit tool_calls chunk on completion
- Add fallback to extract_tool_calls() at stream end for edge cases
- Add --enable-auto-tool-choice and --tool-call-parser CLI args to server.py
- Add reasoning field to ChatCompletionChunkDelta for streaming reasoning content

This enables Cline and other streaming clients to receive proper tool_calls
in SSE format instead of raw <tool_call> tags in content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@enryold

enryold commented Feb 8, 2026

UP

@janhilgard
Collaborator Author

Closing — all changes from this PR were already merged into main via PR #46 (b191aec).

@janhilgard janhilgard closed this Feb 9, 2026