Add streaming tool call parsing support#43

Closed
janhilgard wants to merge 4 commits into waybarrios:main from janhilgard:feature/streaming-tool-call-parsing

Conversation

@janhilgard
Collaborator

Summary

  • Implement streaming tool call detection in stream_chat_completion()
  • Use tool parser's extract_tool_calls_streaming() method when enabled
  • Buffer content during tool_call generation, emit tool_calls chunk on completion
  • Add --enable-auto-tool-choice and --tool-call-parser CLI args to server.py
  • Add reasoning field to ChatCompletionChunkDelta for streaming reasoning content

Problem

When using stream: true, the server was sending raw <tool_call> tags as content instead of proper OpenAI-compatible tool_calls chunks. This broke Cline and other streaming clients expecting structured tool call responses.
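For illustration, the difference on the wire looks roughly like the following. This is a hedged sketch of the two delta payloads; the field values (`call_0`, `get_weather`, the arguments) are hypothetical, not captured server output.

```python
# Before the fix: the raw GLM tag leaks into the `content` field of the
# streamed delta, which structured clients like Cline cannot parse.
broken_delta = {
    "content": "<tool_call>get_weather\n<arg_key>city</arg_key>..."
}

# After the fix: a structured, OpenAI-compatible tool_calls delta.
fixed_delta = {
    "tool_calls": [
        {
            "index": 0,
            "id": "call_0",  # hypothetical id
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"city": "Paris"}',
            },
        }
    ]
}
```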

Solution

The streaming function now:

  1. Detects tool call patterns using the configured parser's extract_tool_calls_streaming() method
  2. Buffers content when inside a tool_call block (returns None during buffering)
  3. Emits proper tool_calls chunk when the tool call is complete
  4. Sets finish_reason: "tool_calls" appropriately
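The four steps above can be sketched as a generator. This is a minimal illustration, not vllm-mlx's actual implementation: the only assumption taken from this PR is that the parser's `extract_tool_calls_streaming()` returns `None` while buffering inside a tool_call block; the function name `stream_deltas`, its signature, and the delta attributes are illustrative.

```python
# Hedged sketch of the streaming loop described above.
def stream_deltas(token_stream, parser):
    """Yield (content, tool_calls, finish_reason) tuples, one per SSE chunk."""
    previous_text = ""
    emitted_tool_calls = False
    for token in token_stream:
        current_text = previous_text + token
        # Parser returns None while buffering inside a <tool_call> block,
        # or a delta object once it has something to emit.
        delta = parser.extract_tool_calls_streaming(
            previous_text, current_text, token
        )
        previous_text = current_text
        if delta is None:
            continue  # still inside a tool_call block: keep buffering
        if delta.tool_calls:
            emitted_tool_calls = True
            yield None, delta.tool_calls, None
        elif delta.content:
            yield delta.content, None, None
    # Close the stream with the appropriate finish_reason.
    yield None, None, "tool_calls" if emitted_tool_calls else "stop"
```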

Test plan

  • Test streaming with tools - tool_calls chunk emitted correctly
  • Test non-streaming with tools - still works
  • Test with GLM-4.7-Flash model and glm47 parser

🤖 Generated with Claude Code

janhilgard and others added 4 commits February 5, 2026 00:27
Adds support for GLM-4.7 and GLM-4.7-Flash tool calling format:
<tool_call>function_name
<arg_key>param</arg_key><arg_value>value</arg_value>
</tool_call>

The parser:
- Extracts function name and arguments from GLM47 XML format
- Removes <think>...</think> tags from output
- Supports streaming tool call detection
- Registered as "glm47" and "glm4" parser names

Usage:
  vllm-mlx serve model --enable-auto-tool-choice --tool-call-parser glm47

Based on vLLM's glm47_moe_tool_parser.py implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
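A non-streaming parse of the GLM-4.7 format shown in the commit message above can be sketched with two regexes. This is an illustrative reconstruction, not the PR's parser code; the regex names and the `parse_glm47_tool_calls` helper are assumptions.

```python
import json
import re

# Matches <tool_call>name\n...</tool_call> blocks in GLM-4.7 output.
TOOL_CALL_RE = re.compile(r"<tool_call>(\w+)\n(.*?)</tool_call>", re.DOTALL)
# Matches the <arg_key>/<arg_value> pairs inside a block.
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", re.DOTALL
)

def parse_glm47_tool_calls(text):
    """Return OpenAI-style tool_call dicts extracted from GLM47 XML output."""
    # Strip <think>...</think> reasoning blocks first, as the parser does.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        args = dict(ARG_RE.findall(body))
        calls.append({
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls
```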
- Implement streaming tool call detection in stream_chat_completion()
- Use tool parser's extract_tool_calls_streaming() method when enabled
- Buffer content during tool_call generation, emit tool_calls chunk on completion
- Add fallback to extract_tool_calls() at stream end for edge cases
- Add --enable-auto-tool-choice and --tool-call-parser CLI args to server.py
- Add reasoning field to ChatCompletionChunkDelta for streaming reasoning content

This enables Cline and other streaming clients to receive proper tool_calls
in SSE format instead of raw <tool_call> tags in content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@enryold

enryold commented Feb 8, 2026

UP

@janhilgard
Collaborator Author

Closing — all changes from this PR were already merged into main via PR #46 (b191aec).

@janhilgard janhilgard closed this Feb 9, 2026