fix(server): stream structured tool calls without parser flags#304
Conversation
janhilgard
left a comment
There was a problem hiding this comment.
Review: fix(server): stream structured tool calls without parser flags
Overall: Good feature that fills a gap, but the approach has some complexity concerns.
What this does
When the server has no explicit --tool-call-parser or --enable-auto-tool-choice flags, but the request includes tools, this PR auto-instantiates an AutoToolParser as a fallback. This means streaming responses get structured tool_calls instead of raw markup leaking as content -- matching the existing non-streaming behavior.
Strengths
-
Correct diagnosis of the asymmetry. Non-streaming requests already fall through to generic
parse_tool_calls()which handles tool markup. Streaming had no equivalent fallback. -
_get_streaming_tool_parseris well-structured. The function checkstool_choice == "none", tries the configured parser first, then falls back toAutoToolParser. This layering is clean. -
_streaming_tool_markup_possibleconsolidates the heuristic. Replacing"<" in contentwith a proper marker tuple check is an improvement. The markers cover XML-based formats, Mistral, Qwen bracket format, MiniMax, and Anthropic invoke format. -
Two good tests. One verifies structured tool calls appear in streaming output; the other verifies plain text is not interfered with.
-
_tool_choice_disabledis a useful extraction. Encapsulating thetool_choice == "none"check prevents duplicated logic.
Potential issues
-
Performance regression for all streaming requests with tools. Previously, the fast path only checked
"<" in text. Now_streaming_tool_markup_possiblecallsany(marker in text for marker in _STREAMING_TOOL_MARKERS)ontool_accumulated_text + contentfor every chunk. For long conversations with tools,tool_accumulated_textcan grow to tens of thousands of characters. TheO(n * m)substring search on every token could add measurable latency. Consider checking only oncontent/delta_text(the new portion) rather than the full accumulated text. -
Auto-parser instantiation happens on every request. In
_get_streaming_tool_parser, when there is no configured parser but tools are present, a newAutoToolParseris created per request. For a hot path under load this could be a concern. Consider caching or reusing the parser instance. -
The marker list is missing the bare bracket pattern from PR #305.
_STREAMING_TOOL_MARKERSincludes[Calling tool:and[TOOL_CALLS]but not bare[func(patterns. If #305 merges first, this will need updating. -
Duplicated marker-check logic across two PRs. PR #305 adds
_STREAMING_BARE_BRACKET_MARKERand_STREAMING_BARE_BRACKET_PARTIAL, while this PR adds_STREAMING_TOOL_MARKERSand_streaming_tool_markup_possible. These two PRs will conflict. It would be good to coordinate which one lands first and have the other rebase.
Minor
- The
global _tool_parser_instanceremoval from both streaming paths is good cleanup, matching the move to_get_streaming_tool_parser. - The fallback end-of-stream check now uses
_streaming_tool_markup_possible(tool_accumulated_text)instead of a hardcoded list of 3 patterns -- this is a nice consolidation.
Solid improvement that closes a real gap in streaming tool call handling. The main concern is the coordination with PR #305 and the performance characteristics of marker scanning on accumulated text.
When tools are present but no explicit parser flags (--enable-auto-tool-choice, --tool-call-parser) are configured, non-streaming chat completions already fall back to generic tool parsing. Streaming skipped this fallback entirely and leaked raw tool markup as content. Changes: - Add _get_streaming_tool_parser() that mirrors the non-streaming fallback: use the configured parser when auto tool choice is on, otherwise instantiate the generic "auto" parser when tools are present. - Replace the old "<" in text heuristic with _streaming_tool_markup_possible(), which checks for known tool call start markers across model families. - Extract _tool_choice_disabled() to centralise the tool_choice=="none" check. - Add regression tests for streaming tool calls without parser flags and for plain text streaming with tools present. Fixes waybarrios#107
c487623 to
b888645
Compare
Summary
Why
Issue #107 shows a real mismatch between non-streaming and streaming behavior: non-streaming chat completions already fall back to generic tool parsing when no parser flags are configured, but streaming skipped tool parsing entirely and leaked raw tool markup as content.
This change makes the streaming path match the existing generic non-streaming behavior for the same request shape.
Validation
platform darwin -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /tmp/vllm-mlx-issue107
configfile: pytest.ini (WARNING: ignoring pytest config in pyproject.toml!)
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 42 items / 3 deselected / 39 selected
../../../tmp/vllm-mlx-issue107/tests/test_server.py .................... [ 51%]
................... [100%]
======================= 39 passed, 3 deselected in 2.51s =======================