fix: Responses API streaming tool call support for non-harmony models#36484
fix: Responses API streaming tool call support for non-harmony models#36484giulio-leone wants to merge 2 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a significant fix for streaming tool calls in the Responses API for non-harmony models. The changes correctly address the root causes described, such as handling reasoning and tool parsers in sequence and converting tool call deltas into the proper API events. The addition of new unit tests is also a great step towards ensuring correctness.
However, I've identified a critical issue with the implementation for emitting tool call done events, which appears to be fragile and is not covered by the new tests. Please see my detailed comment.
| if args_delta == "}" and tc_idx < len( | ||
| tool_parser.prev_tool_call_arr | ||
| ): | ||
| tc_info = tool_parser.prev_tool_call_arr[tc_idx] | ||
| for event in emit_function_call_done_events( | ||
| tc_info.get("name", fn_name), | ||
| tc_info.get("arguments", "{}"), | ||
| tool_streaming_state, | ||
| ): | ||
| yield _increment_sequence_number_and_return( | ||
| event | ||
| ) |
There was a problem hiding this comment.
The logic for emitting ResponseFunctionCallArgumentsDoneEvent is fragile and not adequately tested. It relies on the condition args_delta == "}" to detect the end of a tool call's arguments, which has several potential problems:
- Brittleness: The arguments delta may contain more than just the closing brace (e.g., whitespace or be part of a larger string), causing the strict equality check to fail. This makes the logic highly dependent on a very specific and potentially unreliable output format from the parser.
- Incomplete State Handling: If a new tool call starts before the previous one's arguments are closed with a
}, the previous tool call item will be left in anin_progressstate without adoneevent. The logic at lines 1492-1496 advances to a new item but doesn't finalize the previous one. - Insufficient Testing: The new tests in
test_serving_responses.pydo not cover thisdoneevent emission logic. The mocked argument deltas don't trigger this condition, and the mock parser'sprev_tool_call_arris not populated, which would also prevent thedoneevent from being created.
This can lead to clients never receiving a done event for a tool call, which is a critical bug in a streaming API.
A more robust approach should be considered. For example, the parser could explicitly signal the completion of a tool call. A potential improvement could be to have the extract_tool_calls_streaming method return not just DeltaMessage, but a tuple (DeltaMessage, bool) where the boolean indicates if a tool call is complete. This would decouple the streaming logic from the internal parsing details of argument strings.
|
Hi @giulio-leone, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
see #29947 |
|
Thanks @chaunceyjiang — I see #29947 addresses tool calling in the Responses API more comprehensively. Happy to close this in favor of that PR if the maintainers prefer. My fix is narrower (streaming-only text→function_call detection), so it could serve as a quick stopgap or be superseded by #29947. |
|
Hi @giulio-leone, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
f463a9b to
b74a754
Compare
|
Friendly ping — CI is green, tests pass, rebased on latest. Ready for review whenever convenient. Happy to address any feedback. 🙏 |
|
@chaunceyjiang Thanks for the pointer. I see #29947 adds Responses API streaming tool call support as well. My PR focuses specifically on the non-harmonized tool parser path (e.g. Hermes, Llama) where |
|
Hi @giulio-leone, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
eec4d58 to
a2aa3d6
Compare
|
Hi @giulio-leone, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Handle tool-parser streaming events for non-harmony models, make function-call done-event emission robust against delta batching, and keep the serving path ruff-clean for pre-commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
a2aa3d6 to
41f1587
Compare
|
Refreshed the branch as a single signed commit to clear the DCO blocker and keep the latest fixes together. This push preserves the non-harmony Responses API tool-call streaming fix, the more robust function-call done-event emission, and the ruff/pre-commit cleanup in |
|
Follow-up on the remaining docs status: the only failing check left is |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
|
Follow-up for the bot review: I tightened the new streaming regressions so they now populate
That directly covers the previously untested Local validation completed here:
I also attempted targeted pytest execution, but this local macOS environment does not have the full vLLM test runtime provisioned and collection pulls |
qandrew
left a comment
There was a problem hiding this comment.
thanks for putting this together! could you add an E2E test in test_simple.py also?
|
This pull request has merge conflicts that must be resolved before it can be |
|
Closing this PR as the feature has been superseded by #29947 (merged in commit 9fe404e), which implements tool/function call streaming support in the Responses API. The upstream implementation covers the same core fixes:
Thanks @chaunceyjiang for landing this! 🎉 |
Purpose
Fix Responses API streaming tool call support for non-harmony models.
Fixes #36435
Problem
When streaming responses from models that use XML-based tool calling (e.g., Qwen3.5 with
<think>tags), the Responses API does not emit properresponse.function_call_arguments.delta/response.function_call_arguments.doneevents. Instead, raw tool call XML leaks intoresponse.output_text.deltaevents, breaking any client that relies on structured function call streaming. This affects all non-harmony models that combine reasoning and tool calling.Root Causes
Mutually exclusive parser chain:
reasoning_parserandtool_parserare in anif/elifchain in_process_simple_streaming_events. When a model has both (e.g., Qwen3.5 with<think>tags), theelif tool_parser:branch is never reached, so tool calls are silently dropped.Missing event conversion: No code existed to convert
DeltaMessage.tool_callsinto Responses API function call events. The original code had a# todo(kebe7jun) tool call supportcomment placeholder.Changes
is_reasoning_end(), removing the mutual exclusionResponseFunctionCallArgumentsDeltaEvent/ResponseFunctionCallArgumentsDoneEventvia existing helperstool_streaming_state.current_output_indexwith the main output index to keep event ordering consistentResponseTextDeltaEventonce tool calls are detected, preventing XML leakage into text eventsTest Plan
Added two new unit tests in
test_serving_responses.py:test_tool_only_stream_emits_function_call_events— verifies correct event sequence for models with only tool callingtest_reasoning_then_tool_call_stream— verifies correct event sequence when reasoning precedes tool callsAll 18 existing tests in
test_serving_responses.pycontinue to pass, confirming no regressions.