fix: Responses API streaming tool call support for non-harmony models #36484

Closed
giulio-leone wants to merge 2 commits into vllm-project:main from giulio-leone:fix/responses-api-tool-call-xml

@giulio-leone (Contributor) commented Mar 9, 2026

Purpose

Fix Responses API streaming tool call support for non-harmony models.

Fixes #36435

Problem

When streaming responses from models that use XML-based tool calling (e.g., Qwen3.5 with <think> tags), the Responses API does not emit proper response.function_call_arguments.delta / response.function_call_arguments.done events. Instead, raw tool call XML leaks into response.output_text.delta events, breaking any client that relies on structured function call streaming. This affects all non-harmony models that combine reasoning and tool calling.

Root Causes

  1. Mutually exclusive parser chain: reasoning_parser and tool_parser are in an if/elif chain in _process_simple_streaming_events. When a model has both (e.g., Qwen3.5 with <think> tags), the elif tool_parser: branch is never reached, so tool calls are silently dropped.

  2. Missing event conversion: No code existed to convert DeltaMessage.tool_calls into Responses API function call events. The original code had a # todo(kebe7jun) tool call support comment placeholder.
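The first root cause can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the vLLM code base; they only model the control flow described above, where an if/elif chain lets at most one parser run, versus a sequential arrangement that hands over to the tool parser once reasoning has ended.

```python
def route_exclusive(has_reasoning_parser: bool, has_tool_parser: bool) -> list[str]:
    """Hypothetical model of the if/elif chain: at most one parser runs."""
    ran = []
    if has_reasoning_parser:
        ran.append("reasoning")
    elif has_tool_parser:
        # Never reached when a reasoning parser is also configured.
        ran.append("tool")
    return ran


def route_sequential(
    has_reasoning_parser: bool,
    has_tool_parser: bool,
    reasoning_ended: bool,
) -> list[str]:
    """Hypothetical model of the fix: reasoning first, then tool calls."""
    ran = []
    if has_reasoning_parser and not reasoning_ended:
        ran.append("reasoning")
    elif has_tool_parser:
        # Reached once reasoning has ended, or when there is no reasoning parser.
        ran.append("tool")
    return ran


if __name__ == "__main__":
    print(route_exclusive(True, True))
    print(route_sequential(True, True, reasoning_ended=True))
```

With both parsers configured, the exclusive version never selects the tool parser, which matches the silently dropped tool calls described in root cause 1.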

Changes

  • Handle both reasoning and tool calls together — reasoning is processed first, then tool calls are processed after is_reasoning_end(), removing the mutual exclusion
  • Handle tool-call-only models (no reasoning parser) as a separate path
  • Emit ResponseFunctionCallArgumentsDeltaEvent / ResponseFunctionCallArgumentsDoneEvent via existing helpers
  • Close message output items before function call events when content precedes tool calls
  • Sync tool_streaming_state.current_output_index with the main output index to keep event ordering consistent
  • Suppress ResponseTextDeltaEvent once tool calls are detected, preventing XML leakage into text events
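As a rough illustration of the intended event ordering, here is a simplified sketch. The `Delta` class and `to_events` function are hypothetical and much smaller than the real serving code; events are modeled as plain tuples rather than the actual event classes, and only the event type strings follow the Responses API naming.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Delta:
    """Hypothetical, simplified stand-in for a parsed delta message."""
    content: Optional[str] = None
    tool_name: Optional[str] = None
    tool_args: Optional[str] = None


def to_events(deltas: list[Delta]) -> Iterator[tuple[str, int, str]]:
    """Yield (event_type, output_index, payload) tuples."""
    output_index = 0
    message_open = False
    tool_open = False
    tool_seen = False
    args = ""
    for d in deltas:
        if d.tool_name is not None or d.tool_args is not None:
            tool_seen = True
            if message_open:
                # Close the text item before the function call item begins.
                yield ("response.output_item.done", output_index, "message")
                message_open = False
                output_index += 1
            if not tool_open:
                yield ("response.output_item.added", output_index, d.tool_name or "")
                tool_open = True
            if d.tool_args:
                args += d.tool_args
                yield ("response.function_call_arguments.delta", output_index, d.tool_args)
        elif d.content and not tool_seen:
            if not message_open:
                yield ("response.output_item.added", output_index, "message")
                message_open = True
            yield ("response.output_text.delta", output_index, d.content)
        # Content that arrives after a tool call was detected is suppressed.
    if message_open:
        yield ("response.output_item.done", output_index, "message")
    if tool_open:
        yield ("response.function_call_arguments.done", output_index, args)


if __name__ == "__main__":
    stream = [
        Delta(content="Let me check."),
        Delta(tool_name="get_weather", tool_args='{"city":'),
        Delta(tool_args=' "Rome"}'),
        Delta(content="</tool_call>"),
    ]
    for event in to_events(stream):
        print(event)
```

The sketch handles a single tool call only. It shows the message item being closed before the function call item opens, the output index advancing in step, and trailing markup being kept out of text events.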

Test Plan

Added two new unit tests in test_serving_responses.py:

  • test_tool_only_stream_emits_function_call_events — verifies correct event sequence for models with only tool calling
  • test_reasoning_then_tool_call_stream — verifies correct event sequence when reasoning precedes tool calls

All 18 existing tests in test_serving_responses.py continue to pass, confirming no regressions.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant fix for streaming tool calls in the Responses API for non-harmony models. The changes correctly address the root causes described, such as handling reasoning and tool parsers in sequence and converting tool call deltas into the proper API events. The addition of new unit tests is also a great step towards ensuring correctness.

However, I've identified a critical issue with the implementation for emitting tool call done events, which appears to be fragile and is not covered by the new tests. Please see my detailed comment.

Comment on lines +1507 to +1518

    if args_delta == "}" and tc_idx < len(
        tool_parser.prev_tool_call_arr
    ):
        tc_info = tool_parser.prev_tool_call_arr[tc_idx]
        for event in emit_function_call_done_events(
            tc_info.get("name", fn_name),
            tc_info.get("arguments", "{}"),
            tool_streaming_state,
        ):
            yield _increment_sequence_number_and_return(
                event
            )

critical

The logic for emitting ResponseFunctionCallArgumentsDoneEvent is fragile and not adequately tested. It relies on the condition args_delta == "}" to detect the end of a tool call's arguments, which has several potential problems:

  1. Brittleness: The arguments delta may contain more than just the closing brace (for example, trailing whitespace, or the brace may arrive as the tail of a larger chunk), causing the strict equality check to fail. This makes the logic depend on a very specific and potentially unreliable output format from the parser.
  2. Incomplete State Handling: If a new tool call starts before the previous one's arguments are closed with a }, the previous tool call item will be left in an in_progress state without a done event. The logic at lines 1492-1496 advances to a new item but doesn't finalize the previous one.
  3. Insufficient Testing: The new tests in test_serving_responses.py do not cover this done event emission logic. The mocked argument deltas don't trigger this condition, and the mock parser's prev_tool_call_arr is not populated, which would also prevent the done event from being created.

This can lead to clients never receiving a done event for a tool call, which is a critical bug in a streaming API.
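One way to address point 2 is sketched below, under the assumption that each chunk carries a tool call index alongside its arguments delta. The function name and the tuple-based event shape are hypothetical. The idea is to finalize the previous tool call whenever the index changes, and again at end of stream, instead of waiting for a specific closing character.

```python
from typing import Iterator, Optional


def emit_with_finalization(
    chunks: list[tuple[int, str]],
) -> Iterator[tuple[str, int, str]]:
    """Take (tool_call_index, args_delta) pairs; yield (event, index, payload)."""
    current: Optional[int] = None
    buffer = ""
    for idx, args_delta in chunks:
        if current is not None and idx != current:
            # A new tool call started: finalize the previous one first.
            yield ("done", current, buffer)
            buffer = ""
        current = idx
        buffer += args_delta
        yield ("delta", idx, args_delta)
    if current is not None:
        # End of stream: finalize whatever is still open.
        yield ("done", current, buffer)


if __name__ == "__main__":
    for event in emit_with_finalization([(0, '{"a":'), (0, "1"), (1, '{"b":2}')]):
        print(event)
```

In this sketch every tool call that received a delta also receives exactly one done event, even if its arguments were never closed with a brace.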

A more robust approach should be considered. For example, the parser could explicitly signal the completion of a tool call. A potential improvement could be to have the extract_tool_calls_streaming method return not just DeltaMessage, but a tuple (DeltaMessage, bool) where the boolean indicates if a tool call is complete. This would decouple the streaming logic from the internal parsing details of argument strings.
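A toy version of that idea is sketched here. `ToyToolParser` and its `feed` method are hypothetical and are not the vLLM parser interface. As a stand-in completion signal, the sketch checks whether the accumulated arguments parse as a JSON object; a real parser would more likely know completion from its own grammar, such as a closing tool-call tag.

```python
import json
from dataclasses import dataclass


def is_done_brittle(args_delta: str) -> bool:
    """Mirrors the strict equality check under review."""
    return args_delta == "}"


@dataclass
class ToyToolParser:
    """Hypothetical parser that reports completion explicitly."""
    accumulated: str = ""

    def feed(self, args_delta: str) -> tuple[str, bool]:
        """Return the delta plus a flag saying whether the arguments are complete."""
        self.accumulated += args_delta
        try:
            complete = isinstance(json.loads(self.accumulated), dict)
        except json.JSONDecodeError:
            complete = False
        return args_delta, complete


if __name__ == "__main__":
    chunks = ['{"city": ', '"Rome"}\n']  # closing brace batched with other text
    print([is_done_brittle(c) for c in chunks])
    parser = ToyToolParser()
    print([parser.feed(c)[1] for c in chunks])
```

When the closing brace is batched with other characters, the equality check never fires, while the explicit flag still signals completion on the final chunk.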

@mergify bot commented Mar 9, 2026

Hi @giulio-leone, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@chaunceyjiang (Collaborator)

see #29947

@giulio-leone (Contributor Author)

Thanks @chaunceyjiang — I see #29947 addresses tool calling in the Responses API more comprehensively. Happy to close this in favor of that PR if the maintainers prefer. My fix is narrower (streaming-only text→function_call detection), so it could serve as a quick stopgap or be superseded by #29947.

@giulio-leone giulio-leone force-pushed the fix/responses-api-tool-call-xml branch from f463a9b to b74a754 Compare March 9, 2026 15:23
@giulio-leone (Contributor Author)

Friendly ping — CI is green, tests pass, rebased on latest. Ready for review whenever convenient. Happy to address any feedback. 🙏

@giulio-leone (Contributor Author)

@chaunceyjiang Thanks for the pointer. I see #29947 adds Responses API streaming tool call support as well. My PR focuses specifically on the non-harmony tool parser path (e.g. Hermes, Llama), where function.name and function.arguments are missing from delta chunks; the existing code only handled the harmony path. Happy to close this if #29947 covers that case too, or we could coordinate to avoid duplication.

@giulio-leone giulio-leone force-pushed the fix/responses-api-tool-call-xml branch from eec4d58 to a2aa3d6 Compare March 9, 2026 20:52
Handle tool-parser streaming events for non-harmony models, make function-call done-event emission robust against delta batching, and keep the serving path ruff-clean for pre-commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
@giulio-leone giulio-leone force-pushed the fix/responses-api-tool-call-xml branch from a2aa3d6 to 41f1587 Compare March 9, 2026 23:33
@giulio-leone (Contributor Author)

Refreshed the branch as a single signed commit to clear the DCO blocker and keep the latest fixes together. This push preserves the non-harmony Responses API tool-call streaming fix, the more robust function-call done-event emission, and the ruff/pre-commit cleanup in serving.py.

@giulio-leone (Contributor Author)

Follow-up on the remaining docs status: the only failing check left is docs/readthedocs.org:vllm, and the raw RTD log shows it is being cancelled by docs/maybe_skip_pr_build.sh because the PR lacks the documentation or ready label. The refreshed branch itself now has pre-commit, DCO, Summary, and Meta Internal-Only Changes Check green on commit 41f1587725283ff1f9e4f4d5a988159d422d7124.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
@giulio-leone (Contributor Author)

Follow-up for the bot review: I tightened the new streaming regressions so they now populate prev_tool_call_arr and explicitly assert ResponseFunctionCallArgumentsDoneEvent emission in both paths:

  • tool-only streaming
  • reasoning-then-tool-call streaming

That directly covers the previously untested done event path and avoids relying only on delta assertions.
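For readers who have not seen the tests, the shape of that coverage is roughly as follows. This is a self-contained sketch, not the actual test code: `done_payload` is a hypothetical helper that mirrors the lookup in the reviewed snippet, and `SimpleNamespace` stands in for the mocked parser.

```python
from types import SimpleNamespace
from typing import Any, Optional


def done_payload(
    parser: Any, tc_idx: int, fallback_name: str
) -> Optional[tuple[str, Any]]:
    """Hypothetical helper: look up the finished tool call, or None if absent."""
    if tc_idx >= len(parser.prev_tool_call_arr):
        # Nothing recorded for this index, so no done event can be built.
        return None
    info = parser.prev_tool_call_arr[tc_idx]
    return info.get("name", fallback_name), info.get("arguments", "{}")


if __name__ == "__main__":
    empty = SimpleNamespace(prev_tool_call_arr=[])
    populated = SimpleNamespace(
        prev_tool_call_arr=[{"name": "get_weather", "arguments": '{"city": "Rome"}'}]
    )
    print(done_payload(empty, 0, "get_weather"))
    print(done_payload(populated, 0, "get_weather"))
```

With an empty `prev_tool_call_arr` the lookup yields nothing, which is why a mock that leaves it unpopulated cannot exercise the done event path.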

Local validation completed here:

  • python -m py_compile tests/entrypoints/openai/test_serving_responses.py (Python 3.12)
  • git diff --check

I also attempted targeted pytest execution, but this local macOS environment does not have the full vLLM test runtime provisioned and collection pulls torch/compilation dependencies outside the current setup. CI on the PR should now validate the strengthened regression coverage on the real project environment.

@qandrew (Contributor) left a comment

thanks for putting this together! could you add an E2E test in test_simple.py also?

cc @chaunceyjiang

@mergify bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @giulio-leone.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 12, 2026
@giulio-leone (Contributor Author)

Closing this PR as the feature has been superseded by #29947 (merged in commit 9fe404e), which implements tool/function call streaming support in the Responses API.

The upstream implementation covers the same core fixes:

  • Removes the mutual exclusion between reasoning_parser and tool_parser
  • Handles the reasoning-then-tool-call transition
  • Emits proper ResponseFunctionCallArgumentsDeltaEvent / ResponseFunctionCallArgumentsDoneEvent

Thanks @chaunceyjiang for landing this! 🎉

Labels

frontend, gpt-oss (Related to GPT-OSS models), needs-rebase

Projects

Status: Done

3 participants