
[Harmony] Fix analysis-channel tool calls and preserve reasoning across turns#35907

Open
will-deines wants to merge 6 commits into vllm-project:main from will-deines:harmony-analysis-tool-calls
Conversation


@will-deines will-deines commented Mar 3, 2026

Recreated from #35884, which was closed when the fork was temporarily made private.

Motivation

GPT-OSS Harmony models exhibit three behaviors that stock vLLM handles incorrectly, causing silent misrouting of tool calls and loss of reasoning context in multi-turn tool-calling conversations.

Bug 1: Tool calls on the analysis channel are silently misrouted

GPT-OSS models sometimes emit function calls on the analysis channel instead of commentary. The completed-message parser (harmony_to_response_output) only accepted function calls on commentary, so analysis-channel function calls fell through to _parse_mcp_call(), producing incorrect MCP call output items instead of function tool calls.

This was an inconsistency: parser_state_to_response_output() (streaming) and the in-progress parser already accepted function calls on both channels. Only the completed-message path was missing.

Bug 2: Reasoning lost between tool-calling turns

The openai_harmony library defaults to auto_drop_analysis=True when rendering conversations for completion, stripping all analysis messages. vLLM already has its own auto_drop_analysis_messages() that selectively drops prior-turn analysis while preserving current-turn reasoning. The encoder's blanket drop on top of vLLM's selective drop caused double-filtering that destroyed the model's reasoning context between tool-calling turns.

Additionally, the Responses API path had a no-op slice-delete-reappend cycle in serving.py that was relying on the encoder to strip analysis at render time. With the encoder's auto_drop_analysis disabled (Fix 2), the Responses path had zero analysis filtering, causing stale chain-of-thought from completed turns to accumulate across multi-turn conversations.

Bug 3: Embedded function calls in preamble content

The triggered_tags sub-dispatch grammar is fully permissive (all tokens allowed between triggers). The model sometimes outputs a preamble message <|channel|>commentary<|message|> whose content contains the raw function-call channel tokens:

<|channel|>commentary<|message|><|channel|>commentary to=functions.X<|message|>{args}<|end|>

harmony_to_response_output then produces a ResponseOutputMessage with the raw channel tokens as text content, rather than a ResponseFunctionToolCall. This causes tool_choice=required assertions to fail and corrupts the output.

Changes

Fix 1: Accept function calls on analysis channel (harmony.py)

Widened the channel check in harmony_to_response_output() from == "commentary" to in ("commentary", "analysis"), making the completed-message parser consistent with the streaming and in-progress paths.

Fix 2: Disable encoder-side analysis dropping (harmony_utils.py) + add Responses API filtering (serving.py)

Two-part fix:

  1. Pass RenderConversationConfig(auto_drop_analysis=False) to the openai_harmony encoder in render_for_completion(). This prevents the encoder from double-dropping analysis messages that vLLM already selectively filters via auto_drop_analysis_messages().

  2. Replace the no-op slice-delete-reappend cycle in serving.py _construct_input_messages_with_harmony() with a proper call to auto_drop_analysis_messages(). This makes the Responses API path consistent with the Chat Completions path, which already called this function. Without this, disabling the encoder's auto-drop left the Responses path with zero analysis filtering.

Important context: Dropping prior-turn analysis after a final message is intentional per the Harmony spec (confirmed in #35779 discussion). This fix does not change the dropping policy — it prevents double-filtering where the encoder's blanket drop destroys current-turn reasoning the model needs for tool-call context, before vLLM's selective auto_drop_analysis_messages() even runs.
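The double-filtering interplay can be sketched with simplified dict messages (hypothetical stand-ins for the real openai_harmony/vLLM message types; both function names here are illustrative):

```python
# Illustrative sketch of the double-filtering described above, using
# simplified dict messages rather than the real Harmony message objects.

def encoder_blanket_drop(messages):
    # What the encoder's auto_drop_analysis=True does: strip ALL
    # analysis messages, including current-turn reasoning.
    return [m for m in messages if m["channel"] != "analysis"]

def selective_drop(messages):
    # What vLLM's auto_drop_analysis_messages() is described as doing:
    # drop analysis only before the last assistant final message.
    last_final = max(
        (i for i, m in enumerate(messages) if m["channel"] == "final"),
        default=-1,
    )
    return [
        m for i, m in enumerate(messages)
        if not (i < last_final and m["channel"] == "analysis")
    ]

history = [
    {"channel": "analysis", "content": "turn-1 reasoning"},   # stale, safe to drop
    {"channel": "final", "content": "turn-1 answer"},
    {"channel": "analysis", "content": "turn-2 reasoning"},   # current turn, must keep
    {"channel": "commentary", "content": "tool call"},
]

kept = selective_drop(history)        # preserves "turn-2 reasoning"
broken = encoder_blanket_drop(kept)   # the encoder's blanket drop loses it again
```

With `auto_drop_analysis=False`, only the selective pass runs, so the current turn's reasoning survives into the rendered prompt.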

Fix 3: Extract embedded function calls from preamble content (harmony.py)

Add _try_extract_embedded_function_call() to detect when a preamble's content starts with a function-call channel prefix (e.g. <|channel|>commentary to=functions.X<|message|>{args}). When detected, the raw content is re-parsed into a proper ResponseFunctionToolCall instead of being emitted as a text message. Called from _parse_message_no_recipient() before falling back to _parse_final_message().
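A minimal sketch of the detection step, assuming a regex over the raw channel tokens (the pattern and return shape are illustrative, not the exact vLLM implementation):

```python
import re

# Hedged sketch of the embedded-function-call detection; the pattern and
# return type are illustrative, not the exact vLLM implementation.
_EMBEDDED_CALL = re.compile(
    r"^<\|channel\|>commentary to=functions\.(?P<name>[\w.-]+)"
    r"<\|message\|>(?P<args>.*?)(?:<\|end\|>|<\|call\|>)?$",
    re.DOTALL,
)

def try_extract_embedded_function_call(content: str):
    """Return (tool_name, arguments) if the preamble content is really a
    raw function-call channel sequence, else None (fall through to the
    normal final-message parsing)."""
    match = _EMBEDDED_CALL.match(content)
    if match is None:
        return None
    return match.group("name"), match.group("args")
```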

Files Changed

| File | Change |
| --- | --- |
| `vllm/entrypoints/openai/responses/harmony.py` | Widen channel check for function calls (Fix 1); add `_try_extract_embedded_function_call()` (Fix 3); call it from `_parse_message_no_recipient()` |
| `vllm/entrypoints/openai/parser/harmony_utils.py` | Disable encoder `auto_drop_analysis` in `render_for_completion()` (Fix 2, part 1) |
| `vllm/entrypoints/openai/responses/serving.py` | Replace no-op analysis filtering with `auto_drop_analysis_messages()` (Fix 2, part 2) |
| `tests/entrypoints/openai/responses/test_harmony_utils.py` | Add `test_analysis_with_function_recipient_creates_function_call` |
| `tests/entrypoints/openai/parser/test_harmony_utils.py` | Add `TestRenderForCompletion` with `test_preserves_analysis` and `test_preserves_reasoning_across_tool_turns` |

Related Issues / PRs

| # | Title | Status | Relation |
| --- | --- | --- | --- |
| #35779 | Harmony models incorrectly drops prior-turn analysis channel | Open | Primary bug report for reasoning loss — discussion confirmed prior-turn analysis drop is intentional per spec; our fix addresses the encoder-side double-drop, not the policy |
| #35826 | Fix: preserve prior-turn analysis messages | Closed (not merged) | Proposed changing the `auto_drop_analysis_messages()` algorithm; rejected because prior-turn analysis drop is intentional per the Harmony spec. Confirms our approach: the algorithm is fine, but the encoder's blanket `auto_drop_analysis=True` is the problem |
| #32114 | [Bugfix] Fix Harmony preamble visibility in Responses API | Merged | Foundation for Fix 3 — established that preambles are visible `ResponseOutputMessage`s, not hidden reasoning. Our fix extends this: when a preamble's content is an embedded function call, extract it as `ResponseFunctionToolCall` |
| #37433 | [Responses API] tool_choice support for GPT-OSS | Draft (ours) | Downstream dependent — depends on this PR for analysis-channel tool calls and embedded function call extraction |
| #27653 | [RFC] Include past-reasoning for harmony formatting | Open | Proposes preserving reasoning in multi-turn; our fix implements the rendering half |
| #28262 | Responses API incorrect input/output handling | Open | Channel metadata loss in round-trips |
| #35540 | Fix empty channel/recipient in harmony for /v1/responses | Open | Fixes channel/recipient preservation on input; complementary |
| #32713 | [RFC] Unified Parser for reasoning, tool calling | Open | Long-term architecture; these fixes are consistent with it |

Design Decisions

  1. Why widen the channel check instead of fixing the model? The model's behavior of emitting tool calls on analysis is valid per the Harmony protocol — the streaming and in-progress parsers already handle it. The completed-message parser was the only inconsistent path.

  2. Why disable auto_drop_analysis at the encoder instead of removing auto_drop_analysis_messages()? vLLM's auto_drop_analysis_messages() implements the correct selective dropping policy (only prior-turn analysis before a final message), which is intentional per the Harmony spec (confirmed in [Bug]: Harmony models incorrectly drops prior-turn analysis channel in multi-turn conversations #35779 discussion; fix: preserve prior-turn analysis messages in Harmony multi-turn conversations #35826, which proposed changing the algorithm, was closed as not-a-bug). The encoder's blanket auto_drop_analysis=True is the problem — it fires before vLLM's selective drop, destroying current-turn reasoning the model needs for tool-call context. Disabling the encoder's drop preserves vLLM's intentional filtering while preventing the double-drop.

  3. Why re-parse embedded function calls instead of fixing the grammar? The triggered_tags sub-dispatch grammar intentionally allows all tokens between triggers (free-text). Constraining it to prevent embedded channel sequences would require changes to the xgrammar structural tag specification. Re-parsing at the output layer is a surgical fix that handles the model's actual behavior without modifying the grammar contract. This extends the preamble visibility work in [Bugfix] Fix Harmony preamble visibility in Responses API #32114 — preambles are visible messages, but when their content is actually a function call, we should extract it.

Test Plan

  • pytest tests/entrypoints/openai/parser/test_harmony_utils.py -v — 61 passed
  • pytest tests/entrypoints/openai/responses/test_harmony_utils.py -v — 23 passed
  • All 84 tests pass with no regressions
  • Pre-commit checks pass

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses two bugs related to GPT-OSS Harmony model handling. The first fix correctly routes tool calls made on the analysis channel by ensuring they are parsed as ResponseFunctionToolCall objects, making the completed-message parser consistent with the streaming and in-progress paths. The second fix prevents the loss of reasoning context in multi-turn conversations by disabling the openai_harmony encoder's analysis message dropping, allowing vLLM's own selective filtering to function correctly. No high or critical severity security issues were identified during the audit, and these changes improve the robustness and performance of Harmony-based models. The changes are well-implemented, with specific unit tests ensuring correctness and robustness, and the code quality is high.

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 3, 2026
@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch from 3c7c674 to d6013c5 Compare March 4, 2026 20:13
Contributor

@bbrowning bbrowning left a comment


I gave this a once-over and it looks good to me. The fix to consider both the analysis and commentary channels for tool calls matches real-world conditions, and it also aligns with what we already do in the streaming case. And it's safe not to have the openai/harmony library auto-drop analysis messages, because we already take care of that internally; the harmony library was inconsistent in how it applied that auto-dropping.

I also have a coworker who was running some BFCL multi-turn test suites against gpt-oss models and vLLM's Responses API implementation, and they hit the exact bug this fixes, with an unexpected McpCall being returned.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84ce3814c4


@will-deines
Author

Thanks for flagging this @bbrowning and @chatgpt-codex-connector — good catch.

The auto_drop_analysis=False change in render_for_completion disabled the Harmony encoder's built-in analysis stripping so vLLM can handle it explicitly via auto_drop_analysis_messages(). The Chat Completions path already called that function, but the Responses path did not — its slice/reappend block was a no-op that relied on the encoder to strip analysis at render time.

With the encoder no longer stripping, the Responses path had no analysis filtering, so stale chain-of-thought from completed turns would accumulate in multi-turn conversations.

Fix: replaced the no-op block with auto_drop_analysis_messages(prev_msgs). This function finds the last assistant final message and drops all analysis messages before it, while preserving analysis after it (i.e., reasoning from the current in-progress turn that the model needs for context). Both API paths are now consistent.
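A minimal sketch of that behavior, assuming simplified dict messages (the real vLLM function operates on Harmony message objects, so this is a model of the described policy, not the source):

```python
# Hedged sketch of auto_drop_analysis_messages() as described above;
# message shapes are simplified dicts, not the real Harmony types.
def auto_drop_analysis_messages(messages: list[dict]) -> list[dict]:
    # Locate the last assistant final message...
    last_final = max(
        (i for i, m in enumerate(messages)
         if m.get("role") == "assistant" and m.get("channel") == "final"),
        default=-1,
    )
    # ...then drop analysis before it (completed turns) while
    # preserving analysis after it (the current in-progress turn).
    return [
        m for i, m in enumerate(messages)
        if not (i < last_final and m.get("channel") == "analysis")
    ]
```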

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@mergify

mergify bot commented Mar 12, 2026

Hi @will-deines, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

1 similar comment

@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch from 9a5721b to 32554be Compare March 12, 2026 20:23
robinnarsinghranabhat pushed a commit to robinnarsinghranabhat/vllm that referenced this pull request Mar 16, 2026
…ss turns

Backport of PR vllm-project#35907 fixes:
- Widen channel check to accept function calls on analysis channel
- Disable encoder-side auto_drop_analysis to prevent double-filtering
- Replace no-op loop with auto_drop_analysis_messages() in serving.py
@robinnarsinghranabhat

@bbrowning
Comparing row 2 and 3, we can see, with this PR, the accuracy bumps over 3% in BFCL-Multi turn.

We can see vllm responses catching up towards vllm-cc.

While I expect Vllm-CC (row 1 )to be slightly better in a apples to apples comparison as it supports reasoning. I will note that, there is still another error popping while running BFCL-eval in Vllm-Responses (very infrequent); for which I will raise another separate issue.

| Rank | Model | Multi Turn Overall Acc | Base | Miss Func | Miss Param | Long Context |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | openai/gpt-oss-120b (FC) (vLLM Chat Completions) | 49.75% | 61.50% | 54.00% | 48.50% | 35.00% |
| 2 | openai/gpt-oss-120b (FC) (vLLM Responses MCP Call Output Fix) | 44.62% | 62.50% | 36.50% | 48.00% | 31.50% |
| 3 | openai/gpt-oss-120b (FC) (vLLM Responses) | 41.50% | 53.50% | 37.00% | 45.00% | 30.50% |

@will-deines
Author

will-deines commented Mar 17, 2026

@robinnarsinghranabhat Thank you for reporting! I'm interested in your comment "…I expect vLLM-CC (row 1) to be slightly better in an apples-to-apples comparison as it supports reasoning".

Can you help me understand this better?

Also, can you help me understand the other issue you're raising? I have run into a problem where control tokens leak into tool calls, and I have another PR that sanitizes them, which is working for me in production. If that's the issue, you can try out PR 35906.

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 17, 2026
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:

1. EBNF grammar: tool_block now accepts both commentary and analysis
   channels, matching GPT-OSS behavior found in our PR vllm-project#35907.

2. adjust_request: handle both ChatCompletion and Responses API tool
   formats, guard response_format access for ResponsesRequest.

3. Responses API: remove NotImplementedError guard, add adjust_request
   call in _make_request_with_harmony so EBNF grammar flows through
   to sampling params.
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 17, 2026
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:

1. EBNF grammar: tool_block now accepts both commentary and analysis
   channels, matching GPT-OSS behavior found in our PR vllm-project#35907.

2. adjust_request: handle both ChatCompletion and Responses API tool
   formats, guard response_format access for ResponsesRequest.

3. Responses API: remove NotImplementedError guard, add adjust_request
   call in _make_request_with_harmony so EBNF grammar flows through
   to sampling params.
@robinnarsinghranabhat

@robinnarsinghranabhat Thank you for reporting! I'm interested in your comment "…I expect vLLM-CC (row 1) to be slightly better in an apples-to-apples comparison as it supports reasoning"

Can you help me understand this better?

I noticed that for gpt-oss models, vLLM Responses doesn't render the reasoning fields in the input prompts (the analysis channel), while Chat Completions does.

@robinnarsinghranabhat

Also can you help me understand the other issue you're raising? I have run into a problem where control tokens are leaking into tool calls and have another PR that sanitizes them that's working for me in production. If that's the issue, you can try it out PR 35906

To be clear, this lower-frequency error I am talking about is specific to gpt-oss-120b with vLLM Responses; it did not appear with vLLM Chat Completions when I ran the BFCL multi-turn evals.

(APIServer pid=272488)   File "/home/ec2-user/vllm/vllm/entrypoints/openai/responses/harmony.py", line 395, in _parse_message_no_recipient
(APIServer pid=272488)     raise ValueError(f"Unknown channel: {message.channel}")
(APIServer pid=272488) ValueError: Unknown channel: None

However, there is another problem, tool-name contamination, that is highly noticeable with gpt-oss-20b and vLLM-CC specifically.
In fact, it is the reason vLLM-CC's performance drops by 10+% compared to vLLM Responses (checked in both streamable and non-streamable modes).

| Config | Overall | Base | Miss Func | Miss Param | Long Context |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-20b (FC) (vLLM Responses) | 39.50% | 53.50% | 31.00% | 46.00% | 27.50% |
| openai/gpt-oss-20b (FC) (vLLM Chat Completions) | 28.00% | 33.50% | 33.00% | 28.00% | 17.50% |
| Tool-name problem fixed on client side | 42.00% | 53% | 43% | 41% | — |

In the third row, I stripped those <|channel|>commentary tokens client-side.
Nice to see that PR 35906 could potentially solve this.
@bbrowning

@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch 2 times, most recently from 59fb2ef to 9cc9ce2 Compare March 18, 2026 13:29
@bbrowning
Contributor

Traveling so can't take a deep look right now, but we should check for regressions in how we prompt these models with Chat Completions based on those numbers. They are extremely sensitive to the prompting exactly following the expected Harmony formatting.

…ss turns

Two fixes for GPT-OSS Harmony model behavior:

1. Accept function calls on analysis channel in harmony_to_response_output()
   to match the streaming/in-progress parsers that already handle both channels.

2. Disable openai_harmony encoder's auto_drop_analysis to prevent
   double-filtering with vLLM's auto_drop_analysis_messages(), preserving
   reasoning context between tool-calling turns.

Signed-off-by: Will Deines <will@garr.io>
The render_for_completion change (auto_drop_analysis=False) disabled the
Harmony encoder's built-in analysis stripping so that vLLM could handle
it explicitly via auto_drop_analysis_messages(). The Chat Completions
path already called this function, but the Responses path did not — it
had a no-op slice/reappend block that was relying on the encoder to
strip analysis at render time.

With the encoder no longer stripping, the Responses path had zero
analysis filtering, causing stale chain-of-thought from completed turns
to accumulate across multi-turn conversations (bloating prompt tokens
and potentially altering model behavior).

Replace the no-op block with auto_drop_analysis_messages(), which drops
analysis messages from completed turns (before the last assistant final
message) while preserving analysis from the current in-progress turn
that the model needs for context. This makes both API paths consistent.

Signed-off-by: Will Deines <will@garr.io>
Fixes ruff-format pre-commit failure (missing two blank lines
between TestRenderForCompletion and TestResponseInputToHarmonyReasoningItem).

Signed-off-by: Will Deines <will@garr.io>
The triggered_tags sub-dispatch grammar is fully permissive (all tokens
allowed between triggers). The model sometimes outputs a preamble message
<|channel|>commentary<|message|> with the tool call tokens as content:
  <|channel|>commentary<|message|><|channel|>commentary to=functions.X<|message|>{args}<|end|>

harmony_to_response_output then produces a ResponseOutputMessage with
the raw channel tokens as text, failing tool_choice=required assertions.

Add _try_extract_embedded_function_call() to detect this pattern in
_parse_message_no_recipient: if a preamble's content starts with a
function-call channel prefix, re-parse it as ResponseFunctionToolCall.

Signed-off-by: Will Deines <will@garr.io>
@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch from 9cc9ce2 to bdbf625 Compare March 18, 2026 14:06

Labels

frontend gpt-oss Related to GPT-OSS models

Projects

Status: To Triage
