
[Harmony] Fix analysis-channel tool calls and preserve reasoning across turns#35907

Open
will-deines wants to merge 6 commits into vllm-project:main from will-deines:harmony-analysis-tool-calls
Conversation


@will-deines will-deines commented Mar 3, 2026

Recreated from #35884, which was closed when the fork was temporarily made private.

Motivation

GPT-OSS Harmony models exhibit three behaviors that stock vLLM handles incorrectly, causing silent misrouting of tool calls and loss of reasoning context in multi-turn tool-calling conversations.

Bug 1: Tool calls on the analysis channel are silently misrouted

GPT-OSS models sometimes emit function calls on the analysis channel instead of commentary. The completed-message parser (harmony_to_response_output) only accepted function calls on commentary, so analysis-channel function calls fell through to _parse_mcp_call(), producing incorrect MCP call output items instead of function tool calls.

This was an inconsistency: parser_state_to_response_output() (streaming) and the in-progress parser already accepted function calls on both channels. Only the completed-message path was missing.

Bug 2: Reasoning lost between tool-calling turns

The openai_harmony library defaults to auto_drop_analysis=True when rendering conversations for completion, stripping all analysis messages. vLLM already has its own auto_drop_analysis_messages() that selectively drops prior-turn analysis while preserving current-turn reasoning. The encoder's blanket drop on top of vLLM's selective drop caused double-filtering that destroyed the model's reasoning context between tool-calling turns.

Additionally, the Responses API path had a no-op slice-delete-reappend cycle in serving.py that was relying on the encoder to strip analysis at render time. With the encoder's auto_drop_analysis disabled (Fix 2), the Responses path had zero analysis filtering, causing stale chain-of-thought from completed turns to accumulate across multi-turn conversations.

Bug 3: Embedded function calls in preamble content

The triggered_tags sub-dispatch grammar is fully permissive (all tokens allowed between triggers). The model sometimes outputs a preamble message <|channel|>commentary<|message|> whose content contains the raw function-call channel tokens:

<|channel|>commentary<|message|><|channel|>commentary to=functions.X<|message|>{args}<|end|>

harmony_to_response_output then produces a ResponseOutputMessage with the raw channel tokens as text content, rather than a ResponseFunctionToolCall. This causes tool_choice=required assertions to fail and corrupts the output.

Changes

Fix 1: Accept function calls on analysis channel (harmony.py)

Widened the channel check in harmony_to_response_output() from == "commentary" to in ("commentary", "analysis"), making the completed-message parser consistent with the streaming and in-progress paths.

Fix 2: Disable encoder-side analysis dropping (harmony_utils.py) + add Responses API filtering (serving.py)

Two-part fix:

  1. Pass RenderConversationConfig(auto_drop_analysis=False) to the openai_harmony encoder in render_for_completion(). This prevents the encoder from double-dropping analysis messages that vLLM already selectively filters via auto_drop_analysis_messages().

  2. Replace the no-op slice-delete-reappend cycle in serving.py _construct_input_messages_with_harmony() with a proper call to auto_drop_analysis_messages(). This makes the Responses API path consistent with the Chat Completions path, which already called this function. Without this, disabling the encoder's auto-drop left the Responses path with zero analysis filtering.

Important context: Dropping prior-turn analysis after a final message is intentional per the Harmony spec (confirmed in #35779 discussion). This fix does not change the dropping policy — it prevents double-filtering where the encoder's blanket drop destroys current-turn reasoning the model needs for tool-call context, before vLLM's selective auto_drop_analysis_messages() even runs.
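The double-filtering interplay can be sketched with simplified dict messages (hypothetical stand-ins for the real openai_harmony/vLLM message types; both function names here are illustrative):

```python
# Illustrative sketch of the double-filtering described above, using
# simplified dict messages rather than the real Harmony message objects.

def encoder_blanket_drop(messages):
    # What the encoder's auto_drop_analysis=True does: strip ALL
    # analysis messages, including current-turn reasoning.
    return [m for m in messages if m["channel"] != "analysis"]

def selective_drop(messages):
    # What vLLM's auto_drop_analysis_messages() is described as doing:
    # drop analysis only before the last assistant final message.
    last_final = max(
        (i for i, m in enumerate(messages) if m["channel"] == "final"),
        default=-1,
    )
    return [
        m for i, m in enumerate(messages)
        if not (i < last_final and m["channel"] == "analysis")
    ]

history = [
    {"channel": "analysis", "content": "turn-1 reasoning"},   # stale, safe to drop
    {"channel": "final", "content": "turn-1 answer"},
    {"channel": "analysis", "content": "turn-2 reasoning"},   # current turn, must keep
    {"channel": "commentary", "content": "tool call"},
]

kept = selective_drop(history)        # preserves "turn-2 reasoning"
broken = encoder_blanket_drop(kept)   # the encoder's blanket drop loses it again
```

With `auto_drop_analysis=False`, only the selective pass runs, so the current turn's reasoning survives into the rendered prompt.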

Fix 3: Extract embedded function calls from preamble content (harmony.py)

Add _try_extract_embedded_function_call() to detect when a preamble's content starts with a function-call channel prefix (e.g. <|channel|>commentary to=functions.X<|message|>{args}). When detected, the raw content is re-parsed into a proper ResponseFunctionToolCall instead of being emitted as a text message. Called from _parse_message_no_recipient() before falling back to _parse_final_message().
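A minimal sketch of the detection step, assuming a regex over the raw channel tokens (the pattern and return shape are illustrative, not the exact vLLM implementation):

```python
import re

# Hedged sketch of the embedded-function-call detection; the pattern and
# return type are illustrative, not the exact vLLM implementation.
_EMBEDDED_CALL = re.compile(
    r"^<\|channel\|>commentary to=functions\.(?P<name>[\w.-]+)"
    r"<\|message\|>(?P<args>.*?)(?:<\|end\|>|<\|call\|>)?$",
    re.DOTALL,
)

def try_extract_embedded_function_call(content: str):
    """Return (tool_name, arguments) if the preamble content is really a
    raw function-call channel sequence, else None (fall through to the
    normal final-message parsing)."""
    match = _EMBEDDED_CALL.match(content)
    if match is None:
        return None
    return match.group("name"), match.group("args")
```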

Files Changed

| File | Change |
| --- | --- |
| `vllm/entrypoints/openai/responses/harmony.py` | Widen channel check for function calls (Fix 1); add `_try_extract_embedded_function_call()` (Fix 3); call it from `_parse_message_no_recipient()` |
| `vllm/entrypoints/openai/parser/harmony_utils.py` | Disable encoder `auto_drop_analysis` in `render_for_completion()` (Fix 2, part 1) |
| `vllm/entrypoints/openai/responses/serving.py` | Replace no-op analysis filtering with `auto_drop_analysis_messages()` (Fix 2, part 2) |
| `tests/entrypoints/openai/responses/test_harmony_utils.py` | Add `test_analysis_with_function_recipient_creates_function_call` |
| `tests/entrypoints/openai/parser/test_harmony_utils.py` | Add `TestRenderForCompletion` with `test_preserves_analysis` and `test_preserves_reasoning_across_tool_turns` |

Related Issues / PRs

| # | Title | Status | Relation |
| --- | --- | --- | --- |
| #35779 | Harmony models incorrectly drops prior-turn analysis channel | Open | Primary bug report for reasoning loss — discussion confirmed prior-turn analysis drop is intentional per spec; our fix addresses the encoder-side double-drop, not the policy |
| #35826 | Fix: preserve prior-turn analysis messages | Closed (not merged) | Proposed changing the `auto_drop_analysis_messages()` algorithm; rejected because prior-turn analysis drop is intentional per the Harmony spec. Confirms our approach: the algorithm is fine, but the encoder's blanket `auto_drop_analysis=True` is the problem |
| #32114 | [Bugfix] Fix Harmony preamble visibility in Responses API | Merged | Foundation for Fix 3 — established that preambles are visible `ResponseOutputMessage`s, not hidden reasoning. Our fix extends this: when a preamble's content is an embedded function call, extract it as `ResponseFunctionToolCall` |
| #37433 | [Responses API] tool_choice support for GPT-OSS | Draft (ours) | Downstream dependent — depends on this PR for analysis-channel tool calls and embedded function call extraction |
| #27653 | [RFC] Include past-reasoning for harmony formatting | Open | Proposes preserving reasoning in multi-turn; our fix implements the rendering half |
| #28262 | Responses API incorrect input/output handling | Open | Channel metadata loss in round-trips |
| #35540 | Fix empty channel/recipient in harmony for /v1/responses | Open | Fixes channel/recipient preservation on input; complementary |
| #32713 | [RFC] Unified Parser for reasoning, tool calling | Open | Long-term architecture; these fixes are consistent with it |

Design Decisions

  1. Why widen the channel check instead of fixing the model? The model's behavior of emitting tool calls on analysis is valid per the Harmony protocol — the streaming and in-progress parsers already handle it. The completed-message parser was the only inconsistent path.

  2. Why disable auto_drop_analysis at the encoder instead of removing auto_drop_analysis_messages()? vLLM's auto_drop_analysis_messages() implements the correct selective dropping policy (only prior-turn analysis before a final message), which is intentional per the Harmony spec (confirmed in [Bug]: Harmony models incorrectly drops prior-turn analysis channel in multi-turn conversations #35779 discussion; fix: preserve prior-turn analysis messages in Harmony multi-turn conversations #35826, which proposed changing the algorithm, was closed as not-a-bug). The encoder's blanket auto_drop_analysis=True is the problem — it fires before vLLM's selective drop, destroying current-turn reasoning the model needs for tool-call context. Disabling the encoder's drop preserves vLLM's intentional filtering while preventing the double-drop.

  3. Why re-parse embedded function calls instead of fixing the grammar? The triggered_tags sub-dispatch grammar intentionally allows all tokens between triggers (free-text). Constraining it to prevent embedded channel sequences would require changes to the xgrammar structural tag specification. Re-parsing at the output layer is a surgical fix that handles the model's actual behavior without modifying the grammar contract. This extends the preamble visibility work in [Bugfix] Fix Harmony preamble visibility in Responses API #32114 — preambles are visible messages, but when their content is actually a function call, we should extract it.

Test Plan

  • pytest tests/entrypoints/openai/parser/test_harmony_utils.py -v — 61 passed
  • pytest tests/entrypoints/openai/responses/test_harmony_utils.py -v — 23 passed
  • All 84 tests pass with no regressions
  • Pre-commit checks pass

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses two bugs related to GPT-OSS Harmony model handling. The first fix correctly routes tool calls made on the analysis channel by ensuring they are parsed as ResponseFunctionToolCall objects, making the completed-message parser consistent with the streaming and in-progress paths. The second fix prevents the loss of reasoning context in multi-turn conversations by disabling the openai_harmony encoder's analysis message dropping, allowing vLLM's own selective filtering to function correctly. No high or critical severity security issues were identified during the audit, and these changes improve the robustness and performance of Harmony-based models. The changes are well-implemented, with specific unit tests ensuring correctness and robustness, and the code quality is high.

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 3, 2026
@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch from 3c7c674 to d6013c5 Compare March 4, 2026 20:13
Contributor

@bbrowning bbrowning left a comment


I gave this a once-over and it looks good to me. The fix to consider both the analysis and commentary channels for tool calls matches real-world conditions, and it also aligns with what we already do in the streaming case. And it's safe not to have the openai/harmony library auto-drop analysis messages, because we already take care of that internally; the harmony library was inconsistent in how it applied that auto-dropping.

I also have a coworker who was running some BFCL multi-turn test suites against gpt-oss models and vLLM's Responses API implementation, and they hit the exact bug this fixes, with an unexpected McpCall being returned.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84ce3814c4


@will-deines
Author

Thanks for flagging this @bbrowning and @chatgpt-codex-connector — good catch.

The auto_drop_analysis=False change in render_for_completion disabled the Harmony encoder's built-in analysis stripping so vLLM can handle it explicitly via auto_drop_analysis_messages(). The Chat Completions path already called that function, but the Responses path did not — its slice/reappend block was a no-op that relied on the encoder to strip analysis at render time.

With the encoder no longer stripping, the Responses path had no analysis filtering, so stale chain-of-thought from completed turns would accumulate in multi-turn conversations.

Fix: replaced the no-op block with auto_drop_analysis_messages(prev_msgs). This function finds the last assistant final message and drops all analysis messages before it, while preserving analysis after it (i.e., reasoning from the current in-progress turn that the model needs for context). Both API paths are now consistent.
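A minimal sketch of that behavior, assuming simplified dict messages (the real vLLM function operates on Harmony message objects, so this is a model of the described policy, not the source):

```python
# Hedged sketch of auto_drop_analysis_messages() as described above;
# message shapes are simplified dicts, not the real Harmony types.
def auto_drop_analysis_messages(messages: list[dict]) -> list[dict]:
    # Locate the last assistant final message...
    last_final = max(
        (i for i, m in enumerate(messages)
         if m.get("role") == "assistant" and m.get("channel") == "final"),
        default=-1,
    )
    # ...then drop analysis before it (completed turns) while
    # preserving analysis after it (the current in-progress turn).
    return [
        m for i, m in enumerate(messages)
        if not (i < last_final and m.get("channel") == "analysis")
    ]
```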

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@mergify

mergify bot commented Mar 12, 2026

Hi @will-deines, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

1 similar comment

@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch from 9a5721b to 32554be Compare March 12, 2026 20:23
robinnarsinghranabhat pushed a commit to robinnarsinghranabhat/vllm that referenced this pull request Mar 16, 2026
…ss turns

Backport of PR vllm-project#35907 fixes:
- Widen channel check to accept function calls on analysis channel
- Disable encoder-side auto_drop_analysis to prevent double-filtering
- Replace no-op loop with auto_drop_analysis_messages() in serving.py
@robinnarsinghranabhat

@bbrowning
Comparing row 2 and 3, we can see, with this PR, the accuracy bumps over 3% in BFCL-Multi turn.

We can see vllm responses catching up towards vllm-cc.

While I expect Vllm-CC (row 1 )to be slightly better in a apples to apples comparison as it supports reasoning. I will note that, there is still another error popping while running BFCL-eval in Vllm-Responses (very infrequent); for which I will raise another separate issue.

| Rank | Model | Multi Turn Overall Acc | Base | Miss Func | Miss Param | Long Context |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | openai/gpt-oss-120b (FC) (vLLM Chat Completions) | 49.75% | 61.50% | 54.00% | 48.50% | 35.00% |
| 2 | openai/gpt-oss-120b (FC) (vLLM Responses MCP Call Output Fix) | 44.62% | 62.50% | 36.50% | 48.00% | 31.50% |
| 3 | openai/gpt-oss-120b (FC) (vLLM Responses) | 41.50% | 53.50% | 37.00% | 45.00% | 30.50% |

@will-deines
Author

will-deines commented Mar 17, 2026

@robinnarsinghranabhat Thank you for reporting! I'm interested in your comment "…I expect vLLM-CC (row 1) to be slightly better in an apples-to-apples comparison as it supports reasoning".

Can you help me understand this better?

Also, can you help me understand the other issue you're raising? I have run into a problem where control tokens leak into tool calls, and I have another PR that sanitizes them, which is working for me in production. If that's the issue, you can try out PR 35906.

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 17, 2026
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:

1. EBNF grammar: tool_block now accepts both commentary and analysis
   channels, matching GPT-OSS behavior found in our PR vllm-project#35907.

2. adjust_request: handle both ChatCompletion and Responses API tool
   formats, guard response_format access for ResponsesRequest.

3. Responses API: remove NotImplementedError guard, add adjust_request
   call in _make_request_with_harmony so EBNF grammar flows through
   to sampling params.
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 17, 2026
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:

1. EBNF grammar: tool_block now accepts both commentary and analysis
   channels, matching GPT-OSS behavior found in our PR vllm-project#35907.

2. adjust_request: handle both ChatCompletion and Responses API tool
   formats, guard response_format access for ResponsesRequest.

3. Responses API: remove NotImplementedError guard, add adjust_request
   call in _make_request_with_harmony so EBNF grammar flows through
   to sampling params.
@robinnarsinghranabhat

@robinnarsinghranabhat Thank you for reporting! I'm interested in your comment "…I expect vLLM-CC (row 1) to be slightly better in an apples-to-apples comparison as it supports reasoning"

Can you help me understand this better?

I noticed that for gpt-oss models, vLLM Responses doesn't render the reasoning fields in the input prompts (the analysis channel), while Chat Completions does.

@robinnarsinghranabhat

Also can you help me understand the other issue you're raising? I have run into a problem where control tokens are leaking into tool calls and have another PR that sanitizes them that's working for me in production. If that's the issue, you can try it out PR 35906

To be clear, this lower-frequency error I am talking about is specific to gpt-oss-120b with vLLM Responses; it did not appear with vLLM Chat Completions when I ran the BFCL multi-turn evals.

(APIServer pid=272488)   File "/home/ec2-user/vllm/vllm/entrypoints/openai/responses/harmony.py", line 395, in _parse_message_no_recipient
(APIServer pid=272488)     raise ValueError(f"Unknown channel: {message.channel}")
(APIServer pid=272488) ValueError: Unknown channel: None

However, there is another problem, tool-name contamination, that is highly noticeable with gpt-oss-20b and vLLM-CC specifically.
In fact, it is the reason vLLM-CC's performance drops by 10+% compared to vLLM Responses (checked in both streamable and non-streamable modes).

| Config | Overall | Base | Miss Func | Miss Param | Long Context |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-20b (FC) (vLLM Responses) | 39.50% | 53.50% | 31.00% | 46.00% | 27.50% |
| openai/gpt-oss-20b (FC) (vLLM Chat Completions) | 28.00% | 33.50% | 33.00% | 28.00% | 17.50% |
| Tool-name problem fixed on client side | 42.00% | 53% | 43% | 41% | — |

In the third row, I stripped those <|channel|>commentary tokens client-side.
Nice to see that PR 35906 could potentially solve this.
@bbrowning

@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch 2 times, most recently from 59fb2ef to 9cc9ce2 Compare March 18, 2026 13:29
@bbrowning
Contributor

Traveling so can't take a deep look right now, but we should check for regressions in how we prompt these models with Chat Completions based on those numbers. They are extremely sensitive to the prompting exactly following the expected Harmony formatting.

…ss turns

Two fixes for GPT-OSS Harmony model behavior:

1. Accept function calls on analysis channel in harmony_to_response_output()
   to match the streaming/in-progress parsers that already handle both channels.

2. Disable openai_harmony encoder's auto_drop_analysis to prevent
   double-filtering with vLLM's auto_drop_analysis_messages(), preserving
   reasoning context between tool-calling turns.

Signed-off-by: Will Deines <will@garr.io>
The render_for_completion change (auto_drop_analysis=False) disabled the
Harmony encoder's built-in analysis stripping so that vLLM could handle
it explicitly via auto_drop_analysis_messages(). The Chat Completions
path already called this function, but the Responses path did not — it
had a no-op slice/reappend block that was relying on the encoder to
strip analysis at render time.

With the encoder no longer stripping, the Responses path had zero
analysis filtering, causing stale chain-of-thought from completed turns
to accumulate across multi-turn conversations (bloating prompt tokens
and potentially altering model behavior).

Replace the no-op block with auto_drop_analysis_messages(), which drops
analysis messages from completed turns (before the last assistant final
message) while preserving analysis from the current in-progress turn
that the model needs for context. This makes both API paths consistent.

Signed-off-by: Will Deines <will@garr.io>
Fixes ruff-format pre-commit failure (missing two blank lines
between TestRenderForCompletion and TestResponseInputToHarmonyReasoningItem).

Signed-off-by: Will Deines <will@garr.io>
The triggered_tags sub-dispatch grammar is fully permissive (all tokens
allowed between triggers). The model sometimes outputs a preamble message
<|channel|>commentary<|message|> with the tool call tokens as content:
  <|channel|>commentary<|message|><|channel|>commentary to=functions.X<|message|>{args}<|end|>

harmony_to_response_output then produces a ResponseOutputMessage with
the raw channel tokens as text, failing tool_choice=required assertions.

Add _try_extract_embedded_function_call() to detect this pattern in
_parse_message_no_recipient: if a preamble's content starts with a
function-call channel prefix, re-parse it as ResponseFunctionToolCall.

Signed-off-by: Will Deines <will@garr.io>
@will-deines will-deines force-pushed the harmony-analysis-tool-calls branch from 9cc9ce2 to bdbf625 Compare March 18, 2026 14:06

Labels

frontend gpt-oss Related to GPT-OSS models

Projects

Status: To Triage
