[Frontend] Add tool_choice=required support for GPT-OSS Harmony models#33306

Open
gkswns0531 wants to merge 2 commits into vllm-project:main from gkswns0531:feature/gptoss-tool-choice-required-v0.15

Conversation

@gkswns0531
Contributor

@gkswns0531 gkswns0531 commented Jan 29, 2026

Summary

Add tool_choice="required" support for GPT-OSS Harmony models via EBNF grammar, and fix the Harmony render path to actually apply tool parser grammar constraints.

Problem

GPT-OSS models use the Harmony chat format with channel-based output (analysis, commentary, final). Two issues prevented tool_choice="required" from working:

  1. No grammar enforcement mechanism: The standard JSON schema approach (base class adjust_request) constrains output to a JSON array, which is incompatible with Harmony's channel-based format.
  2. Harmony render path bypasses adjust_request(): render_chat() routes GPT-OSS requests through _make_request_with_harmony() instead of _preprocess_chat(), so tool_parser.adjust_request() was never called. Even with a correct grammar implementation, it was never applied.

Without this fix, tool call generation depends entirely on the model's natural behavior (~85% success rate at default temperature).

Solution

EBNF Grammar (commit 1)

Use an EBNF grammar compiled by xgrammar to constrain token-level generation while preserving the Harmony channel structure:

  • Allows analysis blocks (chain-of-thought reasoning)
  • Allows commentary preambles (visible text before tool calls)
  • Requires at least one tool call (commentary to=functions.X)
  • Blocks the final channel entirely (not defined in grammar)

root       ::= non_tool_block* tool_block more_tool*
non_tool_block ::= ("analysis" | "commentary") "<|message|>" content "<|end|>" "<|start|>" "assistant" "<|channel|>"
tool_block ::= "commentary to=" func_name "<|message|>" content "<|end|>" "<|call|>"
more_tool  ::= "<|start|>" "assistant" "<|channel|>" non_tool_block* tool_block
func_name  ::= "functions.get_weather" | "functions.calculate" | ...
content    ::= ([^<] | "<" [^|])*
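
As a rough illustration of how such a grammar could be assembled per request, the sketch below builds the rule set listed above from a list of tool names. The helper name `build_required_tool_grammar` is hypothetical and is not taken from the PR; the actual builder in `openai_tool_parser.py` may differ, and compiling the string with xgrammar is left out.

```python
def build_required_tool_grammar(tool_names: list[str]) -> str:
    """Return an EBNF grammar string mirroring the rules in the PR description.

    Hypothetical helper for illustration only; not the PR's implementation.
    Tool names are assumed to contain no double quotes or backslashes.
    """
    if not tool_names:
        raise ValueError("tool_choice='required' needs at least one tool")
    # One quoted alternative per tool, e.g. "functions.get_weather".
    func_alts = " | ".join(f'"functions.{name}"' for name in tool_names)
    rules = [
        "root ::= non_tool_block* tool_block more_tool*",
        'non_tool_block ::= ("analysis" | "commentary") "<|message|>" content '
        '"<|end|>" "<|start|>" "assistant" "<|channel|>"',
        'tool_block ::= "commentary to=" func_name "<|message|>" content '
        '"<|end|>" "<|call|>"',
        'more_tool ::= "<|start|>" "assistant" "<|channel|>" '
        "non_tool_block* tool_block",
        f"func_name ::= {func_alts}",
        # Any text that never forms the "<|" prefix of a special token.
        'content ::= ([^<] | "<" [^|])*',
    ]
    return "\n".join(rules)


grammar = build_required_tool_grammar(["get_weather", "calculate"])
print(grammar)
```

Because no rule names the final channel, a decoder constrained by this grammar has no path that emits it, which is the "blocked by omission" behaviour described in the list above.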

Harmony Render Path Fix (commit 2)

Add adjust_request() call in the Harmony branch of render_chat() so grammar constraints are actually applied for GPT-OSS models.
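
The control-flow problem can be sketched with stand-in types. Everything below (`ChatRequest`, `StubToolParser`, the `apply_fix` flag, and this simplified `render_chat`) is hypothetical scaffolding, not vLLM code; it only shows why a correct grammar had no effect until the Harmony branch called `adjust_request()`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ChatRequest:
    """Hypothetical stand-in for the real request object."""
    tool_choice: str = "auto"
    grammar: Optional[str] = None


class StubToolParser:
    """Hypothetical stand-in for the tool parser."""

    def adjust_request(self, request: ChatRequest) -> ChatRequest:
        # Attach a grammar only when the caller demands a tool call.
        if request.tool_choice == "required":
            return replace(request, grammar="root ::= ...")
        return request


def render_chat(request: ChatRequest, tool_parser: StubToolParser,
                is_harmony: bool, apply_fix: bool = True) -> ChatRequest:
    if is_harmony:
        # Harmony branch: before the fix, nothing here called adjust_request().
        if apply_fix:
            request = tool_parser.adjust_request(request)
        return request
    # Non-Harmony branch: preprocessing already applies adjust_request().
    return tool_parser.adjust_request(request)


req = ChatRequest(tool_choice="required")
before = render_chat(req, StubToolParser(), is_harmony=True, apply_fix=False)
after = render_chat(req, StubToolParser(), is_harmony=True, apply_fix=True)
print(before.grammar, after.grammar)
```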

Why EBNF instead of LogitsProcessor?

The previous implementation used a custom PatternForcedSequenceLogitsProcessor. This was changed because:

  1. LogitsProcessor refactoring: LogitsProcessor is being refactored for model runner v2 (comment)
  2. Reasoning preservation: EBNF grammar allows analysis/commentary channels, preserving chain-of-thought (concern)
  3. Smaller diff: Leverages existing structured_outputs + xgrammar path, no model runner internals modified

Changes

  • vllm/entrypoints/serve/render/serving.py: Call tool_parser.adjust_request() in the Harmony render path (_make_request_with_harmony branch) so grammar constraints are applied for GPT-OSS models.
  • vllm/tool_parsers/openai_tool_parser.py: Override adjust_request() with EBNF grammar builder for tool_choice=required. Use raw_decode() for JSON extraction to handle trailing structural tokens.
  • vllm/entrypoints/openai/parser/harmony_utils.py: Add error handling in parse_output_into_messages for grammar-constrained token sequences.
  • tests/tool_parsers/test_openai_tool_parser_ebnf.py: Unit tests for EBNF grammar and xgrammar validation.
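
The `raw_decode()` point can be shown with the standard library alone. The helper name `extract_leading_json` and the sample string are assumptions for illustration; the PR's actual extraction code is not reproduced here.

```python
import json


def extract_leading_json(text: str) -> tuple[object, str]:
    """Decode the first JSON value in `text`; return it with the leftover tail."""
    stripped = text.lstrip()  # raw_decode does not skip leading whitespace
    value, end = json.JSONDecoder().raw_decode(stripped)
    return value, stripped[end:]


# Arguments followed by a structural token, as may happen with partial parsing.
raw = '{"city": "Seoul"}<|call|>'
args, tail = extract_leading_json(raw)
print(args, repr(tail))

# json.loads() rejects the same input because of the trailing token.
try:
    json.loads(raw)
    loads_failed = False
except json.JSONDecodeError:
    loads_failed = True
```

`raw_decode()` stops at the end of the first complete JSON value and reports that offset, so trailing tokens can be inspected or discarded instead of failing the whole parse.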

Test Plan

  • Unit tests: xgrammar acceptance/blocking/termination tests pass
  • Live server batch test: 100/100 requests produce tool calls on gpt-oss-120b (default temperature)
  • Before fix: ~85% success rate (grammar was never applied due to render path bug)
  • After fix: 100% success rate

Related Issue: #33966

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for tool_choice="required" in GPT-OSS Harmony models by using bad_words to prevent non-tool-call generations. The implementation appears solid and is well-tested for both streaming and non-streaming scenarios. I've identified a performance regression where the OpenAIToolParser's initialization, which now includes expensive tokenization operations, is executed on every request. I have provided a suggestion to introduce caching to mitigate this performance impact.

@mergify

mergify bot commented Jan 29, 2026

Hi @gkswns0531, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 24310c1 to b2ece57 Compare January 29, 2026 05:46
@gkswns0531 gkswns0531 changed the title from "[Model] Add tool_choice=required support for GPT-OSS Harmony models" to "[Frontend] Add tool_choice=required support for GPT-OSS Harmony models" Jan 29, 2026
@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from f90cd82 to 9ba3d07 Compare January 29, 2026 07:36
@chaunceyjiang chaunceyjiang self-assigned this Jan 29, 2026
@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 9ba3d07 to 2e1374e Compare January 30, 2026 06:59
@mergify mergify bot added the v1 label Jan 30, 2026
@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 826a5d6 to 2e1374e Compare January 30, 2026 07:18
@gkswns0531
Contributor Author

Hello, @chaunceyjiang
Just a friendly reminder about this PR when you have a moment. I'd really appreciate it if you could take a look whenever your schedule allows.
Happy to address any feedback or questions.
Thank you!

@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 2e1374e to a0cbc0a Compare February 3, 2026 06:57
@gkswns0531
Contributor Author

Hi @chaunceyjiang , just a gentle follow-up on this PR. I understand you must be busy, so no rush at all.
If this issue has already been resolved in the meantime, I'll go ahead and close this PR. Otherwise, if there's a preferred alternative approach or any concerns about the implementation, I'd be more than happy to hear your thoughts and make adjustments accordingly.
Thanks again for your time!

@gkswns0531
Contributor Author

Hi @chaunceyjiang , hope you're doing well. Just following up once more on this PR.
I completely understand if you're busy or if this isn't a priority right now. If that's the case, could you let me know so I can plan accordingly? And if this issue has already been addressed elsewhere, I'm happy to close this PR.
Any feedback or a quick update would be greatly appreciated. Thanks so much!

@chaunceyjiang
Collaborator

@gkswns0531 Thanks for your PR — I think this is a useful feature. I’ve just started my Chinese New Year holiday recently.

@qandrew @yeqcharlotte PTAL.

@jonoillar
Contributor

@gkswns0531 skimming through your improvement, I have many questions.

The main one: does it ensure the model can "reason" before sending a tool call?

I.e., can the model produce a long chain of thought before being forced to generate a tool call?

For example, can it create a preamble before generating a tool call, as shown in Preamble:

<|channel|>analysis<|message|>{long chain of thought}<|end|><|start|>assistant<|channel|>commentary<|message|>**Action plan**:
1. Generate an HTML file
2. Generate a JavaScript for the Node.js server
3. Start the server
---
Will start executing the plan step by step<|end|><|start|>assistant<|channel|>commentary to=functions.generate_file<|constrain|>json<|message|>{"template": "basic_html", "path": "index.html"}<|call|>

As far as I understand, your feature works like this: as soon as the model outputs <|end|><|start|>assistant<|channel|>, it forces a tool call.

But what if the model, in its chain of thought, wants to create several <|channel|>analysis messages in a row?

Maybe we should give it the freedom to do so, and enforce the tool call only after a given number of analysis messages.

Also 1: you state "This approach is more robust than the previous bad_words blocking approach, which could not guarantee 100% blocking (edge cases like " final", " finally" tokens could slip through)."

What is this approach, and did someone implement it? My naive understanding is that with this approach, we only prevent the <|channel|>final token from appearing.

Also 2: After you force the tool calling path (i.e. force the model to generate commentary to=), what ensures that the generated tool call respects the schema you provided? An idea that could be implemented in a follow-up PR:

  • whenever the model generates commentary to=, enforce a guided decoding (with xgrammar or others)
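
To make the format in the example above concrete, here is a small sketch that splits a Harmony-style completion into messages and checks whether it contains a tool call. It is a regex approximation written for this discussion, not the parser vLLM uses; the names `HarmonyMessage`, `parse_harmony`, and `has_tool_call` are hypothetical, and the sample string is a shortened version of the preamble example.

```python
import re
from typing import NamedTuple, Optional


class HarmonyMessage(NamedTuple):
    channel: str
    recipient: Optional[str]
    content: str
    stop: str


# Header, optional <|constrain|> marker, message body, then a stop token.
_MESSAGE_RE = re.compile(
    r"<\|channel\|>(?P<header>.*?)"
    r"(?:<\|constrain\|>\w+)?"
    r"<\|message\|>(?P<content>.*?)"
    r"(?P<stop><\|end\|>|<\|call\|>|<\|return\|>)",
    re.DOTALL,
)


def parse_harmony(text: str) -> list[HarmonyMessage]:
    messages = []
    for m in _MESSAGE_RE.finditer(text):
        parts = m.group("header").split()
        channel = parts[0] if parts else ""
        # A recipient, if present, appears as "to=<name>" after the channel.
        recipient = next((p[3:] for p in parts[1:] if p.startswith("to=")), None)
        messages.append(
            HarmonyMessage(channel, recipient, m.group("content"), m.group("stop"))
        )
    return messages


def has_tool_call(messages: list[HarmonyMessage]) -> bool:
    return any(
        m.recipient is not None
        and m.recipient.startswith("functions.")
        and m.stop == "<|call|>"
        for m in messages
    )


sample = (
    "<|channel|>analysis<|message|>think<|end|>"
    "<|start|>assistant<|channel|>commentary<|message|>**Action plan**<|end|>"
    "<|start|>assistant<|channel|>commentary to=functions.generate_file"
    '<|constrain|>json<|message|>{"path": "index.html"}<|call|>'
)
msgs = parse_harmony(sample)
print([(m.channel, m.recipient) for m in msgs], has_tool_call(msgs))
```

A check like this only confirms that a tool call is present; whether the arguments match the tool's schema is a separate question, as raised in "Also 2".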

@gkswns0531
Contributor Author

gkswns0531 commented Feb 11, 2026

@jonoillar Thanks for the thorough review and great questions!

  1. Multiple analysis rounds:

You're right that the Harmony format supports multiple <|channel|>analysis messages and preambles (plain commentary without to=) in a single response. The current implementation intentionally trades these off for guaranteed tool call generation. I also explored forcing only commentary to allow the model to generate preambles before tool calls. However, the model would often generate a commentary message and then immediately emit <|return|>, ending the response without actually making a tool call. So forcing commentary to= was necessary to ensure the model commits to the tool call path.

In production, many systems depend on tool calls happening reliably at specific points — orchestration agents, pipelines with mandatory function calls, etc. Most models support tool_choice="required" out of the box, but GPT-OSS Harmony models currently lack this enforcement, which was the core motivation for this PR.

tool_choice="auto" is where the model should have full freedom — multiple analysis rounds, preambles, choosing whether to call tools or respond directly, etc. But tool_choice="required" explicitly signals that the caller needs a tool call to happen, no matter what. In that context, I believe it's important to have an option that guarantees 100% tool call generation, even if it means sacrificing some reasoning depth. Users who want the model's full reasoning capability can always use tool_choice="auto".

  2. The bad_words approach:

This was actually my first implementation. I registered the final token in bad_words to prevent the model from entering the final channel. This doesn't block multiple analysis rounds, and it improved the tool call success rate from ~80% to ~97%, but the model could produce variant tokens like " final" (with a leading space) or "finally" that bypassed the filter, introducing 2-3% errors.

The current LogitsProcessor approach solves this with positive enforcement — instead of trying to block all possible variants of unwanted tokens, it forces the exact required sequence (commentary to=) at the channel decision point, guaranteeing 100% tool call generation.
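
The contrast between blocking and positive enforcement can be illustrated with a toy vocabulary. The token ids below are made up and do not correspond to the GPT-OSS tokenizer, and both function names are hypothetical; the sketch only shows why a blocklist must enumerate every variant while an allowlist does not.

```python
# Toy vocabulary: ids are invented for illustration, not real token ids.
TOY_VOCAB = {
    "final": 101,
    " final": 102,
    "finally": 103,
    "commentary": 104,
    "analysis": 105,
}


def allowed_after_blocklist(blocked_strings: list[str]) -> set[str]:
    """Tokens still allowed after blocking the exact ids of `blocked_strings`."""
    blocked_ids = {TOY_VOCAB[s] for s in blocked_strings if s in TOY_VOCAB}
    return {tok for tok, tid in TOY_VOCAB.items() if tid not in blocked_ids}


def allowed_after_allowlist(permitted: list[str]) -> set[str]:
    """Tokens allowed when only an explicit set may be generated."""
    return {tok for tok in TOY_VOCAB if tok in permitted}


# Blocking "final" alone leaves the look-alike tokens available.
leaky = allowed_after_blocklist(["final"])
# Allowing only the wanted channel names excludes every variant at once.
strict = allowed_after_allowlist(["commentary", "analysis"])
print(sorted(leaky), sorted(strict))
```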

  3. Schema enforcement after forcing:

Once the model enters the commentary to= path, GPT-OSS models are trained to reliably generate well-formed tool calls — selecting the correct function name, producing valid JSON arguments, and terminating cleanly with <|call|>. This isn't theoretically enforced at the logits level like xgrammar, but in practice, across hundreds of test runs, the model has consistently produced valid tool calls once it's on the correct channel path.


Thanks! I'm happy to adjust the approach if the community has a different preference — always open to working together toward what benefits the project most.

@jonoillar
Contributor

> However, the model would often generate a commentary message and then immediately emit <|return|>, ending the response without actually making a tool call. So forcing commentary to= was necessary to ensure the model commits to the tool call path.

So the model sometimes returns something like:

<|start|>assistant<|channel|>commentary<|message|>some comment<|return|>

?

> The bad_words approach:
>
> This was actually my first implementation. I registered the final token in bad_words to prevent the model from entering the final channel. This doesn't block multiple analysis rounds, and it improved the tool call success rate from ~80% to ~97%, but the model could produce variant tokens like " final" (with a leading space) or "finally" that bypassed the filter, introducing 2-3% errors.

Could you share the code? Have you tried constraining the model so that it does not emit:

  • <|channel|>final -> so that we don't get into the final channel
  • <|return|> -> so that the model doesn't return
  • <|call|> WITHOUT having generated a tool call before, i.e. without having generated <|channel|>commentary to=

I'm interested in exploring this approach, since I fear that removing the reasoning capabilities would harm the model's intelligence.

> Schema enforcement after forcing:
>
> Once the model enters the commentary to= path, GPT-OSS models are trained to reliably generate well-formed tool calls — selecting the correct function name, producing valid JSON arguments, and terminating cleanly with <|call|>. This isn't theoretically enforced at the logits level like xgrammar, but in practice, across hundreds of test runs, the model has consistently produced valid tool calls once it's on the correct channel path.

I have faced cases where the model doesn't return the correct schema. This might be a little out of scope for this issue though, and could be tackled in a follow-up issue. I'll be happy to take care of it :)

@gkswns0531
Contributor Author

gkswns0531 commented Feb 11, 2026

@jonoillar Thanks!
I think my test cases may have been biased toward some specific scenarios. I share your concern about preserving the model's reasoning capability.

I've put the bad_words-based implementation on my fork for reference:
https://github.com/gkswns0531/vllm/tree/feature/gptoss-tool-choice-bad-words-demo

It blocks the final channel variants ("final" / " final" / "finally") after the <|end|><|start|>assistant<|channel|> prefix, and globally blocks <|return|> as a safety net. The blocked patterns are declared as a simple list in BLOCKED_PATTERNS on OpenAIToolParser. If you'd like to block additional tokens, just add them to that list.

A couple of notes from my testing:

  • I built this on the latest main branch, but had to use --enforce-eager to load the model due to what seems like a local environment issue. Just in case you run into something similar.
  • I occasionally saw cases where the model ends with <|call|> after a preamble without actually producing a tool call.

I'll think more about this direction as well.

@gkswns0531
Contributor Author

Hi @qandrew @yeqcharlotte , hope you're both doing well! Just a friendly follow-up on this PR — it's been a while since @chaunceyjiang kindly tagged you for review.

If this issue has already been resolved or if there's a different direction the team prefers, I'm happy to close this PR. Otherwise, any feedback or guidance would be really appreciated so I can move forward accordingly.

Thanks for your time!

cc @chaunceyjiang

@chaunceyjiang
Collaborator

Thanks~ @gkswns0531, I will take a look at this PR over the next few days.

forced_sequence: list[int]


class PatternForcedSequenceLogitsProcessor(LogitsProcessor):
Collaborator

Oh, by the way, LogitsProcessor is currently undergoing refactoring, since model runner v2 will restructure this part of the code.

This PR might need to wait until model runner v2 is officially enabled by default before we continue moving it forward.

Contributor Author

Thanks! Understood — I'll watch for the model runner v2 progress and update this PR accordingly!

Contributor

@sfeng33 sfeng33 left a comment

I was wondering whether you have considered an alternative approach: structural tags, e.g. defining tags for analysis and commentary to=functions.*

That way the reasoning is preserved and no change is needed in the model runner. Structural tags are already supported in xgrammar/guidance.

@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 0ee72e8 to 1d449a6 Compare March 13, 2026 11:08
@gkswns0531 gkswns0531 requested a review from russellb as a code owner March 13, 2026 11:08
@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 1d449a6 to 0c78b4d Compare March 13, 2026 11:13
Use xgrammar EBNF grammar to enforce tool calls for Harmony models
instead of the previous LogitsProcessor approach. This avoids
dependency on model runner internals (which are being refactored
for v2) and preserves the model's reasoning ability.

The grammar allows analysis/commentary channels while blocking
the final channel entirely, requiring at least one tool call.

Changes:
- Add adjust_request() to OpenAIToolParser with EBNF grammar
  builder for tool_choice=required
- Use raw_decode() for JSON extraction to handle trailing
  structural tokens from partial Harmony parsing
- Add error handling in parse_output_into_messages for
  grammar-constrained token sequences
- Add comprehensive unit tests (59 cases) for EBNF grammar
  acceptance/blocking via xgrammar
- Add E2E test script (15 scenarios) for live server testing

Signed-off-by: hanjun <hanjun.cho@allganize.io>
@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from 0c78b4d to 902c87f Compare March 13, 2026 11:17
@mergify

mergify bot commented Mar 13, 2026

Hi @gkswns0531, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@sfeng33
Contributor

sfeng33 commented Mar 13, 2026

Hey @gkswns0531, thanks for all the updates, this looks quite promising!

After you finish, if you could also run this eval script:

.buildkite/scripts/tool_call/run-bfcl-eval.sh

I'd be interested to know the score before and after this PR on gpt oss 20b and 120b, and the performance impact.
The categories of interest are the multi turn ones, see this example cmd:

BFCL_MODEL="openai/gpt-oss-120b" \
    BFCL_TP_SIZE=4 \
    BFCL_TEST_CATEGORY="multi_turn_base,multi_turn_miss_func,multi_turn_miss_param,multi_turn_long_context" \
    BFCL_OUTPUT_DIR=./bfcl-chat-completions \
    BFCL_API_TYPE=chat_completions \
    bash .buildkite/scripts/tool_call/run-bfcl-eval.sh

cc @chaunceyjiang @bbrowning

The Harmony code path in render_chat() uses _make_request_with_harmony()
instead of _preprocess_chat(), which bypassed tool_parser.adjust_request().
This meant the EBNF grammar for tool_choice="required" was never applied,
leaving generation unconstrained (~85% tool call rate instead of 100%).

Add adjust_request() call after _make_request_with_harmony() so grammar
constraints are applied for GPT-OSS models. Also clean up debug logging
and consolidate redundant test cases.

Verified: 100/100 requests produce tool calls on gpt-oss-120b.

Signed-off-by: hanjun <hanjun.cho@allganize.io>
@gkswns0531 gkswns0531 force-pushed the feature/gptoss-tool-choice-required-v0.15 branch from cec2dd9 to 87f3f8d Compare March 13, 2026 16:09
@bbrowning
Contributor

@sfeng33 I don't think BFCL will measure the impact of this one without modification, as it does not set tool_choice=required in its inference requests. It could verify that there are no side effects in the default tool_choice=auto case, but we may need something else to determine if the guided decoding is implemented properly for required tool choice.

@gkswns0531
Contributor Author

gkswns0531 commented Mar 13, 2026

@bbrowning @sfeng33 — I'm running BFCL now to confirm this PR introduces no regression on the default tool_choice=auto path, and will share results once complete.

For validating the tool_choice=required implementation specifically, I have a separate test suite: https://github.com/gkswns0531/gpt-oss-tool-eval

  • 14 E2E scenarios (e2e_gptoss_tool_choice.py): diverse inputs including multi-turn, Korean/Unicode, HTML content with </>, nested JSON args, no-arg tools, chain reasoning, etc. — all with tool_choice="required". 14/14 passed on gpt-oss-120b.
  • 100-request batch test (run_tool_call_cases.py): sends 100 requests with tool_choice="required" across 9 different prompts and 8 tool schemas, validates each response has valid tool calls with correct JSON arguments. 100/100 succeeded on gpt-oss-120b (without the grammar constraint, the same setup yielded ~85% success rate).
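
The per-response check described in the batch test could look roughly like the sketch below. It assumes the OpenAI-compatible chat completion shape (`choices[0].message.tool_calls[i].function.name` and `.arguments` as a JSON string); the function name `validate_tool_calls` is hypothetical and this is not the code from the linked test suite.

```python
import json


def validate_tool_calls(response: dict, known_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the response passed."""
    choices = response.get("choices") or []
    if not choices:
        return ["no choices in response"]
    tool_calls = (choices[0].get("message") or {}).get("tool_calls") or []
    if not tool_calls:
        return ["no tool calls in response"]

    problems = []
    for i, call in enumerate(tool_calls):
        fn = call.get("function") or {}
        name = fn.get("name")
        if name not in known_tools:
            problems.append(f"call {i}: unknown tool {name!r}")
        try:
            # Arguments arrive as a JSON-encoded string in this response shape.
            args = json.loads(fn.get("arguments") or "")
        except json.JSONDecodeError:
            problems.append(f"call {i}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            problems.append(f"call {i}: arguments are not a JSON object")
    return problems


good = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Seoul"}'}}
]}}]}
text_only = {"choices": [{"message": {"content": "It is sunny.", "tool_calls": None}}]}
print(validate_tool_calls(good, {"get_weather"}))
print(validate_tool_calls(text_only, {"get_weather"}))
```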

@sfeng33
Contributor

sfeng33 commented Mar 13, 2026

@bbrowning @gkswns0531 Yes, a small monkey-patch can be added to the script to make it work with tool choice. Please see this: sfeng33#8

@gkswns0531
Contributor Author

@sfeng33 Thanks for sharing the patch! I'll apply it to my environment and run the BFCL eval on gpt-oss-120b now. Will share the results once it's done.

@gkswns0531
Contributor Author

gkswns0531 commented Mar 14, 2026

@sfeng33 @bbrowning Here are the BFCL multi-turn results:

gpt-oss-20b (H200 SXM - 1 GPU)

| Category | Base auto | PR auto | Base required | PR required |
| --- | --- | --- | --- | --- |
| base | 30.50% | 29.00% | 30.00% | 6.00% |
| miss_func | 26.50% | 26.00% | 20.00% | 4.00% |
| miss_param | 30.00% | 28.00% | 29.00% | 3.00% |
| long_context | 4.00% | 5.00% | 4.50% | 0.00% |
| Overall | 22.75% | 22.00% | 20.88% | 3.25% |

gpt-oss-120b (H200 SXM - 8 GPU, PR only)

| Category | PR auto | PR required |
| --- | --- | --- |
| base | 46.00% | 29.50% |
| miss_func | 35.00% | 14.00% |
| miss_param | 38.50% | 16.50% |
| long_context | 7.50% | 5.00% |
| Overall | 31.75% | 16.25% |

  • No regression in auto mode — Base 22.75% vs PR 22.00%, within noise.
  • Base required ≈ Base auto (20.88% vs 22.75%) — confirms tool_choice=required had no effect before this PR, i.e., no grammar enforcement was happening.
  • PR required scores drop in multi-turn — the grammar now forces tool calls on every turn, including turns where the model should respond with text. Confirms the implementation is working as intended.
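
The Overall rows appear to be the unweighted mean of the four category scores. That is an inference from the numbers, not something stated in the comment, and the short check below recomputes it under that assumption using only the figures reported above.

```python
from statistics import mean

# Category order: base, miss_func, miss_param, long_context (percent).
scores_20b = {
    "Base auto": [30.50, 26.50, 30.00, 4.00],
    "PR auto": [29.00, 26.00, 28.00, 5.00],
    "Base required": [30.00, 20.00, 29.00, 4.50],
    "PR required": [6.00, 4.00, 3.00, 0.00],
}
scores_120b = {
    "PR auto": [46.00, 35.00, 38.50, 7.50],
    "PR required": [29.50, 14.00, 16.50, 5.00],
}

overall_20b = {name: mean(vals) for name, vals in scores_20b.items()}
overall_120b = {name: mean(vals) for name, vals in scores_120b.items()}

for label, table in (("gpt-oss-20b", overall_20b), ("gpt-oss-120b", overall_120b)):
    for name, value in table.items():
        print(f"{label} {name}: {value:.2f}%")
```

Under that assumption the recomputed means agree with the reported Overall values to within rounding, including the 20.88% versus 22.75% comparison used to argue that tool_choice=required had no effect before the PR.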

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 17, 2026
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:

1. EBNF grammar: tool_block now accepts both commentary and analysis
   channels, matching GPT-OSS behavior found in our PR vllm-project#35907.

2. adjust_request: handle both ChatCompletion and Responses API tool
   formats, guard response_format access for ResponsesRequest.

3. Responses API: remove NotImplementedError guard, add adjust_request
   call in _make_request_with_harmony so EBNF grammar flows through
   to sampling params.
@gkswns0531
Contributor Author

@sfeng33 Gentle reminder

@gkswns0531
Contributor Author

Hi @sfeng33 @bbrowning, just following up — the BFCL results are posted above. Happy to run additional benchmarks or adjust the approach if needed. Any feedback would be appreciated!

@gkswns0531
Contributor Author

Reminder @sfeng33

5 participants