[Frontend] Add tool_choice=required support for GPT-OSS Harmony models#33306
gkswns0531 wants to merge 2 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds support for tool_choice="required" in GPT-OSS Harmony models by using bad_words to prevent non-tool-call generations. The implementation appears solid and is well-tested for both streaming and non-streaming scenarios. I've identified a performance regression where the OpenAIToolParser's initialization, which now includes expensive tokenization operations, is executed on every request. I have provided a suggestion to introduce caching to mitigate this performance impact.
Hi @gkswns0531, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
Hello, @chaunceyjiang
Hi @chaunceyjiang, just a gentle follow-up on this PR. I understand you must be busy, so no rush at all.
Hi @chaunceyjiang, hope you're doing well. Just following up once more on this PR.
@gkswns0531 Thanks for your PR — I think this is a useful feature. I've just started my Chinese New Year holiday recently. @qandrew @yeqcharlotte PTAL.
@gkswns0531 skimming through your improvement, I have many questions. The main one: does it ensure the model can "reason" before sending a tool call? I.e., can the model have a long chain of thought before being forced to generate a tool call? For example, can it create a preamble before generating a tool call?

As far as I understand, your feature works like this: as soon as the model opens a channel, it is forced down the tool-call path. But what if the model, in its chain of thought, wants to create many analysis messages before calling a tool? Maybe we should leave it the freedom to do so, and enforce the tool calling only after a given number of tokens.

Also 1: you state "This approach is more robust than the previous bad_words blocking approach, which could not guarantee 100% blocking (edge cases like " final", " finally" tokens could slip through)." What is this approach / did someone implement it? My naive understanding is that with this approach, we only prevent the final channel from being opened.

Also 2: After you force the tool-calling path (i.e. force the model to generate a tool call), does the generated call always follow the tool schema?
@jonoillar Thanks for the thorough review and great questions!

You're right that the Harmony format supports multiple analysis messages before a tool call. In production, many systems depend on tool calls happening reliably at specific points — orchestration agents, pipelines with mandatory function calls, etc. Most models support tool_choice="required" for exactly this kind of use case.

This was actually my first implementation. I registered the final-channel token patterns as bad_words, but token-level blocking could not guarantee 100% coverage. The current LogitsProcessor approach solves this with positive enforcement — instead of trying to block all possible variants of unwanted tokens, it forces the exact required sequence (the commentary-channel tool-call prefix).
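For readers unfamiliar with the idea, here is a minimal, self-contained sketch of "positive enforcement". The class name, interface, and token ids are all hypothetical — this is not vLLM's actual LogitsProcessor API or the PR's implementation — but it shows the core trick: once a trigger token appears, the logits are masked so only the next token of the required sequence can be sampled.

```python
# Illustrative sketch only: a forced-sequence logits processor.
# Token ids, the trigger, and the __call__ signature are made up.
NEG_INF = float("-inf")

class PatternForcedSequenceSketch:
    def __init__(self, trigger_token: int, forced_sequence: list[int]):
        self.trigger_token = trigger_token
        self.forced_sequence = forced_sequence

    def __call__(self, past_token_ids: list[int], logits: list[float]) -> list[float]:
        if self.trigger_token not in past_token_ids:
            return logits  # free generation until the trigger appears
        # Count forced tokens already emitted since the last trigger.
        pos = len(past_token_ids) - 1 - past_token_ids[::-1].index(self.trigger_token)
        emitted = len(past_token_ids) - pos - 1
        if emitted >= len(self.forced_sequence):
            return logits  # required sequence fully emitted; release control
        forced = self.forced_sequence[emitted]
        # Mask everything except the next required token.
        return [0.0 if i == forced else NEG_INF for i in range(len(logits))]

proc = PatternForcedSequenceSketch(trigger_token=7, forced_sequence=[1, 2])
out = proc([5, 7], [0.5] * 4)  # trigger just seen: only token 1 survives
```

The appeal over bad_words is that there is nothing to enumerate: instead of listing every token spelling that must be banned, the processor whitelists exactly one continuation at each step.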
Once the model enters the forced tool-call path, it has produced valid tool calls in my testing. Thanks! I'm happy to adjust the approach if the community has a different preference — always open to working together toward what benefits the project most.
So the model sometimes returns something like: ?

Could you share the code? Have you tried to constrain the model not to return:

I'm interested in exploring this approach, since I fear removing the reasoning capabilities would harm the model's intelligence.

I faced cases where the model doesn't return the correct schema. This might be a little out of scope for this issue though, and could be tackled in a follow-up issue. I'll be happy to take care of it :)
@jonoillar Thanks! I've put the bad_words-based implementation on my fork for reference. It blocks the final-channel variants (`final` / ` final` / ` finally`) after the `<|end|><|start|>assistant<|channel|>` prefix, and globally blocks `<|return|>` as a safety net. The blocked patterns are declared as a simple list in BLOCKED_PATTERNS on OpenAIToolParser; if you'd like to block additional tokens, just add them to that list. A couple of notes from my testing:
I'll think more about this direction as well. |
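As a toy illustration of why this kind of blocking can leak — using a made-up four-token vocabulary, not vLLM's actual bad_words machinery — exact token-level matching never bans a distinct token whose text merely resembles a blocked pattern:

```python
# Toy vocabulary: token id -> decoded text. Entirely illustrative.
toy_vocab = {0: "final", 1: " final", 2: " finally", 3: "commentary"}
bad_words = ["final", " final"]  # patterns the parser tries to block

def blocked_token_ids(vocab: dict[int, str], bad: list[str]) -> set[int]:
    # Token-level blocking: ban tokens whose text exactly matches a pattern.
    return {tid for tid, text in vocab.items() if text in bad}

banned = blocked_token_ids(toy_vocab, bad_words)
# Token 2 (" finally") is not in the block list, so it can still be
# sampled — the " finally" edge case discussed above.
```

Every new tokenizer spelling of the unwanted text needs its own entry in the block list, which is exactly why the enumeration approach could not guarantee 100% blocking.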
Hi @qandrew @yeqcharlotte, hope you're both doing well! Just a friendly follow-up on this PR — it's been a while since @chaunceyjiang kindly tagged you for review. If this issue has already been resolved or if there's a different direction the team prefers, I'm happy to close this PR. Otherwise, any feedback or guidance would be really appreciated so I can move forward accordingly. Thanks for your time!
Thanks~ @gkswns0531, I will take a look at this PR over the next few days.
forced_sequence: list[int]

class PatternForcedSequenceLogitsProcessor(LogitsProcessor):
Oh, by the way, LogitsProcessor is currently undergoing refactoring, since model runner v2 will restructure this part of the code.
This PR might need to wait until model runner v2 is officially enabled by default before we continue moving it forward.
Thanks! Understood — I'll watch for the model runner v2 progress and update this PR accordingly!
sfeng33 left a comment
I was wondering if you have considered solving the problem using an alternative approach: structural tags, e.g. define tags for analysis and commentary to=functions.*
That way the reasoning is preserved and no change is needed in the model runner. Structural tags are supported in xgrammar/guidance.
Use xgrammar EBNF grammar to enforce tool calls for Harmony models instead of the previous LogitsProcessor approach. This avoids dependency on model runner internals (which are being refactored for v2) and preserves the model's reasoning ability. The grammar allows analysis/commentary channels while blocking the final channel entirely, requiring at least one tool call.

Changes:
- Add adjust_request() to OpenAIToolParser with EBNF grammar builder for tool_choice=required
- Use raw_decode() for JSON extraction to handle trailing structural tokens from partial Harmony parsing
- Add error handling in parse_output_into_messages for grammar-constrained token sequences
- Add comprehensive unit tests (59 cases) for EBNF grammar acceptance/blocking via xgrammar
- Add E2E test script (15 scenarios) for live server testing

Signed-off-by: hanjun <hanjun.cho@allganize.io>
Hi @gkswns0531, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
Hey @gkswns0531, thanks for all the updates, this looks quite promising! After you finish, could you also run this eval script? I'd be interested to know the score before and after this PR on gpt-oss 20b and 120b, and the performance impact.
The Harmony code path in render_chat() uses _make_request_with_harmony() instead of _preprocess_chat(), which bypassed tool_parser.adjust_request(). This meant the EBNF grammar for tool_choice="required" was never applied, leaving generation unconstrained (~85% tool call rate instead of 100%).

Add adjust_request() call after _make_request_with_harmony() so grammar constraints are applied for GPT-OSS models. Also clean up debug logging and consolidate redundant test cases.

Verified: 100/100 requests produce tool calls on gpt-oss-120b.

Signed-off-by: hanjun <hanjun.cho@allganize.io>
@sfeng33 I don't think BFCL will measure the impact of this one without modification, as it does not set tool_choice.
@bbrowning @sfeng33 — I'm running BFCL now to confirm this PR introduces no regression on the default tool_choice path. For validating the tool_choice="required" path, results to follow.
@bbrowning @gkswns0531 Yes, a small monkey-patch can be added to the script to make it work with tool choice; please see this: sfeng33#8
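The linked patch isn't reproduced in this thread, but the general idea can be sketched as a small wrapper that injects tool_choice="required" into every request an eval harness sends. The function names below are illustrative stand-ins, not BFCL's or the patch's actual code:

```python
import functools

def with_required_tool_choice(create_fn):
    # Wrap a chat-completions-style call so every request defaults to
    # tool_choice="required" without editing the harness itself.
    @functools.wraps(create_fn)
    def wrapper(**kwargs):
        kwargs.setdefault("tool_choice", "required")
        return create_fn(**kwargs)
    return wrapper

# Toy stand-in for something like client.chat.completions.create:
def fake_create(**kwargs):
    return kwargs

patched = with_required_tool_choice(fake_create)
sent = patched(model="gpt-oss-120b", messages=[])
# sent["tool_choice"] is now "required"
```

Using setdefault keeps the patch non-invasive: any scenario that explicitly sets its own tool_choice is left untouched.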
@sfeng33 Thanks for sharing the patch! I'll apply it to my environment and run the BFCL eval on gpt-oss-120b now. Will share the results once it's done.
@sfeng33 @bbrowning Here are the BFCL multi-turn results:

gpt-oss-20b (H200 SXM - 1 GPU)

gpt-oss-120b (H200 SXM - 8 GPU, PR only)
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:
1. EBNF grammar: tool_block now accepts both commentary and analysis channels, matching GPT-OSS behavior found in our PR vllm-project#35907.
2. adjust_request: handle both ChatCompletion and Responses API tool formats, guard response_format access for ResponsesRequest.
3. Responses API: remove NotImplementedError guard, add adjust_request call in _make_request_with_harmony so EBNF grammar flows through to sampling params.
@sfeng33 Gentle reminder
Hi @sfeng33 @bbrowning, just following up — the BFCL results are posted above. Happy to run additional benchmarks or adjust the approach if needed. Any feedback would be appreciated!
Reminder @sfeng33 |
Summary

Add tool_choice="required" support for GPT-OSS Harmony models via EBNF grammar, and fix the Harmony render path to actually apply tool parser grammar constraints.

Problem

GPT-OSS models use the Harmony chat format with channel-based output (analysis, commentary, final). Two issues prevented tool_choice="required" from working:

- The base adjust_request() implementation constrains output to a JSON array, which is incompatible with Harmony's channel-based format.
- The Harmony path never calls adjust_request() — render_chat() routes GPT-OSS requests through _make_request_with_harmony() instead of _preprocess_chat(), so tool_parser.adjust_request() was never called. Even with a correct grammar implementation, it was never applied.

Without this fix, tool call generation depends entirely on the model's natural behavior (~85% success rate at default temperature).
Solution

EBNF Grammar (commit 1)

Use an EBNF grammar compiled by xgrammar to constrain token-level generation while preserving the Harmony channel structure:

- analysis and commentary channels remain available for reasoning
- the final channel is blocked entirely, so at least one tool call is required
- tool calls must target a valid recipient (commentary to=functions.X)

Harmony Render Path Fix (commit 2)

Add adjust_request() call in the Harmony branch of render_chat() so grammar constraints are actually applied for GPT-OSS models.
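The PR's actual grammar text isn't shown in this thread. As a rough, hypothetical sketch of what such a grammar builder could look like — rule names, token strings, and the helper itself are illustrative, and the text/json_args nonterminals are left undefined:

```python
# Illustrative sketch of assembling an EBNF grammar string that allows
# analysis reasoning, omits the final channel entirely, and requires at
# least one tool call to an enumerated function. Not the PR's real grammar.
def build_required_tool_grammar(tool_names: list[str]) -> str:
    fn_alts = " | ".join(f'"{name}"' for name in tool_names)
    return "\n".join([
        # final channel has no rule at all, so it can never be generated;
        # "tool_call+" requires at least one call.
        'root ::= reasoning* tool_call+',
        'reasoning ::= "<|channel|>analysis<|message|>" text "<|end|>"',
        f'fn_name ::= {fn_alts}',
        'tool_call ::= "<|channel|>commentary to=functions." fn_name '
        '" <|constrain|>json<|message|>" json_args "<|call|>"',
    ])

grammar = build_required_tool_grammar(["get_weather", "search"])
```

The key design point is that the constraint is expressed by omission: rather than blocking final-channel tokens, the grammar simply never derives them, while reasoning* keeps chain-of-thought unconstrained before the forced call.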
The previous implementation used a custom
PatternForcedSequenceLogitsProcessor. This was changed because:LogitsProcessoris being refactored for model runner v2 (comment)structured_outputs+ xgrammar path, no model runner internals modifiedChanges
Changes

- vllm/entrypoints/serve/render/serving.py: Call tool_parser.adjust_request() in the Harmony render path (_make_request_with_harmony branch) so grammar constraints are applied for GPT-OSS models.
- vllm/tool_parsers/openai_tool_parser.py: Override adjust_request() with EBNF grammar builder for tool_choice=required. Use raw_decode() for JSON extraction to handle trailing structural tokens.
- vllm/entrypoints/openai/parser/harmony_utils.py: Add error handling in parse_output_into_messages for grammar-constrained token sequences.
- tests/tool_parsers/test_openai_tool_parser_ebnf.py: Unit tests for EBNF grammar and xgrammar validation.
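The raw_decode() trick mentioned above is standard-library behavior and easy to demonstrate; the argument payload below is made up for illustration:

```python
import json

# json.JSONDecoder.raw_decode parses one JSON value from the start of a
# string and returns (object, end_index), so trailing structural tokens
# left over from partial Harmony parsing (e.g. "<|call|>") are ignored
# instead of raising the JSONDecodeError that json.loads would.
decoder = json.JSONDecoder()
raw = '{"location": "Seoul", "unit": "celsius"}<|call|>'
args, end = decoder.raw_decode(raw)
trailing = raw[end:]  # the leftover structural token, "<|call|>"
```

Plain json.loads(raw) would fail here with "Extra data", which is exactly the situation grammar-constrained Harmony output can produce.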
Test Plan

- Verified 100/100 requests produce tool calls on gpt-oss-120b (default temperature)

Related Issue: #33966