[Bugfix]: Fix structured output in multi-turn gpt-oss by bbrowning · Pull Request #34454 · vllm-project/vllm

bbrowning · 2026-02-12T19:25:59Z

Purpose

The logic in the gptoss_reasoning_parser that detects when the model has finished outputting reasoning content and has started outputting content to the final channel was inadvertently matching final channel markers from previous messages in multi-turn scenarios. In practice, this meant that vLLM prematurely started applying the grammar bitmasks to the entirety of the model's output in these multi-turn conversations, causing the model to deviate from its trained Harmony format and produce empty or invalid outputs.

This PR fixes things by never looking for the final channel marker in any message prior to the current one the model is generating so that we don't falsely believe the model is starting generation of the final channel unless it's actually doing so during this turn of the conversation.
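The boundary check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the token ids, the `FINAL_MARKER` sequence, and the function signature are hypothetical stand-ins, and the real parser obtains its ids from the model tokenizer.

```python
from collections.abc import Sequence

# Hypothetical token ids, made up for illustration only.
EOM = 100                 # stands in for an end-of-message token
FINAL_MARKER = (7, 8, 9)  # stands in for the tokens that open the final channel


def is_reasoning_end(
    input_ids: Sequence[int],
    marker: Sequence[int] = FINAL_MARKER,
    eom_token_id: int = EOM,
) -> bool:
    """Return True only if the final-channel marker occurs in the current message."""
    marker = tuple(marker)
    n = len(marker)
    # Walk backward from the most recent token.
    for end in range(len(input_ids), 0, -1):
        if input_ids[end - 1] == eom_token_id:
            # We reached the end of a previous message, so anything earlier
            # belongs to an older turn and must not count as a match.
            return False
        if end >= n and tuple(input_ids[end - n:end]) == marker:
            return True
    return False
```

With these made-up ids, `[7, 8, 9, 1, 2, 100, 3, 4]` (marker only in the previous turn) yields `False`, while `[7, 8, 9, 1, 2, 100, 3, 7, 8, 9, 5]` (marker in the current turn) yields `True`.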

Prior to vLLM v0.13.0 this bug existed, but we didn't actually trip over it because the way we handled multi-turn conversation state with gpt-oss models was missing important tokens, which coincidentally kept those prior conversations from matching these token id checks. Once we fixed multi-turn conversation state, structured output usage with things like json_object response formats started hitting this bug in the reasoning parser.

Fixes #32791

Test Plan

I added a unit test specifically to cover this case, following test-driven-development by ensuring the test failed initially, applied my fix, and then ensured the test passed.

The existing and new gptoss_reasoning_parser unit tests were run via:

pytest tests/reasoning/test_gptoss_reasoning_parser.py
pytest tests/v1/structured_output/test_gptoss_structural_tags.py
pytest tests/entrypoints/openai/test_gptoss_structural_tags_integration.py

Additionally, I ran the manual reproducer (labeled as case 3) in #32791:

vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {
        "role": "user",
        "content": "Respond with JSON only in the form {\"response\":\"hello\"}."
      },
      {
        "role": "assistant",
        "content": "{\"response\":\"hello\"}"
      },
      {
        "role": "user",
        "content": "Respond with JSON only in the form {\"response\":\"bye\"}."
      }
    ],
    "response_format": { "type": "json_object" },
    "max_tokens": 128,
    "temperature": 0
  }' | jq .
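For scripted checks, the same request could also be issued from Python using only the standard library. This is a sketch that mirrors the curl command above; the helper names `build_payload` and `send` are hypothetical, and it assumes the server started by the `vllm serve` command is listening on localhost:8000.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl command


def build_payload() -> dict:
    """Build the multi-turn request body used by the manual reproducer."""
    return {
        "model": "openai/gpt-oss-20b",
        "messages": [
            {"role": "user",
             "content": 'Respond with JSON only in the form {"response":"hello"}.'},
            {"role": "assistant", "content": '{"response":"hello"}'},
            {"role": "user",
             "content": 'Respond with JSON only in the form {"response":"bye"}.'},
        ],
        "response_format": {"type": "json_object"},
        "max_tokens": 128,
        "temperature": 0,
    }


def send(payload: dict, base_url: str = BASE_URL) -> dict:
    """POST the payload to the chat completions endpoint and decode the reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send(build_payload())` against a running server should return the same JSON document that the curl command pipes into `jq`.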

Test Result

All the unit tests passed.

For the manual curl test, prior to this change it gave a response with empty content:

{
  "id": "chatcmpl-81416dae965f4f7d",
  "object": "chat.completion",
  "created": 1770920903,
  "model": "openai/gpt-oss-20b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
...

After this change, the model gives the expected response:

{
  "id": "chatcmpl-9c7eb34a997d07e2",
  "object": "chat.completion",
  "created": 1770923019,
  "model": "openai/gpt-oss-20b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"response\":\"bye\"}",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "The user wants JSON only: {\"response\":\"bye\"}. So output that."
      },
      "logprobs": null,
      "finish_reason": "stop",
...
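The difference between the two responses can be checked mechanically. The helper below is a hypothetical sketch, not part of the PR's test suite; it only inspects the `choices[0].message.content` field shown in the responses above, and the sample dictionaries are trimmed copies of those responses.

```python
import json


def final_content(response: dict):
    """Return the first choice's content parsed as JSON, or None if it is empty."""
    content = response["choices"][0]["message"]["content"]
    if content is None:
        return None
    return json.loads(content)


# Trimmed versions of the two responses shown above.
before_fix = {"choices": [{"message": {"content": None, "reasoning": None}}]}
after_fix = {
    "choices": [
        {
            "message": {
                "content": "{\"response\":\"bye\"}",
                "reasoning": "The user wants JSON only: {\"response\":\"bye\"}. So output that.",
            }
        }
    ]
}
```

Here `final_content(before_fix)` returns `None`, which is the symptom reported in #32791, while `final_content(after_fix)` returns the parsed object with `response` set to `bye`.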

gemini-code-assist

Code Review

This pull request addresses a critical bug in the gptoss_reasoning_parser that caused premature termination of reasoning in multi-turn conversations, leading to incorrect structured outputs. The fix, which involves stopping the backward search for the end-of-reasoning marker upon encountering a message boundary from a previous turn, is logical and well-implemented. The inclusion of a specific unit test to cover this multi-turn scenario is a great addition and significantly improves the robustness of the parser. Overall, the changes are excellent and effectively resolve the described issue. I have one suggestion to further improve the robustness of the code.


Instead of .encode followed by taking the first token, it's cleaner to just directly use model_tokenizer.vocab to fetch single token ids. Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Ben Browning <bbrownin@redhat.com>
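The trade-off in this commit message can be illustrated with a toy tokenizer. Everything here is hypothetical: `ToyTokenizer`, its vocabulary, and its ids are made up, and the greedy `encode` only imitates how a real tokenizer may split unknown strings into several pieces. The point of the sketch is that taking the first id from `encode` can silently return the wrong token, whereas a `vocab` lookup fails loudly when the string is not a single token.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer; the ids are invented."""

    def __init__(self) -> None:
        self.vocab = {"<|end|>": 100, "<": 1, "|": 2, "end": 3, ">": 4}

    def encode(self, text: str) -> list[int]:
        """Greedy longest-match encoding over the toy vocabulary."""
        ids: list[int] = []
        i = 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in self.vocab:
                    ids.append(self.vocab[text[i:j]])
                    i = j
                    break
            else:
                raise ValueError(f"cannot encode {text[i:]!r}")
        return ids


def first_encoded_id(tokenizer: ToyTokenizer, token: str) -> int:
    """Old style: encode and keep the first id, even if there are several."""
    return tokenizer.encode(token)[0]


def single_token_id(tokenizer: ToyTokenizer, token: str) -> int:
    """New style: require the string to be exactly one vocabulary entry."""
    try:
        return tokenizer.vocab[token]
    except KeyError:
        raise ValueError(f"{token!r} is not a single token") from None
```

For a string that really is one token, both helpers agree. For a misspelled marker such as `<|end`, `first_encoded_id` quietly returns the id of `<`, while `single_token_id` raises `ValueError`.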

DarkLight1337

Thanks for fixing!

CI discovered some additional tests that use gptoss_reasoning_parser but with a mocked tokenizer. So, this adds a mocked `vocab` to that mock tokenizer so that these tests also pass. Signed-off-by: Ben Browning <bbrownin@redhat.com>

bbrowning · 2026-02-13T14:20:03Z

CI picked up some additional tests that used gptoss_reasoning_parser but with a mocked tokenizer that failed after adjusting to use .vocab instead of .encode. So, I pushed one more commit adding a vocab mock to those mock tokenizers, grepped the tests to ensure no other tests use gptoss_reasoning_parser that need updating, and updated the test plan in the PR description to reflect running the 3 unit tests that touch this code:

pytest tests/reasoning/test_gptoss_reasoning_parser.py
pytest tests/v1/structured_output/test_gptoss_structural_tags.py
pytest tests/entrypoints/openai/test_gptoss_structural_tags_integration.py

The latter two failed and were caught by CI, but are passing locally now.
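Why a mocked tokenizer needs an explicit `vocab` can be shown with `unittest.mock`. This is a sketch under assumptions: the token string `<|end|>` and the id 100 are placeholders, and `make_mock_tokenizer` is a hypothetical helper rather than the fixture used in the vLLM tests.

```python
from unittest.mock import MagicMock


def make_mock_tokenizer() -> MagicMock:
    """Build a mock tokenizer that supports both encode() and vocab lookups."""
    tokenizer = MagicMock()
    tokenizer.encode.return_value = [100]
    # Without this line, tokenizer.vocab[...] returns another MagicMock
    # instead of an int, so token id comparisons never match.
    tokenizer.vocab = {"<|end|>": 100}
    return tokenizer


bare_mock = MagicMock()
configured_mock = make_mock_tokenizer()
```

Indexing `bare_mock.vocab["<|end|>"]` does not raise, because `MagicMock` supports item access, but the result is a mock object rather than an integer. `configured_mock.vocab["<|end|>"]` returns 100.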

bbrowning · 2026-02-13T19:09:15Z

The amd-basic-correctness test failure looks unrelated, but I left a comment on the recently merged PR (32993) that added those tests so the authors are aware that test is failing on AMD hardware.

…4454) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

…4454) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Eldar Kurtic <research@neuralmagic.com>

…4454) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

…4454) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 - llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 - Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 - Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>#4 - GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>#5 - Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>#6 - AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>#7 - response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>#8 - response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as #6. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

bbrowning requested review from aarnphm and chaunceyjiang as code owners February 12, 2026 19:26

mergify bot added the gpt-oss (Related to GPT-OSS models) and bug (Something isn't working) labels Feb 12, 2026

github-project-automation bot added this to gpt-oss Issues & Enhancements Feb 12, 2026

github-project-automation bot moved this to To Triage in gpt-oss Issues & Enhancements Feb 12, 2026

bbrowning force-pushed the gptoss-multiturn-reasoning-structured branch from 65c163e to c851d60 Compare February 12, 2026 19:27

gemini-code-assist bot reviewed Feb 12, 2026

Be explicit about expecting gpt-oss eom to be a single token

aae49f9

Signed-off-by: Ben Browning <bbrownin@redhat.com>

bbrowning mentioned this pull request Feb 12, 2026

[Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object #32791

Closed

1 task

Merge remote-tracking branch 'upstream/main' into gptoss-multiturn-re…

203c9a0

…asoning-structured

DarkLight1337 reviewed Feb 13, 2026

DarkLight1337 approved these changes Feb 13, 2026

github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Feb 13, 2026

DarkLight1337 enabled auto-merge (squash) February 13, 2026 13:13

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 13, 2026

Adjust mocks for other gptoss_reasoning_parser tests

209c6bb

auto-merge was automatically disabled February 13, 2026 14:18
Head branch was pushed to by a user without write access

bbrowning requested review from NickLucche, mgoin, robertgshaw2-redhat and russellb as code owners February 13, 2026 14:18

mergify bot added structured-output v1 labels Feb 13, 2026

github-project-automation bot added this to Structured Output Feb 13, 2026

Merge branch 'main' into gptoss-multiturn-reasoning-structured

3f04cee

vllm-bot merged commit fd267bc into vllm-project:main Feb 13, 2026
45 of 47 checks passed

github-project-automation bot moved this from Ready to Done in gpt-oss Issues & Enhancements Feb 13, 2026

github-project-automation bot moved this to Done in Structured Output Feb 13, 2026

bbrowning deleted the gptoss-multiturn-reasoning-structured branch February 13, 2026 19:13

kndtran mentioned this pull request Feb 19, 2026

fix: strip channel tokens from gpt-oss model output ibm-granite/granite-common#123

Closed

4 tasks

This was referenced Mar 3, 2026

[Responses API] Structured output + reasoning via structural tag embedding #35873

Closed

[Responses API] Structured output + reasoning via structural tag embedding #35904

Open

vllm-bot merged 6 commits into vllm-project:main from bbrowning:gptoss-multiturn-reasoning-structured
