[Bugfix] Fix Responses API harmony streaming: token splitting, missing done events, nested sequence_number #37071
Conversation
Code Review
This pull request introduces two important fixes for streaming with harmony models in the Responses API. The first fix correctly addresses an issue where done events were not being emitted for the final message in a stream by adding post-loop cleanup logic, which is a robust solution. The second fix resolves a bug where nested items in streaming events retained an incorrect placeholder sequence_number, by ensuring the sequence number is propagated to these nested items. The changes are well-targeted, clear, and improve the correctness of the streaming API. I have reviewed the code and found no issues.
Hi @Pradyun92, the pre-commit checks have failed. Please run `uv pip install pre-commit`, `pre-commit install`, and `pre-commit run --all-files`. Then, commit the changes and push to your branch.
Force-pushed from 8442a38 to 55591da.
Force-pushed from 55591da to ba943c3.
…g done events, nested sequence_number

Three fixes for Responses API streaming with harmony models:

1. Multi-token RequestOutput splitting for speculative decoding: With Eagle, RequestOutputs can contain multiple tokens that span channel boundaries. StreamingHarmonyContext.append_output() processes all tokens in a loop but only yields once, losing intermediate channel transitions (e.g., reasoning to function call). Fix: Split multi-token RequestOutputs into single-token ones before append_output() so the harmony parser processes tokens one at a time.

2. Final message done events not emitted: During streaming, done events (output_text.done, content_part.done, output_item.done) are only emitted when a new message starts (is_expecting_start). The last message never triggers this because the generator ends. Fix: After the async for loop, emit done events for the final message.

3. Nested item sequence_number stuck at -1: Events are created with placeholder sequence_number=-1. _increment_sequence_number_and_return fixes the top-level event but not nested items (e.g., ResponseFunctionToolCall inside ResponseOutputItemDoneEvent.item). Fix: Also set sequence_number on nested item if present.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Claude
Force-pushed from ba943c3 to 858df9c.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Three fixes for Responses API streaming with harmony models (e.g., gpt_oss):

Bug 1: Multi-token `RequestOutput` loses intermediate channel transitions

With Eagle speculative decoding, `RequestOutput` objects contain multiple `token_ids`. In `_generate_with_builtin_tools`, the entire multi-token output is passed to `StreamingHarmonyContext.append_output()`, which processes all tokens in a loop but only yields the context once. If the batch crosses channel boundaries (e.g., reasoning → function call), intermediate channel transitions and their content are lost.

Fix: In `_generate_with_builtin_tools` (`engine/serving.py`), split multi-token `RequestOutput` objects into single-token ones for `StreamingHarmonyContext` before calling `append_output()`. Each single-token output gets its own `yield context`, ensuring the harmony parser processes tokens one at a time.

Bug 2: Final message done events not emitted
In `_process_harmony_streaming_events`, done events (`response.output_text.done`, `response.content_part.done`, `response.output_item.done`) are only emitted when the next message starts (`is_expecting_start()`). The last message never triggers this because the `async for` loop ends first.

Symptom: Streamed content is truncated. For example, streaming gives `"2 + 2 equals"` but `response.completed` has `"2 + 2 equals 4."`.
Fix: After the `async for` loop in `_process_harmony_streaming_events`, emit done events for the final message.
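As an illustration only, the post-loop flush described above can be sketched roughly as follows. The generator and event dicts here are simplified stand-ins, not vLLM's actual `_process_harmony_streaming_events` signature or event classes:

```python
# Simplified sketch of the Bug 2 fix. Inside the loop, done events fire only
# when the *next* message starts, so the final message's done event must be
# flushed after the loop ends (hypothetical event shapes, not vLLM's).

async def process_streaming_events(outputs):
    current_text = []
    async for out in outputs:
        if out.is_expecting_start and current_text:
            # A new message begins: close out the previous one.
            yield {"type": "response.output_text.done",
                   "text": "".join(current_text)}
            current_text = []
        current_text.append(out.delta)
        yield {"type": "response.output_text.delta", "delta": out.delta}
    # The fix: the stream ended without a trailing "start", so emit the
    # done event for the final message here.
    if current_text:
        yield {"type": "response.output_text.done",
               "text": "".join(current_text)}
```

Without the trailing `if`, the last message's text would only ever appear in `response.completed`, never in a done event.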
Bug 3: `sequence_number=-1` in nested response items
Events in `streaming_events.py` are created with placeholder `sequence_number=-1`. `_increment_sequence_number_and_return` fixes the top-level event but not nested items (e.g., `ResponseFunctionToolCall` inside `ResponseOutputItemDoneEvent.item`).
Fix: Also set `sequence_number` on nested `item` attributes.
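A minimal sketch of this propagation, using simplified stand-in classes rather than the real `ResponseOutputItemDoneEvent`/`ResponseFunctionToolCall` types:

```python
# Sketch of the Bug 3 fix (hypothetical simplified types). The counter must
# patch the placeholder sequence_number=-1 on the nested `item` payload as
# well as on the top-level event.
from dataclasses import dataclass


@dataclass
class ToolCallItem:            # stand-in for ResponseFunctionToolCall
    name: str
    sequence_number: int = -1


@dataclass
class OutputItemDoneEvent:     # stand-in for ResponseOutputItemDoneEvent
    item: ToolCallItem
    sequence_number: int = -1


class SequenceCounter:
    """Assigns monotonically increasing sequence numbers to outgoing events."""

    def __init__(self):
        self._seq = 0

    def increment_and_return(self, event):
        self._seq += 1
        event.sequence_number = self._seq
        # The fix: also patch a nested item, which was previously left at -1.
        nested = getattr(event, "item", None)
        if nested is not None and hasattr(nested, "sequence_number"):
            nested.sequence_number = self._seq
        return event
```

The `getattr`/`hasattr` guard keeps the helper safe for events that have no nested `item`.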
Not duplicating existing PRs: #36445 is about non-harmony models. No other open PRs target these specific issues.
AI assistance: This PR was developed with AI assistance (Claude). The submitter has reviewed all changes and tested end-to-end.
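For illustration, the per-token splitting described under Bug 1 might look roughly like this. The `Chunk` class is a hypothetical stand-in for vLLM's `RequestOutput`, not the real type:

```python
# Sketch of the Bug 1 fix (simplified, assumed types): split a multi-token
# output into single-token copies so the harmony parser sees exactly one
# token per append_output() call and no channel transition is skipped.
from dataclasses import dataclass, replace


@dataclass
class Chunk:                   # stand-in for vLLM's RequestOutput
    request_id: str
    token_ids: list


def split_into_single_token_outputs(output):
    """Yield one single-token copy of `output` per token."""
    for tok in output.token_ids:
        yield replace(output, token_ids=[tok])
```

Each yielded copy preserves the other fields of the original output, so downstream code that keys on `request_id` is unaffected.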
Test Plan
```bash
# Start vLLM with a harmony model + Eagle speculative decoding
python -m vllm.entrypoints.openai.api_server --model

# Test streaming text (checks for last-token loss)
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": true, "input": "What is 2+2? Answer briefly."}'

# Test tool call (checks for sequence_number=-1 and channel transitions)
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": true, "input": "What is the weather in Tokyo?",
       "tools": [{"type": "function", "name": "get_weather", "description": "Get weather",
       "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}]}'
```
Test Result
Bug 1 — Before: With Eagle, multi-token batches crossing channel boundaries lose intermediate content.
Bug 1 — After: Each token processed individually; all channel transitions preserved.
Bug 2 — Before: Streaming gives `"2 + 2 equals"`, completed gives `"2 + 2 equals 4."` — last tokens lost.
Bug 2 — After: Streamed content matches completed content exactly.
Bug 3 — Before: `ResponseFunctionToolCall` nested in `response.output_item.done` has `sequence_number: -1`.
Bug 3 — After: Nested item carries the correct sequence number.
Tested with the gpt-oss-120b model across all scenarios.
Essential Elements of an Effective PR Description Checklist