
[Bugfix] Fix Responses API harmony streaming: token splitting, missing done events, nested sequence_number#37071

Open
Pradyun92 wants to merge 1 commit into vllm-project:main from Pradyun92:fix/responses-harmony-streaming-bugs

Conversation


@Pradyun92 Pradyun92 commented Mar 14, 2026

Purpose

Three fixes for Responses API streaming with harmony models (e.g., gpt_oss):

Bug 1: Multi-token RequestOutput loses intermediate channel transitions

With Eagle speculative decoding, RequestOutput objects contain multiple token_ids. In _generate_with_builtin_tools, the entire multi-token output is passed to StreamingHarmonyContext.append_output(), which processes all tokens in a loop but only yields the context once. If the batch crosses channel boundaries (e.g., reasoning → function call), intermediate channel transitions and their content are lost.

Fix: In _generate_with_builtin_tools (engine/serving.py), split multi-token RequestOutput objects into single-token ones for StreamingHarmonyContext before calling append_output(). Each single-token output gets its own yield context, ensuring the harmony parser processes tokens one at a time.
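A minimal sketch of the splitting idea (the dataclass fields here are simplified stand-ins, not vLLM's actual `RequestOutput` definition): each token in a speculative-decoding batch becomes its own single-token output, so the harmony parser sees one token per `append_output()` call.

```python
# Illustrative sketch only -- field names are simplified assumptions,
# not vLLM's real RequestOutput/CompletionOutput types.
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class CompletionOutput:
    token_ids: list[int]
    text: str = ""


@dataclass
class RequestOutput:
    request_id: str
    outputs: list[CompletionOutput] = field(default_factory=list)


def split_into_single_token_outputs(request_output):
    """Yield one single-token RequestOutput per token in the batch,
    so each token gets its own append_output() call and yield point."""
    token_ids = request_output.outputs[0].token_ids
    if len(token_ids) <= 1:
        yield request_output
        return
    for tok in token_ids:
        single = deepcopy(request_output)
        single.outputs[0].token_ids = [tok]
        yield single
```

With this, a batch crossing a channel boundary (reasoning → function call) is handed to the parser token by token, so no intermediate transition is skipped.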

Bug 2: Final message done events not emitted

In _process_harmony_streaming_events, done events (`response.output_text.done`, `response.content_part.done`, `response.output_item.done`) are only emitted when the next message starts (`is_expecting_start()`). The last message never triggers this because the `async for` loop ends first.

Symptom: Streamed content is truncated — e.g., streaming gives `"2 + 2 equals"` but `response.completed` has `"2 + 2 equals 4."`.

Fix: After the `async for` loop in `_process_harmony_streaming_events`, emit done events for the final message.
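The pattern can be sketched as follows (event shapes and context fields here are simplified assumptions, not vLLM's exact API): done events for message N are emitted only when message N+1 starts, so the final message needs an explicit flush after the loop.

```python
# Minimal sketch of the "flush after the loop" fix -- dict-based events
# stand in for the real Responses API event types.
import asyncio


async def process_streaming_events(contexts):
    current_text = ""
    started = False
    async for ctx in contexts:
        if ctx["is_expecting_start"] and started:
            # A new message is starting: close out the previous one.
            yield {"type": "response.output_text.done", "text": current_text}
            current_text = ""
        started = True
        current_text += ctx["delta"]
        yield {"type": "response.output_text.delta", "delta": ctx["delta"]}
    # Fix: the last message never sees a following start event,
    # so emit its done event here, after the async for loop ends.
    if started:
        yield {"type": "response.output_text.done", "text": current_text}
```

Without the post-loop block, the stream in the "2 + 2" example would end on the last delta and the accumulated text would never be emitted as a done event.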

Bug 3: `sequence_number=-1` in nested response items

Events in `streaming_events.py` are created with placeholder `sequence_number=-1`. `_increment_sequence_number_and_return` fixes the top-level event but not nested items (e.g., `ResponseFunctionToolCall` inside `ResponseOutputItemDoneEvent.item`).

Fix: Also set `sequence_number` on nested `item` attributes.
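A sketch of the propagation fix (the dataclasses and counter class below are simplified illustrations, not vLLM's real event types or helper): the counter stamps the top-level event and, with the fix, any nested `item` that also carries a `sequence_number` field.

```python
# Illustrative sketch only -- simplified stand-ins for the real
# ResponseOutputItemDoneEvent / ResponseFunctionToolCall types.
from dataclasses import dataclass


@dataclass
class ResponseFunctionToolCall:
    name: str
    sequence_number: int = -1  # placeholder until the counter stamps it


@dataclass
class ResponseOutputItemDoneEvent:
    item: ResponseFunctionToolCall
    sequence_number: int = -1


class SequenceCounter:
    def __init__(self):
        self._n = 0

    def increment_and_return(self, event):
        self._n += 1
        event.sequence_number = self._n
        # Fix: also stamp the nested item, if the event carries one.
        nested = getattr(event, "item", None)
        if nested is not None and hasattr(nested, "sequence_number"):
            nested.sequence_number = self._n
        return event
```

Before the `getattr` check, only the outer event was updated, leaving the nested tool call with the `-1` placeholder.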

Not duplicating existing PRs: #36445 is about non-harmony models. No other open PRs target these specific issues.

AI assistance: This PR was developed with AI assistance (Claude). The submitter has reviewed all changes and tested end-to-end.

Test Plan

```bash
# Start vLLM with a harmony model + Eagle speculative decoding
python -m vllm.entrypoints.openai.api_server --model

# Test streaming text (checks for last-token loss)
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": true, "input": "What is 2+2? Answer briefly."}'

# Test tool call (checks for sequence_number=-1 and channel transitions)
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": true, "input": "What is the weather in Tokyo?",
    "tools": [{"type": "function", "name": "get_weather", "description": "Get weather",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}]}'
```

Test Result

Bug 1 — Before: With Eagle, multi-token batches crossing channel boundaries lose intermediate content.
Bug 1 — After: Each token processed individually; all channel transitions preserved.

Bug 2 — Before: Streaming gives `"2 + 2 equals"`, completed gives `"2 + 2 equals 4."` — last tokens lost.
Bug 2 — After: Streamed content matches completed content exactly.

Bug 3 — Before: `ResponseFunctionToolCall` nested in `response.output_item.done` has `sequence_number: -1`.
Bug 3 — After: Nested item carries the correct sequence number.

Tested with gpt-oss-120b model across all scenarios.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR
  • The test plan
  • The test results
  • (Optional) Documentation update
  • (Optional) Release notes update


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important fixes for streaming with harmony models in the Responses API. The first fix correctly addresses an issue where done events were not being emitted for the final message in a stream by adding a post-loop cleanup logic, which is a robust solution. The second fix resolves a bug where nested items in streaming events retained an incorrect placeholder sequence_number, by ensuring the sequence number is propagated to these nested items. The changes are well-targeted, clear, and improve the correctness of the streaming API. I have reviewed the code and found no issues.


mergify bot commented Mar 14, 2026

Hi @Pradyun92, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@Pradyun92 Pradyun92 force-pushed the fix/responses-harmony-streaming-bugs branch from 8442a38 to 55591da Compare March 14, 2026 21:41

@Pradyun92 Pradyun92 force-pushed the fix/responses-harmony-streaming-bugs branch from 55591da to ba943c3 Compare March 14, 2026 21:54
…g done events, nested sequence_number

Three fixes for Responses API streaming with harmony models:

1. Multi-token RequestOutput splitting for speculative decoding: With
   Eagle, RequestOutputs can contain multiple tokens that span channel
   boundaries. StreamingHarmonyContext.append_output() processes all
   tokens in a loop but only yields once, losing intermediate channel
   transitions (e.g., reasoning to function call). Fix: Split multi-
   token RequestOutputs into single-token ones before append_output()
   so the harmony parser processes tokens one at a time.

2. Final message done events not emitted: During streaming, done events
   (output_text.done, content_part.done, output_item.done) are only
   emitted when a new message starts (is_expecting_start). The last
   message never triggers this because the generator ends. Fix: After
   the async for loop, emit done events for the final message.

3. Nested item sequence_number stuck at -1: Events are created with
   placeholder sequence_number=-1. _increment_sequence_number_and_return
   fixes the top-level event but not nested items (e.g.,
   ResponseFunctionToolCall inside ResponseOutputItemDoneEvent.item).
   Fix: Also set sequence_number on nested item if present.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Claude
@Pradyun92 Pradyun92 force-pushed the fix/responses-harmony-streaming-bugs branch from ba943c3 to 858df9c Compare March 14, 2026 22:05
@Pradyun92 Pradyun92 changed the title [Bugfix] Fix Responses API harmony streaming: missing done events and nested sequence_number [Bugfix] Fix Responses API harmony streaming: token splitting, missing done events, nested sequence_number Mar 14, 2026

mergify bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Pradyun92.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

bug (Something isn't working), frontend, gpt-oss (Related to GPT-OSS models), needs-rebase

Projects

Status: To Triage
