[Bugfix] Fix Responses API harmony streaming: token splitting, missing done events, nested sequence_number #37071
Conversation
Code Review
This pull request introduces two important fixes for streaming with harmony models in the Responses API. The first fix correctly addresses an issue where done events were not being emitted for the final message in a stream by adding post-loop cleanup logic, which is a robust solution. The second fix resolves a bug where nested items in streaming events retained an incorrect placeholder sequence_number, by ensuring the sequence number is propagated to these nested items. The changes are well-targeted, clear, and improve the correctness of the streaming API. I have reviewed the code and found no issues.
Hi @Pradyun92, the pre-commit checks have failed. Please run `uv pip install pre-commit`, `pre-commit install`, and `pre-commit run --all-files`. Then, commit the changes and push to your branch.
Force-pushed from 8442a38 to 55591da.
Force-pushed from 55591da to ba943c3.
…g done events, nested sequence_number

Three fixes for Responses API streaming with harmony models:

1. Multi-token RequestOutput splitting for speculative decoding: With Eagle, RequestOutputs can contain multiple tokens that span channel boundaries. StreamingHarmonyContext.append_output() processes all tokens in a loop but only yields once, losing intermediate channel transitions (e.g., reasoning to function call). Fix: Split multi-token RequestOutputs into single-token ones before append_output() so the harmony parser processes tokens one at a time.

2. Final message done events not emitted: During streaming, done events (output_text.done, content_part.done, output_item.done) are only emitted when a new message starts (is_expecting_start). The last message never triggers this because the generator ends. Fix: After the async for loop, emit done events for the final message.

3. Nested item sequence_number stuck at -1: Events are created with placeholder sequence_number=-1. _increment_sequence_number_and_return fixes the top-level event but not nested items (e.g., ResponseFunctionToolCall inside ResponseOutputItemDoneEvent.item). Fix: Also set sequence_number on nested item if present.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Claude
Force-pushed from ba943c3 to 858df9c.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Three fixes for Responses API streaming with harmony models (e.g., gpt_oss):

Bug 1: Multi-token `RequestOutput` loses intermediate channel transitions

With Eagle speculative decoding, `RequestOutput` objects contain multiple `token_ids`. In `_generate_with_builtin_tools`, the entire multi-token output is passed to `StreamingHarmonyContext.append_output()`, which processes all tokens in a loop but only yields the context once. If the batch crosses channel boundaries (e.g., reasoning → function call), intermediate channel transitions and their content are lost.

Fix: In `_generate_with_builtin_tools` (`engine/serving.py`), split multi-token `RequestOutput` objects into single-token ones for `StreamingHarmonyContext` before calling `append_output()`. Each single-token output gets its own `yield context`, ensuring the harmony parser processes tokens one at a time.

Bug 2: Final message done events not emitted
In `_process_harmony_streaming_events`, done events (`response.output_text.done`, `response.content_part.done`, `response.output_item.done`) are only emitted when the next message starts (`is_expecting_start()`). The last message never triggers this because the `async for` loop ends first.

Symptom: Streamed content is truncated. For example, streaming gives `"2 + 2 equals"` but `response.completed` has `"2 + 2 equals 4."`.
Fix: After the `async for` loop in `_process_harmony_streaming_events`, emit done events for the final message.
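As an illustration only, the post-loop flush described above can be sketched roughly as follows. The generator and event dicts here are simplified stand-ins, not vLLM's actual `_process_harmony_streaming_events` signature or event classes:

```python
# Simplified sketch of the Bug 2 fix. Inside the loop, done events fire only
# when the *next* message starts, so the final message's done event must be
# flushed after the loop ends (hypothetical event shapes, not vLLM's).

async def process_streaming_events(outputs):
    current_text = []
    async for out in outputs:
        if out.is_expecting_start and current_text:
            # A new message begins: close out the previous one.
            yield {"type": "response.output_text.done",
                   "text": "".join(current_text)}
            current_text = []
        current_text.append(out.delta)
        yield {"type": "response.output_text.delta", "delta": out.delta}
    # The fix: the stream ended without a trailing "start", so emit the
    # done event for the final message here.
    if current_text:
        yield {"type": "response.output_text.done",
               "text": "".join(current_text)}
```

Without the trailing `if`, the last message's text would only ever appear in `response.completed`, never in a done event.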
Bug 3: `sequence_number=-1` in nested response items
Events in `streaming_events.py` are created with placeholder `sequence_number=-1`. `_increment_sequence_number_and_return` fixes the top-level event but not nested items (e.g., `ResponseFunctionToolCall` inside `ResponseOutputItemDoneEvent.item`).
Fix: Also set `sequence_number` on nested `item` attributes.
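A minimal sketch of this propagation, using simplified stand-in classes rather than the real `ResponseOutputItemDoneEvent`/`ResponseFunctionToolCall` types:

```python
# Sketch of the Bug 3 fix (hypothetical simplified types). The counter must
# patch the placeholder sequence_number=-1 on the nested `item` payload as
# well as on the top-level event.
from dataclasses import dataclass


@dataclass
class ToolCallItem:            # stand-in for ResponseFunctionToolCall
    name: str
    sequence_number: int = -1


@dataclass
class OutputItemDoneEvent:     # stand-in for ResponseOutputItemDoneEvent
    item: ToolCallItem
    sequence_number: int = -1


class SequenceCounter:
    """Assigns monotonically increasing sequence numbers to outgoing events."""

    def __init__(self):
        self._seq = 0

    def increment_and_return(self, event):
        self._seq += 1
        event.sequence_number = self._seq
        # The fix: also patch a nested item, which was previously left at -1.
        nested = getattr(event, "item", None)
        if nested is not None and hasattr(nested, "sequence_number"):
            nested.sequence_number = self._seq
        return event
```

The `getattr`/`hasattr` guard keeps the helper safe for events that have no nested `item`.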
Not duplicating existing PRs: #36445 is about non-harmony models. No other open PRs target these specific issues.
AI assistance: This PR was developed with AI assistance (Claude). The submitter has reviewed all changes and tested end-to-end.
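For illustration, the per-token splitting described under Bug 1 might look roughly like this. The `Chunk` class is a hypothetical stand-in for vLLM's `RequestOutput`, not the real type:

```python
# Sketch of the Bug 1 fix (simplified, assumed types): split a multi-token
# output into single-token copies so the harmony parser sees exactly one
# token per append_output() call and no channel transition is skipped.
from dataclasses import dataclass, replace


@dataclass
class Chunk:                   # stand-in for vLLM's RequestOutput
    request_id: str
    token_ids: list


def split_into_single_token_outputs(output):
    """Yield one single-token copy of `output` per token."""
    for tok in output.token_ids:
        yield replace(output, token_ids=[tok])
```

Each yielded copy preserves the other fields of the original output, so downstream code that keys on `request_id` is unaffected.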
Test Plan
```bash
# Start vLLM with a harmony model + Eagle speculative decoding
python -m vllm.entrypoints.openai.api_server --model

# Test streaming text (checks for last-token loss)
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": true, "input": "What is 2+2? Answer briefly."}'

# Test tool call (checks for sequence_number=-1 and channel transitions)
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "", "stream": true, "input": "What is the weather in Tokyo?",
       "tools": [{"type": "function", "name": "get_weather", "description": "Get weather",
       "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}]}'
```
Test Result
Bug 1 — Before: With Eagle, multi-token batches crossing channel boundaries lose intermediate content.
Bug 1 — After: Each token processed individually; all channel transitions preserved.
Bug 2 — Before: Streaming gives `"2 + 2 equals"`, completed gives `"2 + 2 equals 4."` — last tokens lost.
Bug 2 — After: Streamed content matches completed content exactly.
Bug 3 — Before: `ResponseFunctionToolCall` nested in `response.output_item.done` has `sequence_number: -1`.
Bug 3 — After: Nested item carries the correct sequence number.
Tested with the gpt-oss-120b model across all scenarios.
Essential Elements of an Effective PR Description Checklist