[Bugfix] Fix Harmony streaming cross-channel delta accumulation #36011
Open
will-deines wants to merge 7 commits into vllm-project:main from
Conversation
96395c1 to 111ce47
Contributor
Code Review
This pull request addresses a bug in Harmony streaming where content could leak across different channels within a single token batch. The fix introduces a mechanism to track content deltas per channel in StreamingHarmonyContext and refactors the event emission logic in streaming_events.py to process these deltas individually. This ensures that content is correctly attributed to its channel, preventing data leakage and loss. The implementation is clear, and the accompanying test modifications are appropriate. The changes effectively resolve the issue.
111ce47 to 9d0f0ad
…hannel content leaks

When --stream-interval yields a batch of tokens that crosses a Harmony channel boundary (e.g. analysis → commentary), append_output accumulated all content into a single last_content_delta string. emit_content_delta_events then classified the entire blob using the channel at the END of the batch, causing analysis-tail text to leak into output_text.delta events and commentary content to be lost.

Track (channel, recipient, delta) triples per contiguous run in the batch so each segment is emitted with the correct event type.

Signed-off-by: Will Deines <will@garr.io>
9e7497e to d48bf7b
This pull request has merge conflicts that must be resolved before it can be merged.
…ng-cross-channel-delta Signed-off-by: Will Deines <will@garr.io>
The Harmony streaming path only emits done events (output_text.done, content_part.done, output_item.done, etc.) when is_expecting_start() fires at a message boundary. For the last message in the stream, no subsequent message triggers this, so done events are never emitted.

Add a post-loop flush that constructs a synthetic HarmonyMessage from the parser's in-progress state and passes it to the existing emit_previous_item_done_events() dispatcher, guarded by state.sent_output_item_added to prevent double-emission.

Signed-off-by: Will Deines <will@garr.io>
…treaming

Within a single Harmony message, the channel can change (e.g. analysis → final, analysis → functions.*) without a <|start|> boundary. The previous fix only handled the final message's done events via a post-loop flush, but mid-message channel transitions were never emitting done events for the outgoing channel.

Add channel-transition tracking to StreamingState (last_channel, last_recipient, accumulated_text). When emit_content_delta_events detects a channel/recipient change, it emits done events for the previous channel and resets state before starting the new one.

The post-loop flush now uses state.accumulated_text and state.last_channel/last_recipient instead of parser state, which is more reliable since the parser's current_content may span multiple channels.

Signed-off-by: Will Deines <will@garr.io>
- Add sent_output_item_added=True to emit_function_call_delta_events,
enabling post-loop flush and mid-message channel transitions for
function calls (was the only delta emitter missing this flag)
- Use state-based done events at is_expecting_start() boundaries
instead of parser message, eliminating race with emit_content_delta_events
- Always use arguments="" in first-delta output_item.added for simple
path tool calls, preventing double-counted arguments
- Fix test_code_interpreter: client.timeout is a Timeout object, not a number
- Fix test_mcp_tool_multi_turn: Message.to_dict() flattens author fields
to top level, so use msg.get("role") instead of msg.get("author", {}).get("role")
- Fix test_chat: use instructions= param instead of system role in input
array (Harmony models build their own system message)
- Fix test_logprobs: skip for gpt-oss models (logprobs intentionally rejected)
Signed-off-by: Will Deines <will@garr.io>
Summary
Fix Harmony streaming content leaks when a token batch crosses a channel boundary (e.g. analysis → commentary). With `--stream-interval > 1`, vLLM yields up to N tokens at a time. When a batch spans two channels, `append_output` accumulated all content deltas into a single string, and `emit_content_delta_events` classified the entire blob using the channel at the end of the batch, causing analysis-tail text to leak into `response.output_text.delta` events and actual commentary content to be lost or misclassified.

Symptom: Streaming clients see an empty `output_text` stream (no `response.output_text.delta` events for visible text), even though `response.completed` contains the correct `output_text`.

Fix: Track `(channel, recipient, delta)` triples per contiguous run within each batch, then emit each segment with the correct event type.

Related Issues & PRs
Directly Addressed

Empty `tool_calls`: with `stream_interval > 1`, multiple messages complete within a single yield, and content gets misclassified by end-of-batch channel state.

Prior Fixes for the Same Class of Bug

A bug in `emit_content_delta_events()` where preamble content was misrouted as reasoning.

Code Introduced By

`emit_content_delta_events(ctx, state)` reads `ctx.last_content_delta` + `ctx.parser.current_channel` (end-of-batch state). The `_emit_*` methods were moved into `streaming_events.py`, establishing the two-layer dispatcher + leaf-helper architecture.

Related Open PRs

- Refactors `extract_harmony_streaming_delta` (Chat Completions streaming) with a `HarmonyStreamingState` dataclass. Our fix addresses the Responses API streaming path (`emit_content_delta_events`). Both are needed.
- `reasoning_content` incorrectly flushed into `content` in the final streamed chunk. Different mechanism but same symptom family.
- `base_index` changes as messages complete within a batch. Another multi-token batch state-tracking bug.

Root Cause
File: `vllm/entrypoints/openai/responses/context.py` — `StreamingHarmonyContext.append_output()`
File: `vllm/entrypoints/openai/responses/streaming_events.py` — `emit_content_delta_events()`

When `--stream-interval 20` yields a batch that starts in `analysis` and transitions to `commentary`:

- analysis tokens accumulate into `last_delta_text`
- `commentary` tokens append to the same `last_delta_text`
- `current_channel` is now `commentary` (end of batch)
- `emit_content_delta_events` emits the entire blob as `output_text.delta`, leaking analysis content into visible text
- the analysis portion is never emitted as reasoning because `current_channel` has moved on, so it gets dropped

Fix
1. `context.py` — Track per-channel deltas

2. `streaming_events.py` — Emit per-channel

Extract the dispatch logic into `_emit_delta_for_channel(channel, recipient, delta, state)`, then iterate over `ctx.channel_deltas`. This preserves the exact same dispatch rules (commentary → text, analysis → reasoning, functions.* → function call, etc.) but applies them per-segment instead of per-batch.
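As a rough illustration of the per-segment scheme (simplified stand-ins, not vLLM's actual classes; event names other than `response.output_text.delta` are placeholders), coalescing and per-run dispatch look like this:

```python
# Sketch of the fix: coalesce per-token (channel, recipient, delta) triples
# into contiguous runs, then dispatch each run by *its own* channel rather
# than by the channel at the end of the batch.

def coalesce_deltas(tokens):
    """Build channel_deltas the way append_output is described to:
    consecutive tokens with the same (channel, recipient) merge into one run."""
    runs: list[tuple[str | None, str | None, str]] = []
    for channel, recipient, delta in tokens:
        if runs and runs[-1][:2] == (channel, recipient):
            runs[-1] = (channel, recipient, runs[-1][2] + delta)
        else:
            runs.append((channel, recipient, delta))
    return runs


def emit_content_delta_events(channel_deltas, emit):
    # Per-segment dispatch; the buggy code classified the whole batch once.
    for channel, recipient, delta in channel_deltas:
        if recipient is not None and recipient.startswith("functions."):
            emit("function_call_arguments.delta", delta)  # placeholder name
        elif channel == "analysis":
            emit("reasoning_text.delta", delta)           # placeholder name
        else:
            emit("response.output_text.delta", delta)


# A batch that crosses analysis → final no longer leaks analysis text:
batch = [("analysis", None, "Let me "), ("analysis", None, "think."),
         ("final", None, "Hello "), ("final", None, "there!")]
events = []
emit_content_delta_events(coalesce_deltas(batch),
                          lambda kind, d: events.append((kind, d)))
```

The coalescing step keeps `channel_deltas` short (one entry per contiguous run, not per token), so the dispatcher loop does the same amount of work as before for single-channel batches.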
Files Changed
- `vllm/entrypoints/openai/responses/context.py`: add a `channel_deltas` list to `StreamingHarmonyContext`; populate it per-token in `append_output` with coalescing for consecutive same-channel tokens
- `vllm/entrypoints/openai/responses/streaming_events.py`: add `_emit_delta_for_channel()`; rewrite `emit_content_delta_events()` to iterate over `ctx.channel_deltas`
- `tests/entrypoints/openai/test_serving_responses.py`: update the `_make_ctx` mock to set `channel_deltas` so existing tests work with the new code path

Decisions to Debate
1. Per-token channel tracking vs. message-level diffing
What we chose: Record each token's `(channel, recipient)` at process time and coalesce consecutive runs.

Alternative: PR #35449 takes a different approach for the Chat Completions path — it diffs `parser.messages` before/after processing a batch to derive what changed. This is more robust against parser state quirks but requires more complex state management.

Why we chose this: For the Responses API path, the per-token approach is minimal and surgical. The `StreamableParser` already exposes `current_channel` and `current_recipient` per token. We just need to not throw that information away by accumulating into a single string. The message-diffing approach would be over-engineering for this specific dispatcher.

What reviewers might disagree with: Per-token tracking assumes `current_channel` is always accurate at the moment `last_content_delta` is set. If the parser ever updates `current_channel` before or after setting `last_content_delta`, the attribution could be wrong. In practice this doesn't happen — the parser updates both atomically in `process()`.

2. Keeping `last_content_delta` alongside `channel_deltas`

What we chose: We still populate `self.last_content_delta` (the accumulated string) for backward compatibility, even though `emit_content_delta_events` now uses `channel_deltas` exclusively.

Alternative: Remove `last_content_delta` entirely and make all consumers use `channel_deltas`.

Why we chose this: `last_content_delta` is read in other code paths (e.g. `_update_num_reasoning_tokens`, potentially external consumers). Removing it would expand the blast radius. The cost of maintaining both is negligible — it's one extra string concatenation per batch.

3. Tuple vs. dataclass for channel_deltas entries

What we chose: `list[tuple[str | None, str | None, str]]` — lightweight, no new classes.

Alternative: A `ChannelDelta` dataclass with named fields (`channel`, `recipient`, `delta`) for clarity.

Why we chose this: The list is ephemeral (rebuilt every `append_output` call), consumed in one place (`emit_content_delta_events`), and destructured immediately. A dataclass would add a class definition for a 3-field struct used in exactly two lines of code.

Test Plan
- All `test_serving_responses.py` tests pass (including 5 preamble streaming tests)
- All `test_context.py` tests pass
- `response.completed.output_text`: `test_streaming_response` was FAILED (assert '' == 'Hello there!') before the fix, now PASSED; `test_streaming_text_matches_final` also asserts the same invariant with a Harmony-specific prompt
- `--stream-interval 20`: the server intentionally runs at maximum batch size to maximize cross-channel batch probability; all streaming tests pass
- `test_streaming_reasoning_events` verifies reasoning delta events fire separately from text deltas with no control tokens in either; `test_reasoning_items_present` + `test_multi_turn_reasoning_consistent` confirm reasoning items are intact in both single and multi-turn

Out of Scope (follow-ups)
- `extract_harmony_streaming_delta` for `/v1/chat/completions`: that fix and this one are complementary; both are needed.
- `base_index` drift when tool calls complete within a batch: separate mechanism, same symptom family.
- `stream_interval=1` default: setting `stream_interval=1` avoids the bug entirely but at the cost of higher SSE event volume. Not proposed here.