[responsesAPI] move streaming logic to parser#37007
[responsesAPI] move streaming logic to parser#37007qandrew wants to merge 5 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the streaming parsing logic by moving it from serving.py into the parser module. This is a positive architectural change that centralizes parsing responsibilities. A new extract_streaming_delta method and a StreamingParseState class are introduced to handle this. However, I've identified a critical issue in the refactored logic where a portion of the streaming message could be lost during the transition from reasoning to tool-use parsing. I have provided a detailed comment and a suggested fix for this issue.
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: Andrew Xia <axia@meta.com>
a017fd8 to
16fb5f0
Compare
|
cc @chaunceyjiang @sfeng33 please take a look :) |
|
|
||
| # ========== Streaming Event Generation ========== | ||
|
|
||
| async def process_streaming_events( |
There was a problem hiding this comment.
LGTM.
non-blocking: I know this function was moved from serving.py. I feel the function is a bit too long and could probably be split into a few smaller functions.
There was a problem hiding this comment.
yeah definitely makes sense! I can do that in a follow up PR :)
There was a problem hiding this comment.
@chaunceyjiang this method is response api specific. In the unified parser, the scope is processing model's output, and return the content, reasoning, tool calls back to api serving layer. In other words, I think this method as well as extract_response_outputs don't belong in the parser's scope, wdyt?
sfeng33
left a comment
There was a problem hiding this comment.
When self.parser is None (no tool/reasoning parser configured), the old code (before PR) still emitted the full event
lifecycle:
- response.output_item.added
- response.content_part.added
- response.output_text.delta (per chunk)
- response.output_text.done
- response.content_part.done
- response.output_item.done
The new fallback only emits bare ResponseTextDeltaEvent with hardcoded item_id="", output_index=0, content_index=0 — no start or done lifecycle events.
Is this expected?
Signed-off-by: Andrew Xia <axia@fb.com>
thanks @sfeng33 for the catch! Ideally we never hit this code path bc there should be a reasoning/tool parser for serving models; I updated the logic and added a UT to prevent regressions. |
|
@sfeng33 @chaunceyjiang ready for re-review :) |
|
Hey @qandrew, this PR is not a pure no-functional-change refactor, since the change is quite large, if possible, can you keep this PR to the non-functional relocation changes, and leave the new added logic to following PRs so that it can be more throughly tested and reviewed? For example, the extract_streaming_delta method adds new logic:
The no-parser fallback path has also changed. |
qandrew
left a comment
There was a problem hiding this comment.
Hi @sfeng33 , thanks for the feedback!
It introduces a new StreamingParseState dataclass with mutable per-request state
this is needed to keep the no-functional-change refactor
The reasoning-to-tool transition logic
I don't think i see any functional changes in this PR, added comments for specific lines, please let me know which specific lines you see issues?
The no-parser fallback path has also changed.
This logic did not change, as I added a unit test to guard the fact that no changes were made. If you'd prefer I can separate out the unit test to a different PR to make it more explicit.
we added the 'ready' tag in advance, and as all CI tests pass, it shows that there's no functional changes in this PR

| delta_text = output.text | ||
| delta_token_ids = as_list(output.token_ids) | ||
| current_text = previous_text + delta_text | ||
| current_token_ids = previous_token_ids + delta_token_ids |
There was a problem hiding this comment.
The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)
@sfeng33 here is the old code
| ) | ||
|
|
||
| current_text = state.previous_text + delta_text | ||
| current_token_ids = state.previous_token_ids + delta_token_ids |
There was a problem hiding this comment.
The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)
@sfeng33 here is the new code, we can seee that the bookkeeping didn't change
| current_text = "" | ||
|
|
||
| if reasoning_ended: | ||
| if not tool_call_text_started: |
There was a problem hiding this comment.
The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)
@sfeng33 here is the old code, we see the bookkeeping logic hasn't changed
| else: | ||
| current_text = "" | ||
|
|
||
| if state.reasoning_ended: |
There was a problem hiding this comment.
The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)
@sfeng33 here is the new code, we see the logic transition is the same
|
This pull request has merge conflicts that must be resolved before it can be |
Purpose
similar to #33281, this PR moves all the responsesAPI streaming logic inside to DelegatingParser. No behavioral changes in this PR. However, now we can have model specific behavior for responsesAPI streaming (ie maybe in ResponseReasoningDeltaEvent, kimi would want to output additional metadata that the role is assitant).
Implements streaming logic for #32713
Test Plan
Test Result
https://gist.github.com/qandrew/ceff5bb4a0b36c6a62ee41d6df680d3f
Also passes a new logprob test in #37126