[responsesAPI] move streaming logic to parser#37007

Open
qandrew wants to merge 5 commits into vllm-project:main from qandrew:parser-streaming

Conversation

@qandrew
Contributor

@qandrew qandrew commented Mar 13, 2026

Purpose

Similar to #33281, this PR moves all the Responses API streaming logic into DelegatingParser. There are no behavioral changes in this PR. However, it makes model-specific behavior possible for Responses API streaming (e.g., in ResponseReasoningDeltaEvent, Kimi might want to output additional metadata indicating that the role is assistant).

Implements streaming logic for #32713

Test Plan

vllm serve Qwen/Qwen3-8B --reasoning-parser qwen3 --tool-call-parser qwen3
curl -X POST "http://localhost:8000/v1/responses" -H "Content-Type: application/json" -H "Authorization: Bearer dummy-api-key" -d '{
        "model": "Qwen/Qwen3-8B",
        "input": "Hello.", "stream": true, "enable_response_messages": true
      }'
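
To inspect the stream returned by the request above programmatically, the server-sent events can be split into (event type, payload) pairs. The following is a minimal sketch, assuming the server emits standard SSE frames made of `event:` and `data:` lines with a JSON payload and a blank line between frames. `iter_sse_events` is a hypothetical helper for illustration, not part of vLLM, and the sample frames are hand-written rather than captured output.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_type, payload) pairs from raw SSE lines.

    Assumes each frame has an optional `event:` line, one or more
    `data:` lines holding JSON, and ends with a blank line.
    """
    event_type = ""
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and data_parts:
            data = "\n".join(data_parts)
            current_type, event_type, data_parts = event_type, "", []
            if data == "[DONE]":  # sentinel used by some OpenAI-style streams
                continue
            payload = json.loads(data)
            # Fall back to the payload's own "type" field if no event line was sent.
            yield current_type or payload.get("type", ""), payload


# Hand-written sample frames (illustrative only).
sample = [
    "event: response.output_text.delta",
    'data: {"type": "response.output_text.delta", "delta": "Hi"}',
    "",
    'data: {"type": "response.output_text.done", "text": "Hi"}',
    "",
]
events = list(iter_sse_events(sample))
```

In practice the `lines` argument could come from any HTTP client that exposes the response body line by line; the event type names then make it easy to check which lifecycle events a given configuration emits.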

Test Result

https://gist.github.com/qandrew/ceff5bb4a0b36c6a62ee41d6df680d3f

Also passes a new logprob test in #37126

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the streaming parsing logic by moving it from serving.py into the parser module. This is a positive architectural change that centralizes parsing responsibilities. A new extract_streaming_delta method and a StreamingParseState class are introduced to handle this. However, I've identified a critical issue in the refactored logic where a portion of the streaming message could be lost during the transition from reasoning to tool-use parsing. I have provided a detailed comment and a suggested fix for this issue.

@qandrew qandrew marked this pull request as draft March 13, 2026 21:42
Signed-off-by: Andrew Xia <axia@meta.com>
@qandrew qandrew changed the title [responsesAPI] move streaming to parser [responsesAPI] move streaming logic to parser Mar 15, 2026
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: Andrew Xia <axia@meta.com>
@qandrew qandrew marked this pull request as ready for review March 15, 2026 22:10
@houseroad houseroad added the ready label Mar 15, 2026
@qandrew
Contributor Author

qandrew commented Mar 16, 2026

cc @chaunceyjiang @sfeng33 please take a look :)


# ========== Streaming Event Generation ==========

async def process_streaming_events(
Collaborator

LGTM.
non-blocking: I know this function was moved from serving.py. I feel the function is a bit too long and could probably be split into a few smaller functions.

Contributor Author

Yeah, that definitely makes sense! I can do that in a follow-up PR :)

Contributor

@sfeng33 sfeng33 Mar 20, 2026

@chaunceyjiang this method is Responses API specific. In the unified parser, the scope is processing the model's output and returning the content, reasoning, and tool calls to the API serving layer. In other words, I don't think this method or extract_response_outputs belongs in the parser's scope, wdyt?
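
The layering described here can be sketched as follows: the parser returns a plain, API-agnostic delta, and the serving layer decides what kind of Responses API event to build from it. `ParsedDelta` and `to_event_kind` are hypothetical names used only for illustration, and the returned labels are placeholders rather than real event types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsedDelta:
    """API-agnostic parser output (hypothetical shape)."""
    content: Optional[str] = None
    reasoning: Optional[str] = None
    tool_calls: list[dict] = field(default_factory=list)


def to_event_kind(delta: ParsedDelta) -> str:
    """Serving-layer decision: which kind of event to emit for a delta."""
    if delta.tool_calls:
        return "tool_call"
    if delta.reasoning is not None:
        return "reasoning"
    if delta.content is not None:
        return "text"
    return "none"


kinds = [
    to_event_kind(ParsedDelta(reasoning="thinking")),
    to_event_kind(ParsedDelta(content="Hello")),
    to_event_kind(ParsedDelta(tool_calls=[{"name": "lookup"}])),
]
```

With this split, anything specific to one API surface stays in the function that consumes `ParsedDelta`, while the parser itself never needs to know about event types.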

Contributor

@sfeng33 sfeng33 left a comment

When self.parser is None (no tool/reasoning parser configured), the old code (before PR) still emitted the full event lifecycle:

  • response.output_item.added
  • response.content_part.added
  • response.output_text.delta (per chunk)
  • response.output_text.done
  • response.content_part.done
  • response.output_item.done

The new fallback only emits bare ResponseTextDeltaEvent with hardcoded item_id="", output_index=0, content_index=0 — no start or done lifecycle events.

Is this expected?
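
For reference, the lifecycle listed above can be sketched as a generator that wraps the per-chunk deltas in the start and done events. This is a minimal illustration using plain dicts. The event type strings are the ones listed in this comment, while `text_event_lifecycle`, the default `item_id`, and the other payload fields are assumptions rather than vLLM's actual implementation.

```python
from typing import Iterable, Iterator


def text_event_lifecycle(
    chunks: Iterable[str], item_id: str = "msg_0"
) -> Iterator[dict]:
    """Wrap text chunks in the added/delta/done event sequence."""
    base = {"item_id": item_id, "output_index": 0}
    part = {**base, "content_index": 0}

    # Opening events: announce the output item, then its content part.
    yield {"type": "response.output_item.added", **base}
    yield {"type": "response.content_part.added", **part}

    # One delta event per streamed chunk.
    pieces: list[str] = []
    for chunk in chunks:
        pieces.append(chunk)
        yield {"type": "response.output_text.delta", "delta": chunk, **part}

    # Closing events, in the reverse order of the opening ones.
    yield {"type": "response.output_text.done", "text": "".join(pieces), **part}
    yield {"type": "response.content_part.done", **part}
    yield {"type": "response.output_item.done", **base}


all_events = list(text_event_lifecycle(["Hel", "lo"]))
event_types = [e["type"] for e in all_events]
```

A bare delta-only stream corresponds to keeping just the loop in the middle, which is why clients that wait for the added or done events would not see a complete item.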

Andrew Xia added 2 commits March 17, 2026 09:05
Signed-off-by: Andrew Xia <axia@fb.com>
@qandrew
Contributor Author

qandrew commented Mar 17, 2026

When self.parser is None (no tool/reasoning parser configured), the old code (before PR) still emitted the full event lifecycle:

  • response.output_item.added
  • response.content_part.added
  • response.output_text.delta (per chunk)
  • response.output_text.done
  • response.content_part.done
  • response.output_item.done

The new fallback only emits bare ResponseTextDeltaEvent with hardcoded item_id="", output_index=0, content_index=0 — no start or done lifecycle events.

Is this expected?

Thanks @sfeng33 for the catch! Ideally we never hit this code path because there should be a reasoning/tool parser for serving models; I updated the logic and added a unit test to prevent regressions.

@qandrew
Contributor Author

qandrew commented Mar 17, 2026

@sfeng33 @chaunceyjiang ready for re-review :)

@sfeng33
Contributor

sfeng33 commented Mar 17, 2026

Hey @qandrew, this PR is not a pure no-functional-change refactor. Since the change is quite large, could you, if possible, limit this PR to the non-functional relocation changes and leave the newly added logic to follow-up PRs so that it can be more thoroughly tested and reviewed?

For example, the extract_streaming_delta method adds new logic:

  • It introduces a new StreamingParseState dataclass with mutable per-request state
  • The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

The no-parser fallback path has also changed.
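
To make the bookkeeping under discussion concrete, here is a minimal sketch of mutable per-request streaming state. The field names follow identifiers that appear in this PR's snippets (`previous_text`, `previous_token_ids`, `reasoning_ended`, `tool_call_text_started`); the `advance` and `start_tool_phase` helpers are hypothetical, and the real `StreamingParseState` and `extract_streaming_delta` may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingParseState:
    """Mutable per-request state (illustrative; real fields may differ)."""
    previous_text: str = ""
    previous_token_ids: list[int] = field(default_factory=list)
    reasoning_ended: bool = False
    tool_call_text_started: bool = False


def advance(state: StreamingParseState, delta_text: str, delta_token_ids: list[int]):
    """Accumulate one chunk and return (current_text, current_token_ids)."""
    current_text = state.previous_text + delta_text
    current_token_ids = state.previous_token_ids + delta_token_ids
    state.previous_text = current_text
    state.previous_token_ids = current_token_ids
    return current_text, current_token_ids


def start_tool_phase(state: StreamingParseState) -> None:
    """Hypothetical reset at the reasoning-to-tool transition.

    Clearing the accumulated text means a tool parser only sees
    what was generated after reasoning ended.
    """
    state.reasoning_ended = True
    if not state.tool_call_text_started:
        state.tool_call_text_started = True
        state.previous_text = ""
        state.previous_token_ids = []


state = StreamingParseState()
advance(state, "<think>plan", [1, 2])
text_before, ids_before = advance(state, "</think>", [3])
start_tool_phase(state)
text_after, ids_after = advance(state, "<tool_call>", [4])
```

The question for review is whether a reset like the one in `start_tool_phase` already existed in the Responses API path or only in the chat completions path.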

Contributor Author

@qandrew qandrew left a comment

Hi @sfeng33, thanks for the feedback!

It introduces a new StreamingParseState dataclass with mutable per-request state

this is needed to keep the no-functional-change refactor

The reasoning-to-tool transition logic

I don't think I see any functional changes in this PR. I added comments on specific lines; please let me know which specific lines you see issues with.

The no-parser fallback path has also changed.

This logic did not change, as I added a unit test to guard the fact that no changes were made. If you'd prefer I can separate out the unit test to a different PR to make it more explicit.

We added the 'ready' tag in advance, and all CI tests pass, which shows that there are no functional changes in this PR.

delta_text = output.text
delta_token_ids = as_list(output.token_ids)
current_text = previous_text + delta_text
current_token_ids = previous_token_ids + delta_token_ids
Contributor Author

The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the old code

)

current_text = state.previous_text + delta_text
current_token_ids = state.previous_token_ids + delta_token_ids
Contributor Author

The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the new code, we can see that the bookkeeping didn't change

current_text = ""

if reasoning_ended:
    if not tool_call_text_started:
Contributor Author

The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the old code, we see the bookkeeping logic hasn't changed

else:
    current_text = ""

if state.reasoning_ended:
Contributor Author

The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the new code, we see the transition logic is the same

@mergify

mergify bot commented Mar 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @qandrew.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 19, 2026

Labels

frontend, needs-rebase, ready

4 participants