[responsesAPI] move streaming logic to parser by qandrew · Pull Request #37007 · vllm-project/vllm

qandrew · 2026-03-13T20:38:48Z

Purpose

Similar to #33281, this PR moves all of the Responses API streaming logic into DelegatingParser. There are no behavioral changes in this PR. However, it makes model-specific behavior possible for Responses API streaming (e.g., in ResponseReasoningDeltaEvent, Kimi might want to output additional metadata indicating that the role is assistant).

Implements streaming logic for #32713

Test Plan

vllm serve Qwen/Qwen3-8B --reasoning-parser qwen3 --tool-call-parser qwen3

curl -X POST "http://localhost:8000/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-api-key" \
  -d '{
        "model": "Qwen/Qwen3-8B",
        "input": "Hello.", "stream": true, "enable_response_messages": true
      }'
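
To inspect the stream from the request above in a script rather than by eye, the response body has to be split into individual events. A minimal sketch, assuming the endpoint uses standard server-sent-event framing (`event:` and `data:` fields, blank line between events) with JSON payloads; `iter_sse_events` is a hypothetical helper name and is not part of vLLM:

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per server-sent event from an iterable of text lines.

    Illustrative helper (not vLLM API). Assumes `event:` / `data:` fields
    and a blank line terminating each event.
    """
    event_type = None
    data_parts: list[str] = []

    def flush() -> Iterator[dict]:
        payload = "\n".join(data_parts)
        # Some OpenAI-compatible servers end a stream with a `[DONE]`
        # sentinel; skip it if it shows up (assumption, not verified here).
        if payload and payload != "[DONE]":
            yield {"event": event_type, "data": json.loads(payload)}

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line closes the current event.
            yield from flush()
            event_type, data_parts = None, []
        elif line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
    # Flush a trailing event that was not followed by a blank line.
    yield from flush()


# Hand-written sample lines standing in for a real response body.
sample = [
    "event: response.output_text.delta",
    'data: {"type": "response.output_text.delta", "delta": "Hi"}',
    "",
    "event: response.output_text.done",
    'data: {"type": "response.output_text.done", "text": "Hi"}',
    "",
]
events = list(iter_sse_events(sample))
```

Any HTTP client that exposes the body as decoded text lines can feed this function; the sample list here only stands in for such a stream.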

Test Result

https://gist.github.com/qandrew/ceff5bb4a0b36c6a62ee41d6df680d3f

Also passes a new logprob test in #37126

gemini-code-assist

Code Review

This pull request refactors the streaming parsing logic by moving it from serving.py into the parser module. This is a positive architectural change that centralizes parsing responsibilities. A new extract_streaming_delta method and a StreamingParseState class are introduced to handle this. However, I've identified a critical issue in the refactored logic where a portion of the streaming message could be lost during the transition from reasoning to tool-use parsing. I have provided a detailed comment and a suggested fix for this issue.

vllm/parser/abstract_parser.py

Signed-off-by: Andrew Xia <axia@meta.com>

qandrew · 2026-03-16T04:11:18Z

cc @chaunceyjiang @sfeng33 please take a look :)

chaunceyjiang · 2026-03-16T08:30:34Z

vllm/parser/abstract_parser.py

+
+    # ========== Streaming Event Generation ==========
+
+    async def process_streaming_events(


LGTM.
non-blocking: I know this function was moved from serving.py. I feel the function is a bit too long and could probably be split into a few smaller functions.

yeah definitely makes sense! I can do that in a follow up PR :)

@chaunceyjiang this method is Responses API specific. In the unified parser, the scope is to process the model's output and return the content, reasoning, and tool calls to the API serving layer. In other words, I think this method, as well as extract_response_outputs, doesn't belong in the parser's scope. wdyt?

sfeng33

When self.parser is None (no tool/reasoning parser configured), the old code (before PR) still emitted the full event lifecycle:

response.output_item.added
response.content_part.added
response.output_text.delta (per chunk)
response.output_text.done
response.content_part.done
response.output_item.done

The new fallback only emits bare ResponseTextDeltaEvent with hardcoded item_id="", output_index=0, content_index=0 — no start or done lifecycle events.

Is this expected?
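
For reference, the lifecycle listed above can be sketched as a generator that brackets the per-chunk deltas with the start and done events. This is an illustration only: plain dicts stand in for the typed event objects, `plain_text_lifecycle` is a hypothetical name, and fields such as item_id, output_index, and content_index are omitted.

```python
from typing import Iterable, Iterator


def plain_text_lifecycle(chunks: Iterable[str]) -> Iterator[dict]:
    """Emit the six-step text lifecycle for a stream of plain-text chunks."""
    # Opening events: the output item, then its content part.
    yield {"type": "response.output_item.added"}
    yield {"type": "response.content_part.added"}

    text = ""
    for chunk in chunks:
        text += chunk
        # One delta event per generated chunk.
        yield {"type": "response.output_text.delta", "delta": chunk}

    # Closing events, in reverse nesting order, carrying the full text.
    yield {"type": "response.output_text.done", "text": text}
    yield {"type": "response.content_part.done"}
    yield {"type": "response.output_item.done"}


event_types = [e["type"] for e in plain_text_lifecycle(["Hel", "lo"])]
```

A fallback that yields only the delta events would produce just the middle of this sequence, which is the difference being asked about.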

Signed-off-by: Andrew Xia <axia@fb.com>

qandrew · 2026-03-17T17:35:16Z

When self.parser is None (no tool/reasoning parser configured), the old code (before PR) still emitted the full event lifecycle:

response.output_item.added

response.content_part.added

response.output_text.delta (per chunk)

response.output_text.done

response.content_part.done

response.output_item.done

The new fallback only emits bare ResponseTextDeltaEvent with hardcoded item_id="", output_index=0, content_index=0 — no start or done lifecycle events.

Is this expected?

Thanks @sfeng33 for the catch! Ideally we never hit this code path, because there should be a reasoning/tool parser for serving models. I updated the logic and added a unit test to prevent regressions.
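
A regression check for this path could look roughly like the following. It is a sketch of the idea, not the unit test added in the PR: `has_full_lifecycle` is a hypothetical helper that takes the list of emitted event type strings and verifies that the deltas are bracketed by the start and done events.

```python
EXPECTED_OPEN = [
    "response.output_item.added",
    "response.content_part.added",
]
EXPECTED_CLOSE = [
    "response.output_text.done",
    "response.content_part.done",
    "response.output_item.done",
]
DELTA = "response.output_text.delta"


def has_full_lifecycle(event_types: list[str]) -> bool:
    """Return True if the deltas are wrapped by the start and done events."""
    n_open, n_close = len(EXPECTED_OPEN), len(EXPECTED_CLOSE)
    if len(event_types) < n_open + n_close:
        return False
    middle = event_types[n_open:len(event_types) - n_close]
    return (
        event_types[:n_open] == EXPECTED_OPEN
        and event_types[-n_close:] == EXPECTED_CLOSE
        and all(t == DELTA for t in middle)
    )


# A stream of bare deltas fails the check; a bracketed stream passes.
bare = [DELTA, DELTA]
full = EXPECTED_OPEN + [DELTA, DELTA] + EXPECTED_CLOSE
```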

qandrew · 2026-03-17T21:28:08Z

@sfeng33 @chaunceyjiang ready for re-review :)

sfeng33 · 2026-03-17T21:45:21Z

Hey @qandrew, this PR is not a pure no-functional-change refactor. Since the change is quite large, could you, if possible, keep this PR to the non-functional relocation changes and leave the newly added logic to follow-up PRs, so that it can be more thoroughly tested and reviewed?

For example, the extract_streaming_delta method adds new logic:

It introduces a new StreamingParseState dataclass with mutable per-request state
The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

The no-parser fallback path has also changed.
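
To make the "mutable per-request state" point concrete, here is a minimal sketch of what such a state object and its accumulation step might look like. The field names previous_text, previous_token_ids, and reasoning_ended appear in this thread; the class name `ParseStateSketch`, the `accumulate` helper, and the exact layout are assumptions, and the real StreamingParseState may hold more fields.

```python
from dataclasses import dataclass, field


@dataclass
class ParseStateSketch:
    """Stand-in for per-request streaming state (illustrative only)."""

    previous_text: str = ""
    previous_token_ids: list[int] = field(default_factory=list)
    reasoning_ended: bool = False


def accumulate(
    state: ParseStateSketch, delta_text: str, delta_token_ids: list[int]
) -> tuple[str, list[int]]:
    """Fold one streamed delta into the running text and token ids."""
    current_text = state.previous_text + delta_text
    current_token_ids = state.previous_token_ids + delta_token_ids
    # Mutating the state here is what makes it per-request bookkeeping:
    # the next call sees everything streamed so far.
    state.previous_text = current_text
    state.previous_token_ids = current_token_ids
    return current_text, current_token_ids


state = ParseStateSketch()
accumulate(state, "Hel", [1])
text, ids = accumulate(state, "lo", [2, 3])
```

One instance per request keeps concurrent streams from sharing buffers, which is why the default list uses `default_factory` rather than a shared mutable default.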

qandrew

Hi @sfeng33 , thanks for the feedback!

It introduces a new StreamingParseState dataclass with mutable per-request state

This is needed to keep the refactor free of functional changes.

The reasoning-to-tool transition logic

I don't think I see any functional changes in this PR. I added comments on specific lines; please let me know which specific lines you see issues with.

The no-parser fallback path has also changed.

This logic did not change, as I added a unit test to guard the fact that no changes were made. If you'd prefer I can separate out the unit test to a different PR to make it more explicit.

We added the 'ready' tag in advance, and since all CI tests pass, it shows that there are no functional changes in this PR.

qandrew · 2026-03-17T22:30:57Z

vllm/entrypoints/openai/responses/serving.py

-                delta_text = output.text
-                delta_token_ids = as_list(output.token_ids)
-                current_text = previous_text + delta_text
-                current_token_ids = previous_token_ids + delta_token_ids


The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the old code

qandrew · 2026-03-17T22:31:45Z

vllm/parser/abstract_parser.py

+            )
+
+        current_text = state.previous_text + delta_text
+        current_token_ids = state.previous_token_ids + delta_token_ids


The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the new code; we can see that the bookkeeping didn't change

qandrew · 2026-03-17T22:32:53Z

vllm/entrypoints/openai/responses/serving.py

-                                current_text = ""
-
-                    if reasoning_ended:
-                        if not tool_call_text_started:


The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the old code, we see the bookkeeping logic hasn't changed

qandrew · 2026-03-17T22:33:15Z

vllm/parser/abstract_parser.py

+                    else:
+                        current_text = ""
+
+            if state.reasoning_ended:


The reasoning-to-tool transition logic in extract_streaming_delta resets previous_text/previous_token_ids on transition — this is new bookkeeping that wasn't in the old code's streaming path for the Responses API (it was in the chat completions path)

@sfeng33 here is the new code, we see the logic transition is the same
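
The transition being compared in these snippets can be sketched in isolation. This is a simplified illustration, not the code in abstract_parser.py: it assumes the end-of-reasoning marker arrives as its own delta (the `"</think>"` string is a hypothetical stand-in), and `TransitionState` and `step` are made-up names. It shows only the reset of the accumulated text when reasoning ends.

```python
from dataclasses import dataclass

END_MARKER = "</think>"  # hypothetical end-of-reasoning marker


@dataclass
class TransitionState:
    previous_text: str = ""
    reasoning_ended: bool = False


def step(state: TransitionState, delta_text: str) -> tuple[str, str]:
    """Process one delta and return (phase, text_to_emit)."""
    if not state.reasoning_ended:
        if delta_text == END_MARKER:
            # Reasoning is over: flip the flag and clear the buffer so the
            # next phase starts parsing from an empty text.
            state.reasoning_ended = True
            state.previous_text = ""
            return ("transition", "")
        state.previous_text += delta_text
        return ("reasoning", delta_text)
    # After the transition, text accumulates from the cleared buffer.
    state.previous_text += delta_text
    return ("content", delta_text)


state = TransitionState()
phases = [step(state, d)[0] for d in ["think", "ing", END_MARKER, "call"]]
```

A marker that shares a delta with other text would need extra handling to avoid dropping the remainder; that case is left out of this sketch.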

mergify · 2026-03-19T10:37:47Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @qandrew.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

qandrew requested review from DarkLight1337, NickLucche, aarnphm, chaunceyjiang, njhill, robertgshaw2-redhat and russellb as code owners March 13, 2026 20:38

mergify bot added the frontend label Mar 13, 2026

gemini-code-assist bot reviewed Mar 13, 2026

vllm/parser/abstract_parser.py (resolved)

qandrew marked this pull request as draft March 13, 2026 21:42

initial commit move streaming to parser

d48335c

Signed-off-by: Andrew Xia <axia@meta.com>

qandrew changed the title ~~[responsesAPI] move streaming to parser~~ [responsesAPI] move streaming logic to parser Mar 15, 2026

move more logic into parser

16fb5f0

Signed-off-by: Andrew Xia <axia@meta.com>

qandrew force-pushed the parser-streaming branch from a017fd8 to 16fb5f0 Compare March 15, 2026 21:47

fix...

847ba47

Signed-off-by: Andrew Xia <axia@meta.com>

qandrew marked this pull request as ready for review March 15, 2026 22:10

houseroad added the ready label Mar 15, 2026

qandrew mentioned this pull request Mar 15, 2026

[responsesAPI][ez] add a unit test for SimpleContext logprobs #37126

Merged

chaunceyjiang reviewed Mar 16, 2026

sfeng33 reviewed Mar 16, 2026

Andrew Xia added 2 commits March 17, 2026 09:05

Merge branch 'main' into parser-streaming

9cc2b6b

flora comment

2b287ae

Signed-off-by: Andrew Xia <axia@fb.com>

qandrew commented Mar 17, 2026

qandrew requested review from chaunceyjiang and sfeng33 March 18, 2026 05:09

mergify bot added the needs-rebase label Mar 19, 2026

