nick's streaming changes #1
Signed-off-by: Nick Hill <nickhill123@gmail.com>
```python
    NOTE: Make sure that prompt_token_ids is a copy of the original request's
    _all_token_ids. Since the scheduler updates _all_token_ids each iteration, the
    corresponding prompt_token_ids reference in NewRequestData will be mistakenly
    updated while decoding if we don't make a copy.
    """
    req_data = NewRequestData.from_request(request, block_ids, prefill_token_ids)
    if request.streaming_queue is not None:
        req_data.prompt_token_ids = request._all_token_ids.copy()
    return req_data
```
@njhill I think we still need to keep this functionality that makes `NewRequestData.prompt_token_ids` a copy of the request's `_all_token_ids`. Otherwise, while the model is decoding (RUNNING status, not being updated with the next streaming request), the `prompt_token_ids` field used in `GPUModelRunner` will get updated, because it is set with a reference to `request._all_token_ids`. This causes a bug: in `GPUModelRunner` the outputs are added to the req_data field, and they cause an issue if they also show up in the `prompt_token_ids` field.
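The aliasing problem described above can be sketched with plain Python lists. The names here are illustrative placeholders, not the actual vLLM classes:

```python
# Minimal sketch of the list-aliasing bug: if the "prompt" is a
# reference to the growing token list rather than a copy, it silently
# grows as tokens are decoded.

all_token_ids = [1, 2, 3]      # stands in for request._all_token_ids

# Without a copy, the "prompt" aliases the same list object.
prompt_token_ids = all_token_ids

# The scheduler appends a newly decoded token each iteration...
all_token_ids.append(4)

# ...and the "prompt" grows too, even though no new input arrived.
assert prompt_token_ids == [1, 2, 3, 4]

# With a copy, the prompt snapshot stays frozen as intended.
frozen_prompt = all_token_ids.copy()
all_token_ids.append(5)
assert frozen_prompt == [1, 2, 3, 4]
```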
Thanks @joshuadeng... I guess I'm not following this. This path is only for new requests (i.e. the first request of a "session"), and at that point `request.prompt_token_ids` should be a copy of `request._all_token_ids` anyhow: https://github.com/vllm-project/vllm/blob/378385b90cddbe8cbc6e51d4ed59ce83e499530a/vllm/v1/request.py#L91-L94.
> in GPUModelRunner the outputs will be added to req_data field, and cause an issue if also updated in the prompt_token_ids field
Could you elaborate on this? I just can't see the problematic sequence of events.
So the path that creates the NewRequestData object isn't just for new requests: requests that have a streaming update are treated like a "new" request, i.e. they will be in `scheduled_new_reqs`. The pseudo-new requests for existing sessions in `scheduled_new_reqs` are handled by `_update_streaming_request` in `GPUModelRunner`.

So basically, when we create `NewRequestData.prompt_token_ids` for a streaming request with a reference to `request._all_token_ids`, the scheduler will update `request._all_token_ids` during decoding, which in turn updates `NewRequestData.prompt_token_ids`. This leads to issues because in `GPUModelRunner` we then have a duplicate output token: once in `CachedRequestState.prompt_token_ids` (which is a reference to `NewRequestData.prompt_token_ids`) and once in `CachedRequestState.output_token_ids`, which affects how we construct the input batch.
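The duplicate-token symptom described above can be illustrated with a toy dataclass. This is a hypothetical stand-in, not the real `CachedRequestState`:

```python
# Sketch of the duplicate-token issue: the "prompt" aliases the
# scheduler's growing token list, so a decoded token appears both
# there and in the output list when the input batch is assembled.
from dataclasses import dataclass, field


@dataclass
class CachedState:  # illustrative, not the real CachedRequestState
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)

    def all_tokens(self) -> list[int]:
        # How the input batch would see the full sequence.
        return self.prompt_token_ids + self.output_token_ids


all_token_ids = [10, 11]                      # request._all_token_ids
state = CachedState(prompt_token_ids=all_token_ids)  # alias, no copy!

# One token is decoded: the runner records it as output, while the
# scheduler also appends it to _all_token_ids.
state.output_token_ids.append(12)
all_token_ids.append(12)

# The decoded token now appears twice in the assembled sequence.
assert state.all_tokens() == [10, 11, 12, 12]
```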
> when we create NewRequestData.prompt_token_ids with a reference to request._all_token_ids
I don't see where this happens.
`NewRequestData.prompt_token_ids` comes from `Request.prompt_token_ids`: https://github.com/vllm-project/vllm/blob/dc917cceb877dfd13f98c538c4c96158047d98bd/vllm/v1/core/sched/output.py#L57

`Request._all_token_ids` is originally created as a copy of `Request.prompt_token_ids` (it's a different object): https://github.com/vllm-project/vllm/blob/dc917cceb877dfd13f98c538c4c96158047d98bd/vllm/v1/request.py#L91-L94
So the _all_token_ids list does not propagate to GPUModelRunner at all?
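The initialization pattern being linked to can be sketched roughly like this (a simplified illustration, not the actual `vllm/v1/request.py` code):

```python
# Simplified sketch: _all_token_ids starts life as a *copy* of
# prompt_token_ids, so the two are distinct list objects and
# appending decoded tokens to one does not affect the other.
class Request:  # illustrative stand-in for vllm.v1.request.Request
    def __init__(self, prompt_token_ids: list[int]) -> None:
        self.prompt_token_ids = prompt_token_ids
        # .copy() makes them separate objects from the start.
        self._all_token_ids = self.prompt_token_ids.copy()


req = Request([7, 8, 9])
req._all_token_ids.append(10)          # a decoded token is recorded

assert req.prompt_token_ids == [7, 8, 9]        # prompt untouched
assert req._all_token_ids is not req.prompt_token_ids
```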
Without the streaming changes, `_all_token_ids` doesn't propagate to `GPUModelRunner`. For streaming we do it in `_make_new_request_data` with `req_data.prompt_token_ids = request._all_token_ids.copy()`.
This is necessary for updating streaming sessions, because they are treated as `scheduled_new_reqs`, so we need to pass in all tokens as the "prompt": the prompt for an updated streaming session should include all past output tokens.
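In other words, the "prompt" snapshot taken for a session update is the full token history so far, copied so later scheduler mutations don't leak in. A hedged sketch with a hypothetical helper name:

```python
# Hypothetical helper illustrating the snapshot-on-update idea:
# the updated session's "prompt" is the whole history (original
# prompt plus any generated tokens), captured as a copy.
def snapshot_prompt_for_update(all_token_ids: list[int]) -> list[int]:
    # Copy so subsequent appends by the scheduler don't mutate
    # the NewRequestData-style prompt snapshot.
    return all_token_ids.copy()


history = [1, 2, 3, 4]                 # prompt + outputs so far
prompt = snapshot_prompt_for_update(history)

history.append(5)                      # scheduler keeps appending
assert prompt == [1, 2, 3, 4]          # snapshot is unaffected
```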
> so we need to pass in all tokens as the "prompt", because the prompt for an updated streaming session should include all past output tokens
Ah, I guess this is one thing I may have misunderstood. I thought that output tokens were returned to the user but then essentially discarded for the purpose of ongoing generation. The logic as it is now uses just the prompts, so each sub-request prompt is the cumulative concatenation of the prompts so far.
I thought for your current max_tokens=1 case you do discard the single generated token. Or should that single token also be sandwiched between consecutive input prompts for the next full prompt?
@joshuadeng per our discussion I've pushed another commit, but moved to a new branch re-based on yours, since yours was updated with latest main branch: #2
and don't support prompt_embeds with input streaming for now

Signed-off-by: Nick Hill <nickhill123@gmail.com>