nick's streaming changes#1

Open

joshuadeng wants to merge 5 commits into joshuadeng:streaming_support from njhill:streaming_support_nick1

Conversation

@joshuadeng (Owner):

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Nick Hill <nickhill123@gmail.com>
Comment on lines -936 to -944
    NOTE: Make sure that prompt_token_ids is a copy of the original request's
    _all_token_ids. Since the scheduler updates _all_token_ids each iteration, the
    corresponding prompt_token_ids reference in NewRequestData will be mistakenly
    updated while decoding if we don't make a copy.
    """
    req_data = NewRequestData.from_request(request, block_ids, prefill_token_ids)
    if request.streaming_queue is not None:
        req_data.prompt_token_ids = request._all_token_ids.copy()
    return req_data
@joshuadeng (Owner, Author):

@njhill I think we still need to keep this functionality that makes NewRequestData.prompt_token_ids a copy of the request's _all_token_ids. Otherwise, while the model is decoding (RUNNING status, i.e. not being updated with the next streaming request), the prompt_token_ids field used in GPUModelRunner will get updated, because it is set to a reference to request._all_token_ids. This causes a bug: in GPUModelRunner the outputs are added to the request's output field, which causes an issue if those same tokens also appear via the mutated prompt_token_ids field.
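The aliasing hazard described above can be sketched in a few lines. This is a simplified illustration, not vLLM's actual classes (the `Request` and `NewRequestData` stand-ins here are toy versions with hypothetical constructors):

```python
# Toy versions of the two objects involved; only the token-list fields matter.
class Request:
    def __init__(self, prompt_token_ids):
        # Running list the scheduler appends sampled tokens to each step.
        self._all_token_ids = list(prompt_token_ids)

class NewRequestData:
    def __init__(self, prompt_token_ids):
        self.prompt_token_ids = prompt_token_ids

req = Request([1, 2, 3])

# Buggy: shares the underlying list object with the request.
aliased = NewRequestData(req._all_token_ids)
# Safe: snapshot taken at scheduling time, as the PR's code does with .copy().
copied = NewRequestData(req._all_token_ids.copy())

# Scheduler appends a sampled token during a decode step.
req._all_token_ids.append(42)

assert aliased.prompt_token_ids == [1, 2, 3, 42]  # "prompt" silently mutated
assert copied.prompt_token_ids == [1, 2, 3]       # snapshot preserved
```

The `.copy()` decouples the model runner's view of the prompt from the scheduler's running token list.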

@njhill:

Thanks @joshuadeng... I guess I'm not following this. This path is only for new requests (i.e. the first request of a "session"), and at that point request.prompt_token_ids should be a copy of request._all_token_ids anyhow: https://github.com/vllm-project/vllm/blob/378385b90cddbe8cbc6e51d4ed59ce83e499530a/vllm/v1/request.py#L91-L94.

> in GPUModelRunner the outputs will be added to req_data field, and cause an issue if also updated in the prompt_token_ids field

Could you elaborate on this? I just can't see the problematic sequence of events.

@joshuadeng (Owner, Author):

So the path that creates the NewRequestData object isn't just for new requests: requests that have a streaming update are treated like a "new" request, i.e. they will be in scheduled_new_reqs. The pseudo-new requests for existing sessions in scheduled_new_reqs are handled by _update_streaming_request in GPUModelRunner.

So basically, for streaming requests, if we create NewRequestData.prompt_token_ids with a reference to request._all_token_ids, then while decoding the scheduler will update request._all_token_ids, which in turn updates NewRequestData.prompt_token_ids. This leads to issues because in GPUModelRunner we then have a duplicate output token in CachedRequestState.prompt_token_ids (which is a reference to NewRequestData.prompt_token_ids) and in CachedRequestState.output_token_ids, which affects how we construct the input batch.
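The duplicate-token failure mode can be sketched with plain lists (simplified, hypothetical variable names standing in for the CachedRequestState fields; this is not vLLM's actual batch-construction code):

```python
# Scheduler-side running token list for the request.
all_token_ids = [10, 11, 12]

# BUG: the runner's prompt view is a reference, not a copy.
prompt_token_ids = all_token_ids
# Runner-side list of generated tokens.
output_token_ids = []

# One decode step: the scheduler appends the sampled token to its running
# list, and the runner records the same token as an output.
sampled = 99
all_token_ids.append(sampled)
output_token_ids.append(sampled)

# Conceptually the input batch is prompt tokens followed by output tokens,
# so the sampled token now appears twice.
batch_tokens = prompt_token_ids + output_token_ids
assert batch_tokens == [10, 11, 12, 99, 99]  # 99 duplicated

# With a snapshot taken at scheduling time, the batch is correct.
prompt_snapshot = [10, 11, 12]
assert prompt_snapshot + output_token_ids == [10, 11, 12, 99]
```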

@njhill:

> when we create NewRequestData.prompt_token_ids with a reference to request._all_token_ids

I don't see where this happens.

So the _all_token_ids list does not propagate to GPUModelRunner at all?

@joshuadeng (Owner, Author):

Without the streaming changes, _all_token_ids doesn't propagate to GPUModelRunner. For streaming we do it in `_make_new_request_data` with `req_data.prompt_token_ids = request._all_token_ids.copy()`.

This is necessary for updating streaming sessions: because they are treated as scheduled_new_reqs, we need to pass in all tokens as the "prompt", since the prompt for an updated streaming session should include all past output tokens.
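A minimal sketch of that prompt construction, assuming a hypothetical helper (`make_streaming_prompt` is not vLLM's API, just an illustration of the described behavior):

```python
def make_streaming_prompt(all_token_ids, new_input_ids):
    """Effective prompt for a rescheduled streaming session: everything the
    model has seen or generated so far, plus the newly streamed input.
    Snapshot with .copy() so later scheduler appends can't mutate it."""
    return all_token_ids.copy() + list(new_input_ids)

all_token_ids = [1, 2, 3]   # original prompt tokens
all_token_ids += [7]        # token generated in the previous turn

# New streamed input [4, 5] arrives; the session is treated as a "new"
# request whose prompt covers prompt + past outputs + new input.
prompt = make_streaming_prompt(all_token_ids, [4, 5])
assert prompt == [1, 2, 3, 7, 4, 5]
```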

@njhill:

> so we need to pass in all tokens as the "prompt", because the prompt for an updated streaming session should include all past output tokens

Ah, I guess this is one thing I may have misunderstood. I thought that output tokens were returned to the user but then essentially discarded for the purpose of ongoing generation. The logic as it is now uses just the prompts, so each sub-request's prompt is the cumulative concatenation of the prompts so far.

I thought for your current max_tokens=1 case you do discard the single generated token. Or should that single token also be sandwiched between consecutive input prompts for the next full prompt?

@njhill:

@joshuadeng per our discussion I've pushed another commit, but moved to a new branch re-based on yours, since yours was updated with latest main branch: #2

and don't support prompt_embeds with input streaming for now

Signed-off-by: Nick Hill <nickhill123@gmail.com>
@njhill njhill mentioned this pull request Jan 23, 2026

Labels: none yet
Projects: none yet
Participants: 2