nick's streaming changes #1
Signed-off-by: Nick Hill <nickhill123@gmail.com>
```python
    NOTE: Make sure that prompt_token_ids is a copy of the original request's
    _all_token_ids. Since the scheduler updates _all_token_ids each iteration, the
    corresponding prompt_token_ids reference in NewRequestData will be mistakenly
    updated while decoding if we don't make a copy.
    """
    req_data = NewRequestData.from_request(request, block_ids, prefill_token_ids)
    if request.streaming_queue is not None:
        req_data.prompt_token_ids = request._all_token_ids.copy()
    return req_data
```
@njhill I think we still need to keep this functionality that makes `NewRequestData.prompt_token_ids` a copy of the request's `_all_token_ids`. Otherwise, while the model is decoding (RUNNING status, not being updated with the next streaming request), the `prompt_token_ids` field used in `GPUModelRunner` will get updated, because it is set with a reference to `request._all_token_ids`. This causes a bug: in `GPUModelRunner` the outputs are added to the req_data field, and they cause an issue if they also show up in the `prompt_token_ids` field.
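The aliasing problem described above can be sketched with plain Python lists. The names here are illustrative placeholders, not the actual vLLM classes:

```python
# Minimal sketch of the list-aliasing bug: if the "prompt" is a
# reference to the growing token list rather than a copy, it silently
# grows as tokens are decoded.

all_token_ids = [1, 2, 3]      # stands in for request._all_token_ids

# Without a copy, the "prompt" aliases the same list object.
prompt_token_ids = all_token_ids

# The scheduler appends a newly decoded token each iteration...
all_token_ids.append(4)

# ...and the "prompt" grows too, even though no new input arrived.
assert prompt_token_ids == [1, 2, 3, 4]

# With a copy, the prompt snapshot stays frozen as intended.
frozen_prompt = all_token_ids.copy()
all_token_ids.append(5)
assert frozen_prompt == [1, 2, 3, 4]
```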
Thanks @joshuadeng... I guess I'm not following this. This path is only for new requests (i.e. the first request of a "session"), and at that point `request.prompt_token_ids` should be a copy of `request._all_token_ids` anyhow: https://github.com/vllm-project/vllm/blob/378385b90cddbe8cbc6e51d4ed59ce83e499530a/vllm/v1/request.py#L91-L94.
> in GPUModelRunner the outputs will be added to req_data field, and cause an issue if also updated in the prompt_token_ids field
Could you elaborate on this? I just can't see the problematic sequence of events.
So the path that creates the NewRequestData object isn't just for new requests: requests that have a streaming update are treated like a "new" request, i.e. they will be in `scheduled_new_reqs`. The pseudo-new requests for existing sessions in `scheduled_new_reqs` are handled by `_update_streaming_request` in `GPUModelRunner`.

So basically, when we create `NewRequestData.prompt_token_ids` for a streaming request with a reference to `request._all_token_ids`, the scheduler will update `request._all_token_ids` during decoding, which in turn updates `NewRequestData.prompt_token_ids`. This leads to issues because in `GPUModelRunner` we then have a duplicate output token: once in `CachedRequestState.prompt_token_ids` (which is a reference to `NewRequestData.prompt_token_ids`) and once in `CachedRequestState.output_token_ids`, which affects how we construct the input batch.
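The duplicate-token symptom described above can be illustrated with a toy dataclass. This is a hypothetical stand-in, not the real `CachedRequestState`:

```python
# Sketch of the duplicate-token issue: the "prompt" aliases the
# scheduler's growing token list, so a decoded token appears both
# there and in the output list when the input batch is assembled.
from dataclasses import dataclass, field


@dataclass
class CachedState:  # illustrative, not the real CachedRequestState
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)

    def all_tokens(self) -> list[int]:
        # How the input batch would see the full sequence.
        return self.prompt_token_ids + self.output_token_ids


all_token_ids = [10, 11]                      # request._all_token_ids
state = CachedState(prompt_token_ids=all_token_ids)  # alias, no copy!

# One token is decoded: the runner records it as output, while the
# scheduler also appends it to _all_token_ids.
state.output_token_ids.append(12)
all_token_ids.append(12)

# The decoded token now appears twice in the assembled sequence.
assert state.all_tokens() == [10, 11, 12, 12]
```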
> when we create NewRequestData.prompt_token_ids with a reference to request._all_token_ids
I don't see where this happens.
`NewRequestData.prompt_token_ids` comes from `Request.prompt_token_ids`: https://github.com/vllm-project/vllm/blob/dc917cceb877dfd13f98c538c4c96158047d98bd/vllm/v1/core/sched/output.py#L57

`Request._all_token_ids` is originally created as a copy of `Request.prompt_token_ids` (it's a different object): https://github.com/vllm-project/vllm/blob/dc917cceb877dfd13f98c538c4c96158047d98bd/vllm/v1/request.py#L91-L94
So the _all_token_ids list does not propagate to GPUModelRunner at all?
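The initialization pattern being linked to can be sketched roughly like this (a simplified illustration, not the actual `vllm/v1/request.py` code):

```python
# Simplified sketch: _all_token_ids starts life as a *copy* of
# prompt_token_ids, so the two are distinct list objects and
# appending decoded tokens to one does not affect the other.
class Request:  # illustrative stand-in for vllm.v1.request.Request
    def __init__(self, prompt_token_ids: list[int]) -> None:
        self.prompt_token_ids = prompt_token_ids
        # .copy() makes them separate objects from the start.
        self._all_token_ids = self.prompt_token_ids.copy()


req = Request([7, 8, 9])
req._all_token_ids.append(10)          # a decoded token is recorded

assert req.prompt_token_ids == [7, 8, 9]        # prompt untouched
assert req._all_token_ids is not req.prompt_token_ids
```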
Without the streaming changes, `_all_token_ids` doesn't propagate to `GPUModelRunner`. For streaming we do it in `_make_new_request_data` with `req_data.prompt_token_ids = request._all_token_ids.copy()`.
This is necessary for updating streaming sessions, because they are treated as `scheduled_new_reqs`, so we need to pass in all tokens as the "prompt": the prompt for an updated streaming session should include all past output tokens.
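In other words, the "prompt" snapshot taken for a session update is the full token history so far, copied so later scheduler mutations don't leak in. A hedged sketch with a hypothetical helper name:

```python
# Hypothetical helper illustrating the snapshot-on-update idea:
# the updated session's "prompt" is the whole history (original
# prompt plus any generated tokens), captured as a copy.
def snapshot_prompt_for_update(all_token_ids: list[int]) -> list[int]:
    # Copy so subsequent appends by the scheduler don't mutate
    # the NewRequestData-style prompt snapshot.
    return all_token_ids.copy()


history = [1, 2, 3, 4]                 # prompt + outputs so far
prompt = snapshot_prompt_for_update(history)

history.append(5)                      # scheduler keeps appending
assert prompt == [1, 2, 3, 4]          # snapshot is unaffected
```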
> so we need to pass in all tokens as the "prompt", because the prompt for an updated streaming session should include all past output tokens
Ah, I guess this is one thing I may have misunderstood. I thought that output tokens were returned to the user but then essentially discarded for the purpose of ongoing generation. The logic as it is now uses just the prompts, so each sub-request prompt is the cumulative concatenation of the prompts so far.
I thought for your current max_tokens=1 case you do discard the single generated token. Or should that single token also be sandwiched between consecutive input prompts for the next full prompt?
@joshuadeng per our discussion I've pushed another commit, but moved to a new branch re-based on yours, since yours was updated with latest main branch: #2
and don't support prompt_embeds with input streaming for now

Signed-off-by: Nick Hill <nickhill123@gmail.com>