V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory #39646
Zhutianyi7230 wants to merge 6 commits into vllm-project:releases/v0.12.0
Conversation
Code Review
This pull request implements a delayed split strategy for parallel sampling in vLLM v1, allowing a single prefill to serve multiple decoding child requests. This optimization supports both PD disaggregation and colocated deployment modes. Feedback highlights several critical issues: a memory leak in the scheduler where child requests are not properly freed, a logic error where child requests fail to inherit the first generated token from the parent, and a runtime AttributeError caused by using 'append' instead of 'add_request' on the scheduler's waiting queue.
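For context, the delayed-split idea can be modeled in plain Python. This is an illustrative sketch, not the actual vLLM classes: the `Request` dataclass and `fork_children` helper are hypothetical stand-ins. One prefill produces the first sampled token, and only then is the request split into n children that each continue decoding from that shared state, instead of running n redundant prefills.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    request_id: str
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)

def fork_children(parent: Request, n: int) -> list[Request]:
    """After the parent's single prefill, fork n child requests that
    all inherit the parent's token state (prompt + first token)."""
    return [
        Request(
            request_id=f"{parent.request_id}-{i}",
            prompt_token_ids=list(parent.prompt_token_ids),
            output_token_ids=list(parent.output_token_ids),
        )
        for i in range(n)
    ]

# One prefill: the parent samples its first token exactly once.
parent = Request("req-0", prompt_token_ids=[1, 2, 3])
parent.output_token_ids.append(42)  # first sampled token

# Delayed split: children are created only after the prefill completes.
children = fork_children(parent, n=4)
assert all(c.output_token_ids == [42] for c in children)
```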
```python
self.finished_req_ids.add(child_request.request_id)
if self.finished_req_ids_dict is not None:
    self.finished_req_ids_dict[child_request.client_index].add(
        child_request.request_id)
```
In the PD disaggregation scenario, child requests are added to the scheduler's internal self.requests map during the split, but they are never removed. Simply adding them to finished_req_ids is insufficient to prevent a memory leak in the scheduler. You should call self._free_request(child_request) instead, which correctly handles the cleanup of the request from the scheduler's state and notifies the workers.
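To illustrate the leak with a toy scheduler (a hypothetical, heavily simplified stand-in for the real `Scheduler`): every request lives in a `requests` dict, so adding an id to `finished_req_ids` without also popping the dict entry keeps the request object alive forever, while a `_free_request`-style helper does both.

```python
class ToyScheduler:
    def __init__(self):
        self.requests = {}        # request_id -> request object
        self.finished_req_ids = set()

    def add(self, request_id, request):
        self.requests[request_id] = request

    def finish_leaky(self, request_id):
        # Only marks finished: the entry in self.requests is never freed.
        self.finished_req_ids.add(request_id)

    def free_request(self, request_id):
        # Marks finished AND removes the request from scheduler state.
        self.finished_req_ids.add(request_id)
        self.requests.pop(request_id, None)

sched = ToyScheduler()
sched.add("child-1", object())
sched.finish_leaky("child-1")
assert "child-1" in sched.requests        # leaked: still referenced
sched.free_request("child-1")
assert "child-1" not in sched.requests    # properly cleaned up
```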
```python
# Mark child request as finished and clean up.
self._free_request(child_request)
```

```python
child_request._output_token_ids = []
child_request._all_token_ids = list(child_request.prompt_token_ids)
```
In the PD colocated scenario, child requests must inherit the first token generated by the parent's prefill phase. By initializing _output_token_ids as an empty list and _all_token_ids with only the prompt tokens, the child requests will lose the first generated token. This will lead to incorrect results, as subsequent decode steps will be missing the first token in their sequence history, affecting detokenization and stop-condition checks. They should copy the parent's current token state.
```diff
- child_request._output_token_ids = []
- child_request._all_token_ids = list(child_request.prompt_token_ids)
+ child_request._output_token_ids = list(request._output_token_ids)
+ child_request._all_token_ids = list(request._all_token_ids)
```
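A small sketch of the difference (the variable names mirror the diff, but this uses plain lists rather than the real `Request` fields): resetting the child's token lists silently drops the first token sampled during the parent's prefill, while copying the parent's current state preserves it for detokenization and stop-condition checks.

```python
prompt = [10, 11, 12]
parent_output = [99]                  # first token from the parent's prefill
parent_all = prompt + parent_output

# Buggy initialization: the child loses the first generated token.
child_output_buggy = []
child_all_buggy = list(prompt)
assert 99 not in child_all_buggy      # token history is incomplete

# Fixed initialization: the child inherits the parent's token state.
child_output_fixed = list(parent_output)
child_all_fixed = list(parent_all)
assert child_all_fixed == [10, 11, 12, 99]
```

Copying with `list(...)` also matters: it gives each child an independent list, so one child's decode steps cannot mutate a sibling's history.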
```python
# Set status to RUNNING and add to running queue for immediate
# decode scheduling in the next iteration.
child_request.status = RequestStatus.WAITING
self.waiting.append(child_request)
```
self.waiting is an instance of RequestQueue (or a similar scheduling policy wrapper), which does not implement an append method. Attempting to call append will result in an AttributeError at runtime. You should use the add_request method, which is the standard way to add requests to the waiting queue in this scheduler implementation.
```diff
- self.waiting.append(child_request)
+ self.waiting.add_request(child_request)
```
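The failure mode in miniature, where `RequestQueue` is a hypothetical stand-in for the scheduler's policy wrapper rather than the real vLLM class: a wrapper that only exposes `add_request` raises `AttributeError` the moment `append` is called on it.

```python
from collections import deque

class RequestQueue:
    """Minimal stand-in for the scheduler's waiting-queue wrapper:
    it deliberately exposes add_request, not list-style append."""
    def __init__(self):
        self._queue = deque()

    def add_request(self, request):
        self._queue.append(request)

waiting = RequestQueue()
try:
    waiting.append("child-req")       # what the PR currently does
except AttributeError:
    pass                              # blows up at runtime, not at import
waiting.add_request("child-req")      # the wrapper's actual API
assert len(waiting._queue) == 1
```

This is why the bug survives until a request with n > 1 actually reaches the split path: nothing type-checks the queue's interface before then.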
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.