
V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory #39646

Closed
Zhutianyi7230 wants to merge 6 commits into vllm-project:releases/v0.12.0 from Zhutianyi7230:v0.12.0

Conversation


@Zhutianyi7230 Zhutianyi7230 commented Apr 12, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a delayed split strategy for parallel sampling in vLLM v1, allowing a single prefill to serve multiple decoding child requests. This optimization supports both PD disaggregation and colocated deployment modes. The review below flags three high-severity issues: a memory leak in the scheduler where child requests are never freed, a logic error where child requests fail to inherit the first token generated by the parent, and a runtime AttributeError caused by calling 'append' instead of 'add_request' on the scheduler's waiting queue.
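For readers unfamiliar with the pattern, the delayed-split idea can be sketched in isolation. This is a minimal model under stated assumptions: `Request`, `delayed_split`, and all field names below are illustrative, not vLLM's actual classes.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """Toy stand-in for a sampling request (not vLLM's Request)."""
    request_id: str
    prompt_token_ids: list
    n: int = 1  # number of parallel samples requested
    output_token_ids: list = field(default_factory=list)

def delayed_split(parent: Request) -> list:
    """After the parent's prefill produced its first token, fork n-1
    child requests that reuse the prefill instead of re-running it."""
    children = []
    for i in range(1, parent.n):
        child = Request(
            request_id=f"{parent.request_id}-{i}",
            prompt_token_ids=list(parent.prompt_token_ids),
            n=1,
            # Children copy the parent's full token state so the first
            # generated token is not lost (the bug flagged below).
            output_token_ids=list(parent.output_token_ids),
        )
        children.append(child)
    return children

parent = Request("req0", [1, 2, 3], n=3)
parent.output_token_ids.append(42)  # first token from the shared prefill
kids = delayed_split(parent)
```

The split is "delayed" because it happens only after prefill completes, so the expensive prompt computation and its KV-cache memory are paid once rather than n times.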

Comment on lines +1221 to +1224

    self.finished_req_ids.add(child_request.request_id)
    if self.finished_req_ids_dict is not None:
        self.finished_req_ids_dict[child_request.client_index].add(
            child_request.request_id)

high

In the PD disaggregation scenario, child requests are added to the scheduler's internal self.requests map during the split, but they are never removed. Simply adding them to finished_req_ids is insufficient to prevent a memory leak in the scheduler. You should call self._free_request(child_request) instead, which correctly handles the cleanup of the request from the scheduler's state and notifies the workers.

                        # Mark child request as finished and clean up.
                        self._free_request(child_request)

Comment on lines +1389 to +1390

    child_request._output_token_ids = []
    child_request._all_token_ids = list(child_request.prompt_token_ids)

high

In the PD colocated scenario, child requests must inherit the first token generated by the parent's prefill phase. By initializing _output_token_ids as an empty list and _all_token_ids with only the prompt tokens, the child requests will lose the first generated token. This will lead to incorrect results, as subsequent decode steps will be missing the first token in their sequence history, affecting detokenization and stop-condition checks. They should copy the parent's current token state.

Suggested change

    -child_request._output_token_ids = []
    -child_request._all_token_ids = list(child_request.prompt_token_ids)
    +child_request._output_token_ids = list(request._output_token_ids)
    +child_request._all_token_ids = list(request._all_token_ids)
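The token loss this comment describes can be demonstrated with a tiny standalone sketch (the token values are illustrative):

```python
prompt = [10, 11, 12]
parent_output = [99]  # first token produced by the parent's prefill

# Buggy initialization: the child starts from the prompt only.
child_all_buggy = list(prompt)

# Suggested initialization: the child copies the parent's full token state.
child_all_fixed = list(prompt) + list(parent_output)
```

With the buggy initialization the child's next decode step conditions on `[10, 11, 12]` instead of `[10, 11, 12, 99]`, so detokenization and stop-condition checks operate on a sequence missing its first generated token.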

    # Set status to RUNNING and add to running queue for immediate
    # decode scheduling in the next iteration.
    child_request.status = RequestStatus.WAITING
    self.waiting.append(child_request)

high

self.waiting is an instance of RequestQueue (or a similar scheduling policy wrapper), which does not implement an append method. Attempting to call append will result in an AttributeError at runtime. You should use the add_request method, which is the standard way to add requests to the waiting queue in this scheduler implementation.

Suggested change

    -self.waiting.append(child_request)
    +self.waiting.add_request(child_request)
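A minimal stand-in shows why list-style `append` fails on a queue wrapper that only exposes `add_request`. This is a hypothetical sketch, not vLLM's actual `RequestQueue` implementation:

```python
import collections

class RequestQueue:
    """Toy waiting queue that wraps a deque behind a policy-aware
    interface, deliberately exposing add_request() but not append()."""
    def __init__(self):
        self._queue = collections.deque()

    def add_request(self, request):
        # A real scheduling-policy wrapper could reorder or prioritize
        # here; a plain append() would bypass that logic entirely.
        self._queue.append(request)

    def pop_request(self):
        return self._queue.popleft()

waiting = RequestQueue()
waiting.add_request("child-0")
```

Because the wrapper does not subclass `list` or `deque`, `waiting.append(...)` raises AttributeError at runtime, which is exactly the failure mode the reviewer predicts.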

@Zhutianyi7230 Zhutianyi7230 changed the title V0.12.0 support n sampling delay split to Eliminate redundant prefill computation and memory V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory Apr 12, 2026
