V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory #39646
Zhutianyi7230 wants to merge 6 commits into vllm-project:releases/v0.12.0
Conversation
Code Review
This pull request implements a delayed split strategy for parallel sampling in vLLM v1, allowing a single prefill to serve multiple decoding child requests. This optimization supports both PD disaggregation and colocated deployment modes. Feedback highlights several critical issues: a memory leak in the scheduler where child requests are not properly freed, a logic error where child requests fail to inherit the first generated token from the parent, and a runtime AttributeError caused by using 'append' instead of 'add_request' on the scheduler's waiting queue.
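For context, the delayed-split idea can be modeled in plain Python. This is an illustrative sketch, not the actual vLLM classes: the `Request` dataclass and `fork_children` helper are hypothetical stand-ins. One prefill produces the first sampled token, and only then is the request split into n children that each continue decoding from that shared state, instead of running n redundant prefills.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    request_id: str
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)

def fork_children(parent: Request, n: int) -> list[Request]:
    """After the parent's single prefill, fork n child requests that
    all inherit the parent's token state (prompt + first token)."""
    return [
        Request(
            request_id=f"{parent.request_id}-{i}",
            prompt_token_ids=list(parent.prompt_token_ids),
            output_token_ids=list(parent.output_token_ids),
        )
        for i in range(n)
    ]

# One prefill: the parent samples its first token exactly once.
parent = Request("req-0", prompt_token_ids=[1, 2, 3])
parent.output_token_ids.append(42)  # first sampled token

# Delayed split: children are created only after the prefill completes.
children = fork_children(parent, n=4)
assert all(c.output_token_ids == [42] for c in children)
```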
```python
self.finished_req_ids.add(child_request.request_id)
if self.finished_req_ids_dict is not None:
    self.finished_req_ids_dict[child_request.client_index].add(
        child_request.request_id)
```
In the PD disaggregation scenario, child requests are added to the scheduler's internal self.requests map during the split, but they are never removed. Simply adding them to finished_req_ids is insufficient to prevent a memory leak in the scheduler. You should call self._free_request(child_request) instead, which correctly handles the cleanup of the request from the scheduler's state and notifies the workers.
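To illustrate the leak with a toy scheduler (a hypothetical, heavily simplified stand-in for the real `Scheduler`): every request lives in a `requests` dict, so adding an id to `finished_req_ids` without also popping the dict entry keeps the request object alive forever, while a `_free_request`-style helper does both.

```python
class ToyScheduler:
    def __init__(self):
        self.requests = {}        # request_id -> request object
        self.finished_req_ids = set()

    def add(self, request_id, request):
        self.requests[request_id] = request

    def finish_leaky(self, request_id):
        # Only marks finished: the entry in self.requests is never freed.
        self.finished_req_ids.add(request_id)

    def free_request(self, request_id):
        # Marks finished AND removes the request from scheduler state.
        self.finished_req_ids.add(request_id)
        self.requests.pop(request_id, None)

sched = ToyScheduler()
sched.add("child-1", object())
sched.finish_leaky("child-1")
assert "child-1" in sched.requests        # leaked: still referenced
sched.free_request("child-1")
assert "child-1" not in sched.requests    # properly cleaned up
```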
```python
# Mark child request as finished and clean up.
self._free_request(child_request)
```

```python
child_request._output_token_ids = []
child_request._all_token_ids = list(child_request.prompt_token_ids)
```
In the PD colocated scenario, child requests must inherit the first token generated by the parent's prefill phase. By initializing _output_token_ids as an empty list and _all_token_ids with only the prompt tokens, the child requests will lose the first generated token. This will lead to incorrect results, as subsequent decode steps will be missing the first token in their sequence history, affecting detokenization and stop-condition checks. They should copy the parent's current token state.
```diff
- child_request._output_token_ids = []
- child_request._all_token_ids = list(child_request.prompt_token_ids)
+ child_request._output_token_ids = list(request._output_token_ids)
+ child_request._all_token_ids = list(request._all_token_ids)
```
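A small sketch of the difference (the variable names mirror the diff, but this uses plain lists rather than the real `Request` fields): resetting the child's token lists silently drops the first token sampled during the parent's prefill, while copying the parent's current state preserves it for detokenization and stop-condition checks.

```python
prompt = [10, 11, 12]
parent_output = [99]                  # first token from the parent's prefill
parent_all = prompt + parent_output

# Buggy initialization: the child loses the first generated token.
child_output_buggy = []
child_all_buggy = list(prompt)
assert 99 not in child_all_buggy      # token history is incomplete

# Fixed initialization: the child inherits the parent's token state.
child_output_fixed = list(parent_output)
child_all_fixed = list(parent_all)
assert child_all_fixed == [10, 11, 12, 99]
```

Copying with `list(...)` also matters: it gives each child an independent list, so one child's decode steps cannot mutate a sibling's history.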
```python
# Set status to RUNNING and add to running queue for immediate
# decode scheduling in the next iteration.
child_request.status = RequestStatus.WAITING
self.waiting.append(child_request)
```
self.waiting is an instance of RequestQueue (or a similar scheduling policy wrapper), which does not implement an append method. Attempting to call append will result in an AttributeError at runtime. You should use the add_request method, which is the standard way to add requests to the waiting queue in this scheduler implementation.
```diff
- self.waiting.append(child_request)
+ self.waiting.add_request(child_request)
```
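The failure mode in miniature, where `RequestQueue` is a hypothetical stand-in for the scheduler's policy wrapper rather than the real vLLM class: a wrapper that only exposes `add_request` raises `AttributeError` the moment `append` is called on it.

```python
from collections import deque

class RequestQueue:
    """Minimal stand-in for the scheduler's waiting-queue wrapper:
    it deliberately exposes add_request, not list-style append."""
    def __init__(self):
        self._queue = deque()

    def add_request(self, request):
        self._queue.append(request)

waiting = RequestQueue()
try:
    waiting.append("child-req")       # what the PR currently does
except AttributeError:
    pass                              # blows up at runtime, not at import
waiting.add_request("child-req")      # the wrapper's actual API
assert len(waiting._queue) == 1
```

This is why the bug survives until a request with n > 1 actually reaches the split path: nothing type-checks the queue's interface before then.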
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.