[P/D] layerwise connector support recompute scheduler #5900
zzzzwwjj merged 5 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for a recompute scheduler in the layerwise connector example. The implementation introduces complex logic to handle retries when a 'recomputed' signal is received from the backend.
My review has identified several issues:
- There are two critical bugs that could lead to unhandled exceptions: one is an IndexError if the messages list in a chat request is empty, and the other is a TypeError in the completion_tokens calculation for non-streaming responses.
- I also found a couple of high-severity issues related to code quality and robustness. The released_kv flag seems to be part of an incomplete copy-paste and might hide a resource leak. Additionally, the logic for recalculating max_tokens during a retry is brittle and should be refactored for better clarity and reliability.
I have provided specific comments with suggestions for fixing these issues.
messages = req_data["messages"]
origin_prompt = messages[0].get("content", "")
The code at line 439 (messages[0].get("content", "")) assumes that the messages list is not empty. If an API request is sent with an empty messages list (e.g., "messages": []), this will raise an IndexError, causing an unhandled exception. While the OpenAI API spec requires at least one message, it's best to code defensively.
A similar issue exists on lines 507-508. You should add checks to ensure messages is not empty before accessing its elements.
Here's a suggested way to fix this:
# At lines 438-439
messages = req_data["messages"]
origin_prompt = messages[0].get("content", "") if messages else ""

# And at lines 506-508
if chat_flag and messages:
    messages[0]["content"] = origin_prompt + generated_token
completion_tokens = (completion_tokens + 1) if stream_flag else \
    (completion_tokens + usage.get("completion_tokens"))
In the non-streaming case, usage.get("completion_tokens") can return None if the key is not present in the usage dictionary. Adding None to completion_tokens will raise a TypeError. You should provide a default value of 0 to prevent this.
completion_tokens = (completion_tokens + 1) if stream_flag else \
    (completion_tokens + usage.get("completion_tokens", 0))
nonlocal released_kv
generated_token = ""
released_kv = False
The released_kv variable is declared nonlocal and then reassigned, but it's never read within the generate_stream function. This indicates dead code. It seems that logic for releasing the KV cache, which is present in load_balance_proxy_server_example.py, is missing here. This could lead to a resource leak if the KV cache is not freed. Please either add the KV cache release logic or remove the unused released_kv variable and its related declarations.
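For reference, a minimal sketch of the pattern presumably intended here, where the flag guards a finally-block release so the KV records are freed exactly once even if the stream aborts mid-way. The names (proxy_state.release_decoder, decoder_idx, decoder_score) are assumptions taken from the surrounding diff, not the actual contents of load_balance_proxy_server_example.py:

```python
def generate_stream_sketch(proxy_state, decoder_idx, decoder_score, chunks):
    """Hypothetical sketch: release KV records exactly once, on any exit path."""
    released_kv = False
    try:
        for chunk in chunks:
            yield chunk
        # Normal completion: release and mark as released.
        proxy_state.release_decoder(decoder_idx, decoder_score)
        released_kv = True
    finally:
        if not released_kv:
            # Abnormal exit (exception or early close): release here instead.
            proxy_state.release_decoder(decoder_idx, decoder_score)
```

With this pattern the flag is actually read (in the finally block), so it no longer looks like dead code.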
req_data["max_tokens"] = origin_max_tokens - completion_tokens + retry_count
The calculation for max_tokens on retry, origin_max_tokens - completion_tokens + retry_count, is brittle. It seems to compensate for completion_tokens being incremented for the "recomputed" signal chunk, which is not a real token. This makes the logic hard to follow and prone to errors if the backend behavior changes.
A more robust approach would be to reorder the logic to check for stop_reason == "recomputed" before processing the chunk as a regular token and incrementing completion_tokens. This way, completion_tokens would accurately reflect the number of actual generated tokens, and the max_tokens calculation would simplify to origin_max_tokens - completion_tokens.
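A minimal sketch of that reordering. The names (stop_reason, origin_max_tokens) follow this review thread rather than the actual file, so treat this as an illustration under those assumptions:

```python
def next_max_tokens(chunks, origin_max_tokens):
    """Count only real generated tokens; the 'recomputed' signal chunk is
    detected first and never counted, so the retry budget needs no
    retry_count correction."""
    completion_tokens = 0
    needs_retry = False
    for chunk in chunks:
        if chunk.get("stop_reason") == "recomputed":
            needs_retry = True  # backend signal, not a real token
            continue
        completion_tokens += 1  # a real generated token
    return origin_max_tokens - completion_tokens, needs_retry
```

Because completion_tokens now reflects only actual tokens, the recomputed request's budget is simply origin_max_tokens - completion_tokens.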
Could you please provide a more detailed explanation of the background and design of this PR?
@@ -451,7 +525,10 @@ async def generate_stream():
    # After streaming done, release tokens
    proxy_state.release_decoder(decoder_idx, decoder_score)
Reaching this line of the function means the stream has finished returning; this is where the KV-cache records of node D are released.
else:
    choice["text"] = generated_token
chunk = json.dumps(chunk_json).encode("utf-8")
yield chunk
except Exception as e:
    logger.error(
        f"Error during streaming from decoder {decoder.url}: {str(e)} "
It would help troubleshooting to include request_id/retry_count in this log message.
Thanks for reviewing the code. The number of retries will be printed in stream_service_response_with_retry, and the request_id will be printed on the next line.
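For illustration, a hedged sketch of what the enriched log line could look like; request_id and retry_count are assumed to be in scope in the real handler, and the function name is hypothetical:

```python
import logging

logger = logging.getLogger("proxy")

def format_stream_error(decoder_url, exc, request_id, retry_count):
    """Build and emit the streaming-error log with request context for triage."""
    msg = (f"Error during streaming from decoder {decoder_url} "
           f"(request_id={request_id}, retry_count={retry_count}): {exc}")
    logger.error(msg)
    return msg
```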
What this PR does / why we need it?
Support the recompute scheduler in the layerwise connector.
NOTE:
Triggering recompute will invoke the tokenizer again, which may lead to precision fluctuations.
[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support #4842
Does this PR introduce any user-facing change?
How was this patch tested?