[P/D] layerwise connector support recompute scheduler #5900

Merged
zzzzwwjj merged 5 commits into vllm-project:main from liziyu179:recompute_proxy_for_cdp
Feb 7, 2026
Conversation

liziyu179 (Collaborator) commented Jan 14, 2026

What this PR does / why we need it?

This PR adds recompute-scheduler support to the layerwise connector.

NOTE:
Triggering a recompute invokes the tokenizer again, which may cause precision fluctuations.

[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support #4842

Does this PR introduce any user-facing change?

How was this patch tested?

github-actions bot commented:
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfills the PR description, so reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot left a comment:

Code Review

This pull request adds support for a recompute scheduler in the layerwise connector example. The implementation introduces complex logic to handle retries when a 'recomputed' signal is received from the backend.

My review has identified several issues:

  • There are two critical bugs that could lead to unhandled exceptions: one is an IndexError if the messages list in a chat request is empty, and the other is a TypeError in the completion_tokens calculation for non-streaming responses.
  • I also found a couple of high-severity issues related to code quality and robustness. The released_kv flag seems to be part of an incomplete copy-paste and might hide a resource leak. Additionally, the logic for recalculating max_tokens during a retry is brittle and should be refactored for better clarity and reliability.

I have provided specific comments with suggestions for fixing these issues.

Comment on lines +438 to +439
messages = req_data["messages"]
origin_prompt = messages[0].get("content", "")

critical

The code at line 439 (messages[0].get("content", "")) assumes that the messages list is not empty. If an API request is sent with an empty messages list (e.g., "messages": []), this will raise an IndexError, causing an unhandled exception. While the OpenAI API spec requires at least one message, it's best to code defensively.

A similar issue exists on lines 507-508. You should add checks to ensure messages is not empty before accessing its elements.

Here's a suggested way to fix this:

    # At lines 438-439
    messages = req_data["messages"]
    origin_prompt = messages[0].get("content", "") if messages else ""

    # And at lines 506-508
    if chat_flag and messages:
        messages[0]["content"] = origin_prompt + generated_token

Suggested change (before → after):

    # Before
    messages = req_data["messages"]
    origin_prompt = messages[0].get("content", "")
    # After
    messages = req_data["messages"]
    origin_prompt = messages[0].get("content", "") if messages else ""
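As a runnable sketch of this defensive pattern (`extract_origin_prompt` and the request shape are illustrative stand-ins, not the PR's actual code):

```python
def extract_origin_prompt(req_data):
    """Read the first message's content defensively.

    req_data["messages"] may be absent or an empty list in a malformed
    request; return "" in those cases instead of raising IndexError.
    """
    messages = req_data.get("messages") or []
    return messages[0].get("content", "") if messages else ""

print(extract_origin_prompt({"messages": [{"role": "user", "content": "hi"}]}))  # hi
print(repr(extract_origin_prompt({"messages": []})))  # ''
```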

Comment on lines +501 to +502
completion_tokens = (completion_tokens + 1) if stream_flag else \
(completion_tokens + usage.get("completion_tokens"))

critical

In the non-streaming case, usage.get("completion_tokens") can return None if the key is not present in the usage dictionary. Adding None to completion_tokens will raise a TypeError. You should provide a default value of 0 to prevent this.

Suggested change (before → after):

    # Before
    completion_tokens = (completion_tokens + 1) if stream_flag else \
        (completion_tokens + usage.get("completion_tokens"))
    # After
    completion_tokens = (completion_tokens + 1) if stream_flag else \
        (completion_tokens + usage.get("completion_tokens", 0))
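A minimal runnable sketch of why the default matters (`count_tokens` is an illustrative helper, not the proxy's real code):

```python
def count_tokens(completion_tokens, usage, stream_flag):
    # Streaming: each chunk contributes one token. Non-streaming: read
    # the usage block, defaulting to 0 when the backend omits the key,
    # so the addition never sees None.
    return (completion_tokens + 1) if stream_flag else \
        (completion_tokens + usage.get("completion_tokens", 0))

print(count_tokens(5, {"completion_tokens": 3}, stream_flag=False))  # 8
print(count_tokens(5, {}, stream_flag=False))  # 5, not a TypeError
```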

Comment on lines 446 to +448
nonlocal released_kv
generated_token = ""
released_kv = False

high

The released_kv variable is declared nonlocal and then reassigned, but it's never read within the generate_stream function. This indicates dead code. It seems that logic for releasing the KV cache, which is present in load_balance_proxy_server_example.py, is missing here. This could lead to a resource leak if the KV cache is not freed. Please either add the KV cache release logic or remove the unused released_kv variable and its related declarations.
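One way to make the release robust is a try/finally around the stream, sketched here with a plain generator (`stream_with_release` and `release_kv` are hypothetical names; the PR's real code is an async generator with more branches):

```python
def stream_with_release(chunks, release_kv):
    """Yield chunks and guarantee release_kv() runs exactly once,
    whether the stream completes, raises, or is abandoned."""
    released = False
    try:
        for chunk in chunks:
            yield chunk
        release_kv()  # normal completion
        released = True
    finally:
        if not released:
            release_kv()  # error / early-exit path

calls = []
assert list(stream_with_release(["a", "b"], lambda: calls.append(1))) == ["a", "b"]
print(len(calls))  # 1: released exactly once
```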

Comment on lines +512 to +513
req_data["max_tokens"] = origin_max_tokens - completion_tokens + retry_count

high

The calculation for max_tokens on retry, origin_max_tokens - completion_tokens + retry_count, is brittle. It seems to compensate for completion_tokens being incremented for the "recomputed" signal chunk, which is not a real token. This makes the logic hard to follow and prone to errors if the backend behavior changes.

A more robust approach would be to reorder the logic to check for stop_reason == "recomputed" before processing the chunk as a regular token and incrementing completion_tokens. This way, completion_tokens would accurately reflect the number of actual generated tokens, and the max_tokens calculation would simplify to origin_max_tokens - completion_tokens.
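The reordering the reviewer suggests can be sketched like this (`handle_chunk` and the state dict are hypothetical shapes modeled on the snippets above, not the PR's exact code):

```python
def handle_chunk(chunk, state):
    choice = chunk["choices"][0]
    # Check for the recompute signal FIRST, so the signal chunk is
    # never counted as a generated token.
    if choice.get("stop_reason") == "recomputed":
        state["needs_retry"] = True
        # completion_tokens now counts only real tokens, so the retry
        # budget is a plain subtraction (no retry_count fudge term).
        state["max_tokens"] = (state["origin_max_tokens"]
                               - state["completion_tokens"])
        return
    state["completion_tokens"] += 1
    state["generated"] += choice.get("text", "")

state = {"origin_max_tokens": 10, "completion_tokens": 0,
         "generated": "", "needs_retry": False}
handle_chunk({"choices": [{"text": "Hello"}]}, state)
handle_chunk({"choices": [{"stop_reason": "recomputed"}]}, state)
print(state["completion_tokens"], state["max_tokens"])  # 1 9
```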

LCAIZJ (Collaborator) commented Jan 16, 2026

Could you please provide a more detailed explanation of the background and design of this PR?

liziyu179 and others added 2 commits February 4, 2026 16:58
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
wangxiaoteng888 force-pushed the recompute_proxy_for_cdp branch from 8ceb651 to 3e0d74a on February 4, 2026 08:58
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
whx-sjtu added labels: ready (ready for review), ready-for-test (start test by label for PR) on Feb 5, 2026
liziyu179 force-pushed the recompute_proxy_for_cdp branch from 77a2abb to 4f942a7 on February 5, 2026 07:41
Signed-off-by: liziyu <liziyu16@huawei.com>
liziyu179 force-pushed the recompute_proxy_for_cdp branch from 4f942a7 to df51879 on February 5, 2026 07:41
@@ -451,7 +525,10 @@ async def generate_stream():
# After streaming done, release tokens
proxy_state.release_decoder(decoder_idx, decoder_score)

If an exception is thrown, should this also be called to do the release?

Collaborator Author replied:

Reaching this line means the stream has finished returning; at that point the KV cache records of the decoder (D) node are released.

        else:
            choice["text"] = generated_token
        chunk = json.dumps(chunk_json).encode("utf-8")
        yield chunk
except Exception as e:
    logger.error(
        f"Error during streaming from decoder {decoder.url}: {str(e)} "
winson-00178005 commented Feb 6, 2026:

You could add request_id/retry_count to the log to make troubleshooting easier.

Collaborator Author replied:

Thanks for reviewing the code. The number of retries is printed in stream_service_response_with_retry, and the request_id is printed on the next line.

zzzzwwjj merged commit e5f0e0e into vllm-project:main on Feb 7, 2026
17 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 9, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [Patch] Remove the patch of MiniCPM (vllm-project#5975)
  [P/D] layerwise connector support recompute scheduler (vllm-project#5900)
  [CI] Add workflow support for lint image build (vllm-project#6489)
  [Bugfix] Fix problematic dummy_run & improper input_batch_size in eagle (vllm-project#6517)
  [Refactor]310p_e2e test case update (vllm-project#6539)
  [Refactor]refactor p2p connector (vllm-project#6551)
  [Refactor]refactor 310p attention impl and add ut (vllm-project#6579)
  [Refactor]refactor 310p ops and add ut (vllm-project#6591)
  [Ops][Refactor] Remove custom rotary_embedding operator (vllm-project#6523)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(new Batch vllm-project#8) (vllm-project#6604)
  [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder  (vllm-project#5301)
  [CI] Fix broken CI (vllm-project#6599)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#10) (vllm-project#6173)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#11) (vllm-project#6176)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#8) (vllm-project#6129)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#7) (vllm-project#6023)
  [CI][Misc] Some improvement for github action (vllm-project#6587)
  [Image] Bump mooncake version to v0.3.8.post1 (vllm-project#6428)
luomin2005 pushed a commit to luomin2005/vllm-ascend that referenced this pull request Feb 11, 2026

### What this PR does / why we need it?
layerwise connector support recompute scheduler.

NOTE:
Triggering recompute will invoke the tokenizer again, which may lead to precision fluctuations.

[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support (vllm-project#4842)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@bde38c1

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
wangxiyuan mentioned this pull request Feb 24, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

ready (ready for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants