[P/D] layerwise connector support recompute scheduler #5900
zzzzwwjj merged 5 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for a recompute scheduler in the layerwise connector example. The implementation introduces complex logic to handle retries when a 'recomputed' signal is received from the backend.
My review has identified several issues:
- There are two critical bugs that could lead to unhandled exceptions: one is an IndexError if the messages list in a chat request is empty, and the other is a TypeError in the completion_tokens calculation for non-streaming responses.
- I also found a couple of high-severity issues related to code quality and robustness. The released_kv flag seems to be part of an incomplete copy-paste and might hide a resource leak. Additionally, the logic for recalculating max_tokens during a retry is brittle and should be refactored for better clarity and reliability.
I have provided specific comments with suggestions for fixing these issues.
messages = req_data["messages"]
origin_prompt = messages[0].get("content", "")
The code at line 439 (messages[0].get("content", "")) assumes that the messages list is not empty. If an API request is sent with an empty messages list (e.g., "messages": []), this will raise an IndexError, causing an unhandled exception. While the OpenAI API spec requires at least one message, it's best to code defensively.
A similar issue exists on lines 507-508. You should add checks to ensure messages is not empty before accessing its elements.
Here's a suggested way to fix this:
# At lines 438-439
messages = req_data["messages"]
origin_prompt = messages[0].get("content", "") if messages else ""

# And at lines 506-508
if chat_flag and messages:
    messages[0]["content"] = origin_prompt + generated_token
completion_tokens = (completion_tokens + 1) if stream_flag else \
    (completion_tokens + usage.get("completion_tokens"))
In the non-streaming case, usage.get("completion_tokens") can return None if the key is not present in the usage dictionary. Adding None to completion_tokens will raise a TypeError. You should provide a default value of 0 to prevent this.
completion_tokens = (completion_tokens + 1) if stream_flag else \
    (completion_tokens + usage.get("completion_tokens", 0))
nonlocal released_kv
generated_token = ""
released_kv = False
The released_kv variable is declared nonlocal and then reassigned, but it's never read within the generate_stream function. This indicates dead code. It seems that logic for releasing the KV cache, which is present in load_balance_proxy_server_example.py, is missing here. This could lead to a resource leak if the KV cache is not freed. Please either add the KV cache release logic or remove the unused released_kv variable and its related declarations.
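For reference, a minimal sketch of the pattern presumably intended here, where the flag guards a finally-block release so the KV records are freed exactly once even if the stream aborts mid-way. The names (proxy_state.release_decoder, decoder_idx, decoder_score) are assumptions taken from the surrounding diff, not the actual contents of load_balance_proxy_server_example.py:

```python
def generate_stream_sketch(proxy_state, decoder_idx, decoder_score, chunks):
    """Hypothetical sketch: release KV records exactly once, on any exit path."""
    released_kv = False
    try:
        for chunk in chunks:
            yield chunk
        # Normal completion: release and mark as released.
        proxy_state.release_decoder(decoder_idx, decoder_score)
        released_kv = True
    finally:
        if not released_kv:
            # Abnormal exit (exception or early close): release here instead.
            proxy_state.release_decoder(decoder_idx, decoder_score)
```

With this pattern the flag is actually read (in the finally block), so it no longer looks like dead code.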
req_data["max_tokens"] = origin_max_tokens - completion_tokens + retry_count
The calculation for max_tokens on retry, origin_max_tokens - completion_tokens + retry_count, is brittle. It seems to compensate for completion_tokens being incremented for the "recomputed" signal chunk, which is not a real token. This makes the logic hard to follow and prone to errors if the backend behavior changes.
A more robust approach would be to reorder the logic to check for stop_reason == "recomputed" before processing the chunk as a regular token and incrementing completion_tokens. This way, completion_tokens would accurately reflect the number of actual generated tokens, and the max_tokens calculation would simplify to origin_max_tokens - completion_tokens.
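A minimal sketch of that reordering. The names (stop_reason, origin_max_tokens) follow this review thread rather than the actual file, so treat this as an illustration under those assumptions:

```python
def next_max_tokens(chunks, origin_max_tokens):
    """Count only real generated tokens; the 'recomputed' signal chunk is
    detected first and never counted, so the retry budget needs no
    retry_count correction."""
    completion_tokens = 0
    needs_retry = False
    for chunk in chunks:
        if chunk.get("stop_reason") == "recomputed":
            needs_retry = True  # backend signal, not a real token
            continue
        completion_tokens += 1  # a real generated token
    return origin_max_tokens - completion_tokens, needs_retry
```

Because completion_tokens now reflects only actual tokens, the recomputed request's budget is simply origin_max_tokens - completion_tokens.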
Could you please provide a more detailed explanation of the background and design of this PR?
@@ -451,7 +525,10 @@ async def generate_stream():
    # After streaming done, release tokens
    proxy_state.release_decoder(decoder_idx, decoder_score)
Reaching this line of the function means the stream has finished returning; this is where the KV-cache records of node D are released.
else:
    choice["text"] = generated_token
chunk = json.dumps(chunk_json).encode("utf-8")
yield chunk
except Exception as e:
    logger.error(
        f"Error during streaming from decoder {decoder.url}: {str(e)} "
It would help troubleshooting to include request_id/retry_count in this log message.
Thanks for reviewing the code. The number of retries will be printed in stream_service_response_with_retry, and the request_id will be printed on the next line.
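For illustration, a hedged sketch of what the enriched log line could look like; request_id and retry_count are assumed to be in scope in the real handler, and the function name is hypothetical:

```python
import logging

logger = logging.getLogger("proxy")

def format_stream_error(decoder_url, exc, request_id, retry_count):
    """Build and emit the streaming-error log with request context for triage."""
    msg = (f"Error during streaming from decoder {decoder_url} "
           f"(request_id={request_id}, retry_count={retry_count}): {exc}")
    logger.error(msg)
    return msg
```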
What this PR does / why we need it?
Support the recompute scheduler in the layerwise connector.
NOTE:
Triggering recompute will invoke the tokenizer again, which may lead to precision fluctuations.
[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support #4842
Does this PR introduce any user-facing change?
How was this patch tested?