[P/D] Improve the performance of Layerwise Connector #5303
yiz-liu merged 8 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces several performance improvements to the Layerwise Connector. The main changes include replacing stream synchronization with event synchronization for better asynchrony, accessing a metaserver during scheduling, and adjusting KV cache transfer logic. The switch to event-based synchronization is a good optimization. However, I've found some critical issues related to resource management and error handling in the new metaserver access logic that need to be addressed.
```python
def _access_metaserver(self, url, message):
    success = False
    retry = 0
    while retry < 3 and success is False:
        retry += 1
        try:
            self.metaserver_client.post(url, json=message)
            success = True
        except Exception as e:
            logger.error(
                f"Failed to connect to metaserver: {url}, retry {retry} time."
            )
            if retry == 3:
                raise e
```
|
The _access_metaserver method does not handle HTTP errors correctly. The httpx.Client.post method does not raise an exception for HTTP error status codes (e.g., 4xx, 5xx). The current implementation will treat such responses as successful, which can lead to silent failures and incorrect behavior. The retry logic also lacks a backoff delay, which can put unnecessary load on the metaserver during failures. The function also fails silently if all retries fail without an exception being caught on the last attempt.
```python
# Note: requires `import time` at module level for the backoff delay.
def _access_metaserver(self, url, message):
    last_exception = None
    for attempt in range(1, 4):
        try:
            response = self.metaserver_client.post(url, json=message)
            response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
            return
        except Exception as e:
            last_exception = e
            logger.error(
                f"Failed to connect to metaserver: {url}, attempt {attempt}/3. Error: {e}"
            )
            if attempt < 3:
                time.sleep(1)  # Add a delay before retrying
    if last_exception:
        raise last_exception
```

```python
self.executor = ThreadPoolExecutor(32)
self.metaserver_client = httpx.Client(
    limits=httpx.Limits(max_connections=100000),
    timeout=None)
```
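The retry-then-raise pattern discussed in this review can be sketched in a self-contained, library-agnostic form. `call_with_retry` and `flaky_post` are hypothetical names used only for illustration; the real code posts to the metaserver via `httpx`:

```python
import time

def call_with_retry(fn, attempts=3, base_delay=0.01):
    """Call fn(), retrying on exception with exponential backoff.

    Re-raises the last exception if every attempt fails, so callers
    never see a silent failure.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as e:
            last_exc = e
            if attempt < attempts:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise last_exc

# A stand-in for the metaserver request that fails twice, then succeeds.
calls = {"n": 0}
def flaky_post():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("metaserver unreachable")
    return "ok"

print(call_with_retry(flaky_post))  # → ok (succeeds on the third attempt)
```

The backoff avoids hammering the metaserver during transient failures, and the final re-raise guarantees the caller observes persistent ones.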
The ThreadPoolExecutor and httpx.Client are initialized here but never shut down or closed, which leaks resources (threads, file descriptors). It's crucial to add a shutdown method to the MooncakeLayerwiseConnectorScheduler class to clean up these resources.
Additionally, setting timeout=None for the httpx.Client is risky in a production system: a request to the metaserver could hang indefinitely and block a thread from the pool. It's strongly recommended to set a reasonable timeout.
```python
def shutdown(self):
    self.executor.shutdown(wait=True)
    self.metaserver_client.close()
```

This method should be called when the scheduler is no longer needed.
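One way to make this cleanup hard to forget is the context-manager protocol. A minimal sketch, assuming a hypothetical `SchedulerResources` class standing in for the scheduler's pooled resources:

```python
from concurrent.futures import ThreadPoolExecutor

class SchedulerResources:
    """Hypothetical owner of a thread pool; mirrors the shutdown pattern
    suggested for MooncakeLayerwiseConnectorScheduler."""

    def __init__(self, workers=4):
        self.executor = ThreadPoolExecutor(workers)
        self.closed = False

    def shutdown(self):
        # Idempotent: safe to call more than once.
        if not self.closed:
            self.executor.shutdown(wait=True)
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.shutdown()

with SchedulerResources() as res:
    fut = res.executor.submit(lambda: 2 + 2)
    print(fut.result())  # → 4

print(res.closed)  # → True: threads released on exit
```

Even without a `with` block, an explicit `shutdown()` call at teardown achieves the same effect; the key point is that every executor and HTTP client has exactly one owner responsible for closing it.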
```diff
 self.executor = ThreadPoolExecutor(32)
 self.metaserver_client = httpx.Client(
     limits=httpx.Limits(max_connections=100000),
-    timeout=None)
+    timeout=60.0)
```
```python
if self.current_layer != layer_index:
    self.current_layer = layer_index
    self.model_stream.synchronize()
    reshape_cache_event.synchronize()
```
Have you encountered any hang issues during testing? #4976
In our self-validation we used the latest CANN 8.5 package and did not encounter any hanging issues with event synchronization. We will add the relevant explanations.
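The benefit of event-level over stream-level synchronization is that a consumer waits only for one marked point in the producer's work, not for the whole stream to drain. A plain-Python analogy, with `threading.Event` standing in for `torch.npu.Event` (this is an illustration of the pattern, not the NPU API):

```python
import threading
import time

# Producer "records" an event after finishing one layer's KV-cache reshape;
# the consumer waits only on that event instead of a whole-stream barrier.
layer_done = threading.Event()
results = []

def producer():
    time.sleep(0.01)   # reshape work for this layer
    layer_done.set()   # analogous to event.record()
    time.sleep(0.05)   # unrelated later work keeps running, unblocked

def consumer():
    layer_done.wait()  # analogous to event.synchronize()
    results.append("kv-cache pushed after layer event")

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t2.join(timeout=1)
print(results[0])  # → kv-cache pushed after layer event
t1.join()
```

The consumer is released as soon as the event fires, while the producer's remaining work proceeds concurrently; a full stream synchronize would have made the consumer wait for all of it.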
```python
prefill_slots = attn_metadata.slot_mapping[
    num_decode_tokens:num_actual_tokens]
prefill_q_pe = self.rope_single(prefill_q_pe, cos, sin)
attn_metadata.prefill.reshape_cache_event = torch.npu.Event()
```
The event recording needs to be constrained to P nodes only, not merely to the prefill path.
```python
remote_host=self.side_channel_host,
remote_port=self.side_channel_port,
)
future = self.executor.submit(
```
Move the request for P nodes forward to the schedule stage, and delete the corresponding logic in the worker.
```python
num_decode_tokens:num_actual_tokens]
prefill_q_pe = self.rope_single(prefill_q_pe, cos, sin)
if self.is_kv_producer:
    attn_metadata.prefill.reshape_cache_event = torch.npu.Event()
```
Does long-sequence CP mode not need this extra processing?
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly in the following points:
1. Use event synchronization to replace stream synchronization.
2. Access the metaserver during scheduling.
3. Transfer the KV cache per chunked-prefill segment.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@5fbfa8d
---
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly in the following points:
1. Use event synchronization to replace stream synchronization.
2. Access the metaserver during scheduling.
3. Transfer the KV cache per chunked-prefill segment.
This PR is related to [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support #4842.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
By CI.
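The per-chunk KV-cache transfer this PR introduces can be sketched in a simplified, framework-free form. `push_kv_per_chunk` and the `transfer` callback are hypothetical names; in the real connector the transfer is asynchronous and overlaps with the next chunk's compute rather than running inline:

```python
def push_kv_per_chunk(tokens, chunk_size, transfer):
    """Send the KV cache after each chunked-prefill segment,
    instead of once after the whole prefill finishes."""
    sent = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # ... compute prefill for `chunk` here ...
        transfer(start, len(chunk))  # push this segment's KV cache now
        sent.append((start, len(chunk)))
    return sent

print(push_kv_per_chunk(list(range(10)), 4, lambda start, n: None))
# → [(0, 4), (4, 4), (8, 2)]
```

Pushing per segment lets the decode side start receiving KV blocks while later chunks are still being prefilled, shrinking the end-to-end transfer latency compared with one bulk push at the end.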