[P/D] Improve the performance of Layerwise Connector #5303
yiz-liu merged 8 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces several performance improvements to the Layerwise Connector. The main changes include replacing stream synchronization with event synchronization for better asynchrony, accessing a metaserver during scheduling, and adjusting KV cache transfer logic. The switch to event-based synchronization is a good optimization. However, I've found some critical issues related to resource management and error handling in the new metaserver access logic that need to be addressed.
```python
def _access_metaserver(self, url, message):
    success = False
    retry = 0
    while retry < 3 and success is False:
        retry += 1
        try:
            self.metaserver_client.post(url, json=message)
            success = True
        except Exception as e:
            logger.error(
                f"Failed to connect to metaserver: {url}, retry {retry} time."
            )
            if retry == 3:
                raise e
```
|
The _access_metaserver method does not handle HTTP errors correctly. The httpx.Client.post method does not raise an exception for HTTP error status codes (e.g., 4xx, 5xx). The current implementation will treat such responses as successful, which can lead to silent failures and incorrect behavior. The retry logic also lacks a backoff delay, which can put unnecessary load on the metaserver during failures. The function also fails silently if all retries fail without an exception being caught on the last attempt.
```python
# Note: requires `import time` at module level for the backoff delay.
def _access_metaserver(self, url, message):
    last_exception = None
    for attempt in range(1, 4):
        try:
            response = self.metaserver_client.post(url, json=message)
            response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
            return
        except Exception as e:
            last_exception = e
            logger.error(
                f"Failed to connect to metaserver: {url}, attempt {attempt}/3. Error: {e}"
            )
            if attempt < 3:
                time.sleep(1)  # Add a delay before retrying
    if last_exception:
        raise last_exception
```

```python
self.executor = ThreadPoolExecutor(32)
self.metaserver_client = httpx.Client(
    limits=httpx.Limits(max_connections=100000),
    timeout=None)
```
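The retry-then-raise pattern discussed in this review can be sketched in a self-contained, library-agnostic form. `call_with_retry` and `flaky_post` are hypothetical names used only for illustration; the real code posts to the metaserver via `httpx`:

```python
import time

def call_with_retry(fn, attempts=3, base_delay=0.01):
    """Call fn(), retrying on exception with exponential backoff.

    Re-raises the last exception if every attempt fails, so callers
    never see a silent failure.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as e:
            last_exc = e
            if attempt < attempts:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise last_exc

# A stand-in for the metaserver request that fails twice, then succeeds.
calls = {"n": 0}
def flaky_post():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("metaserver unreachable")
    return "ok"

print(call_with_retry(flaky_post))  # → ok (succeeds on the third attempt)
```

The backoff avoids hammering the metaserver during transient failures, and the final re-raise guarantees the caller observes persistent ones.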
The ThreadPoolExecutor and httpx.Client are initialized here but never shut down or closed, which leaks resources (threads, file descriptors). It's crucial to add a shutdown method to the MooncakeLayerwiseConnectorScheduler class to clean up these resources.
Additionally, setting timeout=None for the httpx.Client is risky in a production system: a request to the metaserver could hang indefinitely and block a thread from the pool. It's strongly recommended to set a reasonable timeout.
```python
def shutdown(self):
    self.executor.shutdown(wait=True)
    self.metaserver_client.close()
```

This method should be called when the scheduler is no longer needed.
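One way to make this cleanup hard to forget is the context-manager protocol. A minimal sketch, assuming a hypothetical `SchedulerResources` class standing in for the scheduler's pooled resources:

```python
from concurrent.futures import ThreadPoolExecutor

class SchedulerResources:
    """Hypothetical owner of a thread pool; mirrors the shutdown pattern
    suggested for MooncakeLayerwiseConnectorScheduler."""

    def __init__(self, workers=4):
        self.executor = ThreadPoolExecutor(workers)
        self.closed = False

    def shutdown(self):
        # Idempotent: safe to call more than once.
        if not self.closed:
            self.executor.shutdown(wait=True)
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.shutdown()

with SchedulerResources() as res:
    fut = res.executor.submit(lambda: 2 + 2)
    print(fut.result())  # → 4

print(res.closed)  # → True: threads released on exit
```

Even without a `with` block, an explicit `shutdown()` call at teardown achieves the same effect; the key point is that every executor and HTTP client has exactly one owner responsible for closing it.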
```diff
 self.executor = ThreadPoolExecutor(32)
 self.metaserver_client = httpx.Client(
     limits=httpx.Limits(max_connections=100000),
-    timeout=None)
+    timeout=60.0)
```
```python
if self.current_layer != layer_index:
    self.current_layer = layer_index
    self.model_stream.synchronize()
    reshape_cache_event.synchronize()
```
Have you encountered any hang issues during testing? #4976
In our self-validation we used the latest CANN 8.5 package and did not encounter any hanging issues with event synchronization. We will add the relevant explanations.
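The benefit of event-level over stream-level synchronization is that a consumer waits only for one marked point in the producer's work, not for the whole stream to drain. A plain-Python analogy, with `threading.Event` standing in for `torch.npu.Event` (this is an illustration of the pattern, not the NPU API):

```python
import threading
import time

# Producer "records" an event after finishing one layer's KV-cache reshape;
# the consumer waits only on that event instead of a whole-stream barrier.
layer_done = threading.Event()
results = []

def producer():
    time.sleep(0.01)   # reshape work for this layer
    layer_done.set()   # analogous to event.record()
    time.sleep(0.05)   # unrelated later work keeps running, unblocked

def consumer():
    layer_done.wait()  # analogous to event.synchronize()
    results.append("kv-cache pushed after layer event")

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t2.join(timeout=1)
print(results[0])  # → kv-cache pushed after layer event
t1.join()
```

The consumer is released as soon as the event fires, while the producer's remaining work proceeds concurrently; a full stream synchronize would have made the consumer wait for all of it.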
```python
prefill_slots = attn_metadata.slot_mapping[
    num_decode_tokens:num_actual_tokens]
prefill_q_pe = self.rope_single(prefill_q_pe, cos, sin)
attn_metadata.prefill.reshape_cache_event = torch.npu.Event()
```
The event recording needs to be constrained to P nodes only, not merely to the prefill path.
```python
remote_host=self.side_channel_host,
remote_port=self.side_channel_port,
)
future = self.executor.submit(
```
Move the request for P nodes forward to the schedule stage, and delete the corresponding logic in the worker.
```python
num_decode_tokens:num_actual_tokens]
prefill_q_pe = self.rope_single(prefill_q_pe, cos, sin)
if self.is_kv_producer:
    attn_metadata.prefill.reshape_cache_event = torch.npu.Event()
```
Does long-sequence CP mode not need this extra processing?
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly in the following points:
1. Use event synchronization to replace stream synchronization.
2. Access the metaserver during scheduling.
3. Transfer the KV cache per chunked-prefill segment.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@5fbfa8d
---
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly in the following points:
1. Use event synchronization to replace stream synchronization.
2. Access the metaserver during scheduling.
3. Transfer the KV cache per chunked-prefill segment.
This PR is related to [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support #4842.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
By CI.
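The per-chunk KV-cache transfer this PR introduces can be sketched in a simplified, framework-free form. `push_kv_per_chunk` and the `transfer` callback are hypothetical names; in the real connector the transfer is asynchronous and overlaps with the next chunk's compute rather than running inline:

```python
def push_kv_per_chunk(tokens, chunk_size, transfer):
    """Send the KV cache after each chunked-prefill segment,
    instead of once after the whole prefill finishes."""
    sent = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # ... compute prefill for `chunk` here ...
        transfer(start, len(chunk))  # push this segment's KV cache now
        sent.append((start, len(chunk)))
    return sent

print(push_kv_per_chunk(list(range(10)), 4, lambda start, n: None))
# → [(0, 4), (4, 4), (8, 2)]
```

Pushing per segment lets the decode side start receiving KV blocks while later chunks are still being prefilled, shrinking the end-to-end transfer latency compared with one bulk push at the end.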