[bugfix] A2 Environment Pooling for Memcache Compatibility#5601
[bugfix] A2 Environment Pooling for Memcache Compatibility#5601wangxiyuan merged 9 commits intovllm-project:mainfrom
Conversation
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a bugfix for Memcache compatibility on A2 environments by adding device-specific logic. My review identifies two key areas for improvement: refactoring duplicated code to enhance maintainability and strengthening error handling to prevent masking potential bugs. These changes will make the code cleaner and more robust.
| def register_buffer(self, ptrs: list[int], sizes: list[int]): | ||
| pass | ||
| try: | ||
| for ptr, size in zip(ptrs, sizes): | ||
| self.store.register_buffer(ptr, size) | ||
| except Exception as exc: | ||
| logger.info( | ||
| "A3 environment does not require cache registration.: %s", exc) |
There was a problem hiding this comment.
The try...except Exception block is too broad and can mask unexpected errors. The log message suggests this is intended for a specific case on the "A3 environment", but it will catch any exception on any environment and log it as an info message. This could hide critical bugs during buffer registration on non-A3 environments, leading to a broken state without a clear failure signal.
To make the error handling more robust, you should check the device type and only suppress the exception if it's running on an A3 device. For all other devices, any exception should be logged as an error and re-raised.
Additionally, the parameter sizes in this method's signature overrides lengths from the base Backend class. It's good practice to maintain consistency with the base class signature by using lengths.
def register_buffer(self, ptrs: list[int], lengths: list[int]):
soc_version = get_ascend_device_type()
try:
for ptr, length in zip(ptrs, lengths):
self.store.register_buffer(ptr, length)
except Exception as exc:
if soc_version in {AscendDeviceType.A3}:
logger.info(
"A3 environment does not require cache registration; "
"suppressing exception: %s", exc)
else:
logger.error("An unexpected error occurred during buffer registration: %s", exc)
raise| soc_version = get_ascend_device_type() | ||
| if soc_version in {AscendDeviceType.A2}: | ||
| import torch | ||
| from vllm.distributed import get_world_group | ||
| tmp_tensor = torch.zeros(1, device="npu") | ||
| output_tensor_list = [ | ||
| torch.empty_like(tmp_tensor) | ||
| for _ in range(torch.distributed.get_world_size()) | ||
| ] | ||
| torch.distributed.all_gather( | ||
| output_tensor_list, | ||
| tmp_tensor, | ||
| group=get_world_group().device_group) | ||
| self.rank = parallel_config.rank | ||
| self.store = DistributedObjectStore() | ||
| res = self.store.init(self.rank) | ||
| assert res == 0 | ||
| else: | ||
| self.rank = parallel_config.rank | ||
| self.store = DistributedObjectStore() | ||
| res = self.store.init(self.rank) | ||
| assert res == 0 |
There was a problem hiding this comment.
There is significant code duplication between the if and else blocks. The lines for initializing self.rank and self.store are identical in both branches. This can be refactored to reduce redundancy and improve maintainability, which is crucial for preventing bugs where a fix is applied to one copy but not the other.
Additionally, torch is already imported at the top of the file, so the local import torch inside the if block is redundant and should be removed.
soc_version = get_ascend_device_type()
if soc_version in {AscendDeviceType.A2}:
from vllm.distributed import get_world_group
tmp_tensor = torch.zeros(1, device="npu")
output_tensor_list = [
torch.empty_like(tmp_tensor)
for _ in range(torch.distributed.get_world_size())
]
torch.distributed.all_gather(
output_tensor_list,
tmp_tensor,
group=get_world_group().device_group)
self.rank = parallel_config.rank
self.store = DistributedObjectStore()
res = self.store.init(self.rank)
assert res == 0|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
…to eplb_refactor * 'main' of https://github.com/vllm-project/vllm-ascend: [CI] Unblock 4-cards test (vllm-project#5831) [Refactor] Provide a framework to accommodate operators for different hardware devices (vllm-project#5735) [Refactor] Modify the binding logic to allocate CPU cores for each NPU card (vllm-project#5555) [BugFix] Support setting tp=1 for the Eagle draft model to take effect (vllm-project#5519) support triton of mrope (vllm-project#5664) [bugfix] A2 Environment Pooling for Memcache Compatibility (vllm-project#5601) [Doc] Update community contributors and versioning naming to follow vLLM (vllm-project#5820) [Refactor] Add comments for Metadata classes in attention module (vllm-project#5789) [Bugfix] bugfix for the order of dummy run pad and sync (vllm-project#5777) [CI] Move nightly-a2 test to hk (vllm-project#5807) [CI] Show disk usage for CI shared volume (vllm-project#5821) Bump actions/checkout from 4 to 6 (vllm-project#5795) Bump actions/github-script from 7 to 8 (vllm-project#5796) [bugfix](cp) align max_context_chunk to cp_virtual_block_size (vllm-project#5767) [bugfix]limit graph replay sync (vllm-project#5761) [CI]Add Kimi k2 nightly test (vllm-project#5682) [Doc] add tls check to pd disaggregation readme (vllm-project#5638) [CI] adpat v0.13.0 change (vllm-project#5793)
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>
…5842) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. pick-from: #5601 Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…ect#5601) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@7157596 --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>
What this PR does / why we need it?
When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance.
Does this PR introduce any user-facing change?
How was this patch tested?