
[0.13.0][cherry-pick][bugfix]Synchronize memcache adaptation on A2#5842

Merged
wangxiyuan merged 1 commit into vllm-project:releases/v0.13.0 from DreamerLeader:v0.13.0 on Jan 14, 2026

Conversation

Contributor

@DreamerLeader DreamerLeader commented Jan 13, 2026

What this PR does / why we need it?

When running memcache in the A2 environment, logic for registering memory needs to be added. Additionally, there is a link-establishment conflict between memcache and HCCS during initialization on A2, so the link should be established in advance.
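The "establish the link in advance" fix boils down to issuing a throwaway collective before memcache initialization, so the communication backend sets up its links before memcache tries to. Below is a minimal single-process sketch of that pattern, not the PR's code: it uses the gloo backend on CPU for illustration, whereas the PR issues the dummy all_gather on the NPU device group of an already-initialized vLLM world group.

```python
import os
import torch
import torch.distributed as dist

# Single-process setup purely for illustration; in the PR, the process
# group already exists and the tensors live on "npu".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29517")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Dummy all_gather: the gathered result is discarded; the collective only
# serves to force the backend to establish its communication links early.
tmp_tensor = torch.zeros(1)
output_tensor_list = [
    torch.empty_like(tmp_tensor) for _ in range(dist.get_world_size())
]
dist.all_gather(output_tensor_list, tmp_tensor)

dist.destroy_process_group()
print("links warmed up")
```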

pick-from: #5601

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
@DreamerLeader DreamerLeader changed the title from "Synchronize memcache adaptation on A2 to the 0.13.0 branch" to "[bugfix]Synchronize memcache adaptation on A2 to the 0.13.0 branch" on Jan 13, 2026
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request synchronizes memcache adaptations for A2 hardware. The changes introduce device-specific logic for A2 in memcache_backend.py, improve the robustness of token allocation calculation in pool_scheduler.py, and fix a path parsing issue in pool_worker.py. My review includes two main points: first, I've identified significant code duplication in memcache_backend.py and suggested a refactoring to improve maintainability. Second, I've proposed a more concise and idiomatic way to calculate need_to_allocate in pool_scheduler.py. Both are high-severity suggestions aimed at improving code quality and correctness.

Comment on lines +30 to +51
            soc_version = get_ascend_device_type()
            if soc_version in {AscendDeviceType.A2}:
                import torch
                from vllm.distributed import get_world_group
                tmp_tensor = torch.zeros(1, device="npu")
                output_tensor_list = [
                    torch.empty_like(tmp_tensor)
                    for _ in range(torch.distributed.get_world_size())
                ]
                torch.distributed.all_gather(
                    output_tensor_list,
                    tmp_tensor,
                    group=get_world_group().device_group)
                self.rank = parallel_config.rank
                self.store = DistributedObjectStore()
                res = self.store.init(self.rank)
                assert res == 0
            else:
                self.rank = parallel_config.rank
                self.store = DistributedObjectStore()
                res = self.store.init(self.rank)
                assert res == 0

Severity: high

There is significant code duplication between the if and else blocks. The initialization of self.rank and self.store is identical in both branches. This can be refactored by moving the common code outside the conditional block to improve maintainability and reduce redundancy.

            soc_version = get_ascend_device_type()
            if soc_version in {AscendDeviceType.A2}:
                import torch
                from vllm.distributed import get_world_group
                tmp_tensor = torch.zeros(1, device="npu")
                output_tensor_list = [
                    torch.empty_like(tmp_tensor)
                    for _ in range(torch.distributed.get_world_size())
                ]
                torch.distributed.all_gather(
                    output_tensor_list,
                    tmp_tensor,
                    group=get_world_group().device_group)
            self.rank = parallel_config.rank
            self.store = DistributedObjectStore()
            res = self.store.init(self.rank)
            assert res == 0

Comment on lines +85 to +88
        if num_external_hit_tokens < num_computed_tokens:
            need_to_allocate = 0
        else:
            need_to_allocate = num_external_hit_tokens - num_computed_tokens

Severity: high

The logic to ensure need_to_allocate is not negative can be expressed more concisely using max(0, ...). This improves readability and is a common Python idiom for this pattern.

        need_to_allocate = max(0, num_external_hit_tokens - num_computed_tokens)
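As a quick standalone check (illustrative sketch, not part of the PR; the helper names are hypothetical), the `max(0, ...)` form agrees with the original if/else for every case, including when external hits fall short of already-computed tokens:

```python
def need_to_allocate_branchy(num_external_hit_tokens: int,
                             num_computed_tokens: int) -> int:
    # Original if/else formulation from pool_scheduler.py
    if num_external_hit_tokens < num_computed_tokens:
        return 0
    return num_external_hit_tokens - num_computed_tokens


def need_to_allocate_max(num_external_hit_tokens: int,
                         num_computed_tokens: int) -> int:
    # Suggested idiomatic formulation: clamp the difference at zero
    return max(0, num_external_hit_tokens - num_computed_tokens)


# The two agree on every case, including a negative difference
for hit, computed in [(0, 0), (5, 3), (3, 5), (100, 100)]:
    assert need_to_allocate_branchy(hit, computed) == \
           need_to_allocate_max(hit, computed)
print(need_to_allocate_max(3, 5))  # → 0
```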

@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels on Jan 13, 2026
@wangxiyuan wangxiyuan changed the title from "[bugfix]Synchronize memcache adaptation on A2 to the 0.13.0 branch" to "[0.13.0][cherry-pick][bugfix]Synchronize memcache adaptation on A2" on Jan 14, 2026
@wangxiyuan wangxiyuan merged commit 1d4aaab into vllm-project:releases/v0.13.0 Jan 14, 2026
17 checks passed
@DreamerLeader DreamerLeader deleted the v0.13.0 branch March 14, 2026 07:37

Labels

ready (read for review), ready-for-test (start test by label for PR)

2 participants