
Add monkey patch#24

Closed
wangxiyuan wants to merge 1 commit into vllm-project:main from wangxiyuan:add_patch

Conversation

@wangxiyuan
Collaborator

Some PRs for plugin support have not been merged into vLLM yet. This PR adds monkey patches to vllm-ascend so that vllm-ascend works with vLLM directly.

This patch code should be removed once the related functionality is supported natively by vLLM.
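To illustrate the monkey-patch approach in general (the module and function names below are invented for illustration; this is not the PR's actual patch code):

```python
import types

# Stand-in for a vLLM module that does not yet expose the hook we need.
# In the real patch this would be an actual `import vllm...` module.
fake_vllm = types.ModuleType("fake_vllm")

def _original_get_communicator_cls():
    raise NotImplementedError("plugin hook not merged upstream yet")

fake_vllm.get_communicator_cls = _original_get_communicator_cls

# The monkey patch: at vllm-ascend import time, replace the attribute with
# an implementation that knows about the out-of-tree NPU backend.
def _patched_get_communicator_cls():
    return "vllm_ascend.communicator.NPUCommunicator"

fake_vllm.get_communicator_cls = _patched_get_communicator_cls

print(fake_vllm.get_communicator_cls())
```

Because the replacement happens on the module object itself, every later caller that looks the attribute up through the module picks up the patched version; this is also why such patches must be deleted once vLLM provides the hook natively.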

@wangxiyuan wangxiyuan changed the title from "Add monckey patch" to "Add monkey patch" on Feb 10, 2025
from vllm.platforms import current_platform
device_comm_cls = resolve_obj_by_qualname(
current_platform.get_device_communicator_cls())
self.communicator = device_comm_cls(group=self.device_group,
Collaborator

I think we should check that use_xxx_communicator (any of them is fine, because they are all the same) and world_size > 1 are both true before creating the communicator.
https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L167-L169

Besides the model parallel groups, there is also a world group, which doesn't use any device communication. Adding this check will reduce the time spent creating the world group.
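The suggested guard can be sketched as a small helper (names are invented for illustration; in the actual patch the check lives inline in GroupCoordinator.__init__):

```python
def maybe_create_communicator(world_size, use_device_communicator, factory):
    """Create a device communicator only when it can actually be used.

    world_size: size of the process group being built.
    use_device_communicator: a use_xxx_communicator-style flag.
    factory: callable that builds the communicator (hypothetical).
    """
    if use_device_communicator and world_size > 1:
        return factory()
    # World group / single-process case: skip the device-specific setup.
    return None

# The world group never needs a device communicator, so nothing is built:
comm = maybe_create_communicator(world_size=1,
                                 use_device_communicator=True,
                                 factory=lambda: object())
print(comm is None)
```

The point of the guard is purely to avoid paying the communicator construction cost for groups that will never issue device collectives.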

Collaborator Author

I added a world_size check in the new patch. There is no use_xxx_communicator in vLLM.

Collaborator

I mean use_tpu_communicator, use_xpu_communicator, or use_hpu_communicator; any one of them is fine.

Collaborator Author

They are checked in super().__init__, right?

Collaborator

For example, the use_tpu_communicator check in super().__init__ only works for the TPU communicator; we would use it here for the NPU communicator, because there is no boolean for NPU to control this check.
I think we could just use use_tpu_communicator, since all of the use_xxx_communicator flags behave the same in vLLM.

Collaborator Author

I got your idea, thanks. I'll update it then.

from vllm.platforms import current_platform
device_comm_cls = resolve_obj_by_qualname(
current_platform.get_device_communicator_cls())
self.communicator = device_comm_cls(group=self.device_group,
Member

@Yikun Yikun Feb 10, 2025

Does this still depend on CommunicatorBase from vllm-project/vllm? It seems CommunicatorBase should also move to vllm-ascend?

https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/communicator.py#L21

Collaborator Author

Removed CommunicatorBase in the new patch set.

# Remove this file once the functionality is supported by vLLM via
# https://github.com/vllm-project/vllm/pull/11324.

from vllm.distributed.parallel_state import GroupCoordinator
Member

Unrelated, but just curious: should vllm be a one-line dependency of vllm-ascend in requirements and pyproject?

Collaborator Author

@wangxiyuan wangxiyuan Feb 10, 2025

Hmm, let's give it a try; we can add it.

IMO, though, it may raise an error, because there is no CPU version of PyTorch on PyPI.

Once it's added, the install steps as I see them would be:

  1. Install the CPU version of PyTorch by hand (torch==2.5.1+cpu).
  2. pip install vllm-ascend

Member

No worries, we can do it in a follow-up.

@wangxiyuan wangxiyuan force-pushed the add_patch branch 2 times, most recently from 4da98ee to 57f3aca on February 10, 2025 07:50
dist.all_reduce(x, group=self.group)
return x

def gather(self, input_: torch.Tensor, dst: int = 0, dim: int = -1):
Contributor

Do we have any UT to check the functionality?

Collaborator Author

The communicator tests need more than one NPU card, which the current CI doesn't support. We're working on multi-card support for the CI system.

For now, we need to test this PR by hand locally and be careful when merging it.
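Until multi-card CI exists, the collective semantics can at least be pinned down by a single-process simulation (a sketch with invented names, not the PR's actual test code):

```python
def simulated_gather(rank_inputs, dst=0):
    """Single-process stand-in for a collective gather.

    rank_inputs: one input per simulated rank.
    Returns one result per rank: the dst rank receives every rank's
    input in rank order; all other ranks receive None (mirroring the
    quoted snippet, where output_tensor is None off the destination).
    """
    world_size = len(rank_inputs)
    return [list(rank_inputs) if rank == dst else None
            for rank in range(world_size)]

# Three simulated ranks gathering onto rank 0:
outputs = simulated_gather([[1], [2], [3]], dst=0)
print(outputs)
```

Such a simulation only checks the contract (who gets the data, in what order), not the actual HCCL/NPU path, so it complements rather than replaces the manual multi-card run mentioned above.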

output_tensor = None
return output_tensor

def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor:
Contributor

ditto

@wangxiyuan
Collaborator Author

Do not merge until it's fully tested locally. Thanks.

@Yikun
Member

Yikun commented Feb 10, 2025

vllm-ascend/mypy.ini

Lines 12 to 14 in 7006835

; Remove this after https://github.com/vllm-project/vllm/pull/11324 merged
[mypy-vllm.distributed.device_communicators.base_communicator]
ignore_missing_imports = True

This should also be removed.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Member

@Yikun Yikun left a comment

LGTM if it passes in a multi-card env.

@wangxiyuan wangxiyuan closed this Feb 11, 2025
@wangxiyuan wangxiyuan deleted the add_patch branch February 11, 2025 02:46
@wangxiyuan wangxiyuan restored the add_patch branch February 11, 2025 02:53
@wangxiyuan
Collaborator Author

See #30

wangxiyuan pushed a commit that referenced this pull request Feb 12, 2025
### What this PR does / why we need it?
- Remove on communicator mypy to address:
#24 (comment)
- Add mypy.ini to trigger list

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
…m-project#45)

### What this PR does / why we need it?
- Remove on communicator mypy to address:
vllm-project#24 (comment)
- Add mypy.ini to trigger list

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
zhangsicheng5 pushed a commit to zhangsicheng5/vllm-ascend that referenced this pull request Sep 18, 2025
wuhang2014 pushed a commit to wuhang2014/vllm-ascend that referenced this pull request Nov 27, 2025
mooncake store connector support ipv6

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Bounty-hunter pushed a commit to Bounty-hunter/vllm-ascend that referenced this pull request Dec 9, 2025
* [Feature][EPD] Supports EPD (vllm-project#13)

* epd shm



* epd shm



* epd shm



---------

Signed-off-by: wuhang <wuhang6@huawei.com>
Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: wuhang <wuhang6@huawei.com>

* [Bugfix] change the input param for func maybe_save_ec_to_connector

[Bugfix] change the input param for func maybe_save_ec_to_connector

* [Bugfix] change the wrong input param name for func maybe_save_ec_to_connector (vllm-project#15)

* [Feature]Mooncake store ECConnector: wait for ec save (vllm-project#16)

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* support kv mooncake store connector (vllm-project#22)

* wait for ec save

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* KV mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* mooncake store connector support ipv6

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* adapt EC connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* support kv mooncake store connector (vllm-project#22)

* wait for ec save

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* KV mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* kv mooncake store connector support Ipv6 (vllm-project#24)

mooncake store connector support ipv6

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* Implement yuanrong-datasystem connector.

* adapt EC connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* adapt to ascend v0.11.0rc2

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* adapt to v0.11.0rc2

* fix precommit

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

---------

Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: wuhang <wuhang6@huawei.com>
Co-authored-by: Zeng Chuang <zengchuang3@huawei.com>
Co-authored-by: yangsonglin <yangsonglin0821@163.com>
Bounty-hunter pushed a commit to Bounty-hunter/vllm-ascend that referenced this pull request Dec 10, 2025
NickJudyHvv pushed a commit to NickJudyHvv/vllm-ascend that referenced this pull request Mar 2, 2026
* Adapt chunked prefill

* Adapt MLA BNSD

* Adapt MLA 2048 mask
NickJudyHvv pushed a commit to NickJudyHvv/vllm-ascend that referenced this pull request Mar 2, 2026
LICO1314 added a commit to LICO1314/vllm-ascend that referenced this pull request Mar 24, 2026
…atch sunset plan

- Move _c8_kv_scale_weight_loader and AscendC8KVCacheAttentionMethod from
  w8a8_dynamic.py into kv_c8.py so all KV-cache C8 quantization methods
  (both MLA/FAKQuant and dense-attention/QuaRot) live in one place
- Update import paths in modelslim_config.py, attention_v1.py, and tests
- Clean up patch_qwen3_c8.py: remove redundant module docstring to match
  the style of other patch files (license header only)
- Add entry vllm-project#24 for patch_qwen3_c8.py in patch/__init__.py with Why/How/
  Related PR/Future Plan following the established sunset-plan format

Signed-off-by: lico67373 <918688502@qq.com>
Made-with: Cursor
LICO1314 added a commit to LICO1314/vllm-ascend that referenced this pull request Mar 24, 2026
Adds C8 (INT8) KV cache quantization support for standard GQA attention
models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel scales to
store KV cache in INT8, reducing KV cache memory by ~50% compared to
BF16, enabling higher batch concurrency and longer context on the same
hardware.

Key changes:
- attention_v1.py: new AscendC8AttentionBackendImpl subclass with
  _prepare_c8_scales, _quantize_kv_to_int8, _forward_c8_decode,
  _forward_c8_chunked_prefill, and _forward_c8_fused_infer_attention.
- kv_c8.py: AscendC8KVCacheAttentionMethod creates k/v_cache_scale/
  offset parameters and upgrades the attention impl via class surgery;
  _c8_kv_scale_weight_loader handles per-channel scale shapes.
- modelslim_config.py: activates C8 branch when kv_cache_type == "C8"
  in quant_model_description.json.
- patch_qwen3_c8.py: intercepts C8 scale/offset weights before
  AutoWeightsLoader discards them.
- patch/__init__.py: documents patch vllm-project#24 with a sunset plan.
- tests/ut/quantization/test_kv_c8.py: unit tests for all C8 helpers.

Signed-off-by: lico67373 <918688502@qq.com>
Made-with: Cursor
LICO1314 added a commit to LICO1314/vllm-ascend that referenced this pull request Mar 24, 2026
Adds C8 (INT8) KV cache quantization support for standard GQA attention
models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel scales to
store KV cache in INT8, reducing KV cache memory by ~50% compared to
BF16, enabling higher batch concurrency and longer context on the same
hardware.

Key changes:
- attention_v1.py: new AscendC8AttentionBackendImpl subclass with
  _prepare_c8_scales, _quantize_kv_to_int8, _forward_c8_decode,
  _forward_c8_chunked_prefill, and _forward_c8_fused_infer_attention.
- kv_c8.py: AscendC8KVCacheAttentionMethod creates k/v_cache_scale/
  offset parameters and upgrades the attention impl via class surgery;
  _c8_kv_scale_weight_loader handles per-channel scale shapes.
- modelslim_config.py: activates C8 branch when kv_cache_type == "C8"
  in quant_model_description.json.
- patch_qwen3_c8.py: intercepts C8 scale/offset weights before
  AutoWeightsLoader discards them.
- patch/__init__.py: documents patch vllm-project#24 with a sunset plan.
- tests/ut/quantization/test_kv_c8.py: unit tests for all C8 helpers.
- tests/ut/quantization/test_modelslim_config.py: C8 branch coverage.

Signed-off-by: lico67373 <918688502@qq.com>
Made-with: Cursor
LICO1314 added a commit to LICO1314/vllm-ascend that referenced this pull request Mar 24, 2026
LICO1314 added a commit to LICO1314/vllm-ascend that referenced this pull request Mar 24, 2026
LICO1314 added a commit to LICO1314/vllm-ascend that referenced this pull request Mar 25, 2026