from vllm.platforms import current_platform
device_comm_cls = resolve_obj_by_qualname(
    current_platform.get_device_communicator_cls())
self.communicator = device_comm_cls(group=self.device_group,
I think we should check that use_xxx_communicator (any of them is fine, since they all stay in sync) and world_size > 1 are true before creating the communicator.
https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L167-L169
Besides the model parallel groups, there is also a world group, which never uses device communication. Adding this check saves time when creating the world group.
I added a world_size check in the new patchset. There is no use_xxx_communicator in vllm.
I mean use_tpu_communicator, use_xpu_communicator, or use_hpu_communicator; any one of them is fine.
They are checked in super().__init__, right?
For example, the use_tpu_communicator check in super().__init__ only applies to the TPU communicator. We would reuse it here for the NPU communicator, because there is no boolean flag for NPU to control this check.
I think we could just use use_tpu_communicator, since all the use_xxx_communicator flags stay in sync in vLLM.
Got it, thanks. I'll update.
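The guard discussed above could look roughly like this. A minimal sketch only: `maybe_create_communicator` and its parameter names are hypothetical, not the actual vllm-ascend code.

```python
# Hypothetical sketch of the guard discussed above; not the real
# vllm-ascend implementation.
def maybe_create_communicator(world_size, use_device_communicator,
                              comm_cls, device_group):
    """Skip communicator creation for single-rank groups (e.g. the
    world group), which never perform device communication."""
    if world_size > 1 and use_device_communicator:
        return comm_cls(group=device_group)
    return None

# The single-rank world group skips communicator creation entirely,
# so initializing it stays cheap.
world_comm = maybe_create_communicator(
    1, True, lambda group: f"comm on {group}", "world_group")
tp_comm = maybe_create_communicator(
    2, True, lambda group: f"comm on {group}", "tp_group")
```

Here `world_comm` is `None` while `tp_comm` is created, which is exactly the time saving mentioned for the world group.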
from vllm.platforms import current_platform
device_comm_cls = resolve_obj_by_qualname(
    current_platform.get_device_communicator_cls())
self.communicator = device_comm_cls(group=self.device_group,
Does this still depend on the vllm-project/vllm CommunicatorBase? It seems CommunicatorBase should also move to vllm-ascend.
Removed CommunicatorBase in the new patchset.
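For context, resolve_obj_by_qualname in the quoted diff follows the usual dotted-path import pattern: the platform names its communicator class as a plain string, and the string is resolved to the class at group-creation time. A minimal re-implementation of that pattern (a sketch, not vLLM's actual code) looks like:

```python
import importlib


def resolve_obj_by_qualname(qualname: str):
    """Resolve 'package.module.Name' to the named object, so a platform
    plugin can point at its communicator class with a plain string."""
    module_name, _, obj_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, obj_name)


# Example: resolve a stdlib class by its dotted path.
cls = resolve_obj_by_qualname("collections.OrderedDict")
```

This indirection is what lets vllm-ascend supply its own communicator class without vLLM importing it directly.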
# Remove this file once vllm supports this via
# https://github.com/vllm-project/vllm/pull/11324.

from vllm.distributed.parallel_state import GroupCoordinator
Unrelated, but just curious: should vllm be a dependency of vllm-ascend, as one line in requirements and pyproject?
Hmm, let's give it a try; we can add it.
IMO it may raise an error, though, because there is no CPU version of PyTorch on PyPI.
Once it's added, the install steps would look like this:
- install the CPU version of PyTorch by hand (torch==2.5.1+cpu)
- pip install vllm-ascend
No worries, we can do it in a follow-up.
dist.all_reduce(x, group=self.group)
return x

def gather(self, input_: torch.Tensor, dst: int = 0, dim: int = -1):
Do we have any UT to check the functionality?
Communicator tests need more than one NPU card, which the current CI doesn't support. We're working on multi-card support for the CI system.
Given that, we need to test this PR by hand locally and be careful when merging it.
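Until multi-card CI lands, single-card unit tests can at least verify the call wiring by mocking the collective ops. A hedged sketch of that style of test: FakeCommunicator here is an illustrative stand-in, not the actual vllm-ascend class, which wraps torch.distributed collectives on NPU process groups.

```python
from unittest.mock import MagicMock


class FakeCommunicator:
    """Toy stand-in for the communicator under review; the real one
    dispatches to torch.distributed with an NPU process group."""

    def __init__(self, dist, group):
        self.dist = dist
        self.group = group

    def all_reduce(self, x):
        # Mirrors the quoted diff: reduce in place, return the tensor.
        self.dist.all_reduce(x, group=self.group)
        return x


# Mock the dist module so the test runs on a single card (or no card).
dist = MagicMock()
comm = FakeCommunicator(dist, group="tp_group")
out = comm.all_reduce([1, 2, 3])

# Verify the collective was invoked once with the right group.
dist.all_reduce.assert_called_once_with([1, 2, 3], group="tp_group")
```

This kind of test doesn't validate numerics across ranks, only plumbing, so local multi-card runs are still needed before merging, as noted above.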
output_tensor = None
return output_tensor

def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor:
Do not merge until it's fully tested locally. Thanks.

Lines 12 to 14 in 7006835: this should also be removed.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Yikun left a comment:
LGTM if it passes in a multi-card env.
See #30
### What this PR does / why we need it?
- Remove mypy on communicator to address: #24 (comment)
- Add mypy.ini to trigger list

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
…q_pr_new fix pre-commit + fix comment
mooncake store connector support ipv6 Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
* [Feature][EPD] Supports EPD (vllm-project#13)
* [Bugfix] Change the input param for func maybe_save_ec_to_connector
* [Bugfix] Change the wrong input param name for func maybe_save_ec_to_connector (vllm-project#15)
* [Feature] Mooncake store ECConnector: wait for ec save (vllm-project#16)
* Support kv mooncake store connector (vllm-project#22): wait for ec save; KV mooncake store connector
* Mooncake store connector: support IPv6
* Adapt EC connector
* kv mooncake store connector: support IPv6 (vllm-project#24)
* Implement yuanrong-datasystem connector
* Adapt to ascend v0.11.0rc2
* Fix pre-commit

Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: wuhang <wuhang6@huawei.com>
Co-authored-by: Zeng Chuang <zengchuang3@huawei.com>
Co-authored-by: yangsonglin <yangsonglin0821@163.com>
* Adapt chunked prefill
* Adapt MLA BNSD
* Adapt MLA 2048 mask
…atch sunset plan
- Move _c8_kv_scale_weight_loader and AscendC8KVCacheAttentionMethod from w8a8_dynamic.py into kv_c8.py so all KV-cache C8 quantization methods (both MLA/FAKQuant and dense-attention/QuaRot) live in one place
- Update import paths in modelslim_config.py, attention_v1.py, and tests
- Clean up patch_qwen3_c8.py: remove redundant module docstring to match the style of other patch files (license header only)
- Add entry vllm-project#24 for patch_qwen3_c8.py in patch/__init__.py with Why/How/Related PR/Future Plan following the established sunset-plan format

Signed-off-by: lico67373 <918688502@qq.com>
Made-with: Cursor
Adds C8 (INT8) KV cache quantization support for standard GQA attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel scales to store the KV cache in INT8, reducing KV cache memory by ~50% compared to BF16 and enabling higher batch concurrency and longer context on the same hardware.

Key changes:
- attention_v1.py: new AscendC8AttentionBackendImpl subclass with _prepare_c8_scales, _quantize_kv_to_int8, _forward_c8_decode, _forward_c8_chunked_prefill, and _forward_c8_fused_infer_attention.
- kv_c8.py: AscendC8KVCacheAttentionMethod creates k/v_cache_scale/offset parameters and upgrades the attention impl via class surgery; _c8_kv_scale_weight_loader handles per-channel scale shapes.
- modelslim_config.py: activates the C8 branch when kv_cache_type == "C8" in quant_model_description.json.
- patch_qwen3_c8.py: intercepts C8 scale/offset weights before AutoWeightsLoader discards them.
- patch/__init__.py: documents patch vllm-project#24 with a sunset plan.
- tests/ut/quantization/test_kv_c8.py: unit tests for all C8 helpers.
- tests/ut/quantization/test_modelslim_config.py: C8 branch coverage.

Signed-off-by: lico67373 <918688502@qq.com>
Made-with: Cursor
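The static per-channel scheme described above can be illustrated with a toy sketch (pure Python, not the Ascend kernels): each channel c carries a fixed, offline-calibrated scale s[c], values are stored as q = clamp(round(v / s[c]), -128, 127), and dequantization is simply q * s[c].

```python
def quantize_per_channel(row, scales):
    """Quantize one KV row to INT8 with a fixed scale per channel.

    Toy illustration of C8-style static quantization, not the real
    Ascend implementation.
    """
    q = []
    for v, s in zip(row, scales):
        x = round(v / s)
        q.append(max(-128, min(127, x)))  # saturate to INT8 range
    return q


def dequantize_per_channel(q_row, scales):
    return [q * s for q, s in zip(q_row, scales)]


# Power-of-two scales keep this toy example exact.
scales = [0.5, 0.25, 0.5]          # static, calibrated offline per channel
row = [1.0, -3.25, 100.0]
q = quantize_per_channel(row, scales)       # [2, -13, 127]  (200 saturates)
approx = dequantize_per_channel(q, scales)  # [1.0, -3.25, 63.5]
```

The saturated third channel shows the trade-off: values outside the calibrated range clip, which is why the scales are derived from calibration data rather than chosen arbitrarily.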
Some PRs for plugin support have not been merged by vllm yet. This PR adds monkey patches to vllm-ascend so that vllm-ascend works with vllm directly.
The patch code should be removed once the related functionality is supported by vllm upstream.
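The monkey-patch approach amounts to replacing attributes on the imported vLLM modules when the plugin loads. A generic sketch of the pattern, using toy classes rather than the actual vLLM symbols:

```python
class UpstreamCoordinator:
    """Toy stand-in for an upstream class the plugin needs to patch
    (imagine something like vllm's GroupCoordinator)."""

    def all_gather(self, x):
        raise NotImplementedError("no NPU support upstream yet")


def patched_all_gather(self, x):
    """Plugin-side replacement; the real patch would call NPU ops."""
    return ["npu-gathered", x]


# Applied once at plugin import time; deleted once upstream gains
# native support, per the sunset plan above.
UpstreamCoordinator.all_gather = patched_all_gather

coord = UpstreamCoordinator()
result = coord.all_gather(42)
```

Because the patch rebinds a method on the upstream class itself, every existing and future instance picks it up, which is what lets the plugin work without forking vLLM, at the cost of needing the sunset plan tracked above.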