Custom AscendC op support in vllm_ascend #371

wangxiyuan merged 12 commits into vllm-project:v0.7.3-dev from
Conversation
ROPE_CUSTOM_KERNEL(half)
ROPE_CUSTOM_KERNEL(bfloat16_t)

enum struct TurboTypes {
Is this a duplicate of AscendTypes?
Yes, this part should be removed
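For illustration, here is a minimal sketch of the duplication being flagged; the member names below are assumptions, not the PR's actual definitions. If AscendTypes already enumerates the supported kernel dtypes, a parallel TurboTypes enum carries the same information and can drift out of sync.

```cpp
// Sketch only: two enums listing the supported kernel dtypes.
enum struct AscendTypes { kHalf, kBFloat16, kFloat };

// A second enum with the same role is redundant, which is why the
// reviewer suggests removing it.
// enum struct TurboTypes { kHalf, kBFloat16, kFloat };
```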
};

template <typename scalar_t>
__aicore__ inline void smem2smem(AscendC::LocalTensor<scalar_t> dst, AscendC::LocalTensor<scalar_t> src, int size)
If my understanding is correct, this method is used to copy tensors. Maybe we can give it a more understandable name, like tensorCopy?
BTW, just curious: why do we use AscendC::Copy instead of AscendC::DataCopy here?
DataCopy indicates HBM to on-chip-memory transfers, while Copy stands for on-chip-memory to on-chip-memory.
tensorCopy is a bit confusing actually; it's more related to the memory location. I adopted the shared-memory name here, but maybe I should use another name.
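To make the naming discussion concrete, here is a minimal sketch (an assumption, not the PR's actual body) of an on-chip copy helper like the one quoted above. As noted, AscendC::DataCopy moves data between HBM and on-chip memory, while AscendC::Copy works within on-chip memory, so a location-based name may read clearer than smem2smem or tensorCopy.

```cpp
#include "kernel_operator.h"  // standard AscendC kernel header

// Sketch only: copies `size` elements between two on-chip (LocalTensor)
// buffers. The element-wise loop stands in for the PR's AscendC::Copy
// call, whose exact parameters are not shown in the quoted hunk.
template <typename scalar_t>
__aicore__ inline void localTensorCopy(AscendC::LocalTensor<scalar_t> dst,
                                       AscendC::LocalTensor<scalar_t> src,
                                       int size)
{
    for (int i = 0; i < size; ++i) {
        dst.SetValue(i, src.GetValue(i));
    }
}
```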
Please merge this PR soon; the "sleep mode" feature depends on it. @wangxiyuan @MengqingCao
fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
uint32_t loop_cnt = (num_tokens + aivNum - 1) / aivNum;
rotary_embedding_kernel(dtype_num, is_neox, stream, position_ids_ptr, query_ptr, key_ptr, query_ptr,
For any case the kernel does not support, please fall back to the native implementation. So can we add the native implementation here, like what we do in torch?
Maybe we can add a fallback path in Python in the next PR?
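A minimal sketch of the fallback idea discussed here, under the assumption (names are illustrative, not this PR's code) that a host-side dispatcher picks between the custom AscendC kernel and the existing native path:

```cpp
#include <functional>

// Sketch only: route to the custom AscendC kernel for the supported
// neox-style case and fall back to the native implementation otherwise,
// until a Python-side fallback lands in a follow-up PR.
void rotary_embedding_dispatch(bool is_neox,
                               const std::function<void()>& custom_kernel,
                               const std::function<void()>& native_impl)
{
    if (is_neox) {
        custom_kernel();   // supported: launch the AscendC kernel
    } else {
        native_impl();     // unsupported: native rotary embedding path
    }
}
```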
### What this PR does / why we need it?
Add custom AscendC kernel support in vllm-ascend. This PR mainly includes 3 parts:
- AscendC implementation of rotary_embedding, and its unit test.
- CMakeLists.txt to compile the AscendC kernel and the related torch library binding to this kernel.
- Build and pack all the compiled .so files into the vllm_ascend package.
For now, this rotary embedding kernel does not support the scenario with `neoxStyle=False`, so it is not used in the actual modeling parts. We will soon add this implementation into vllm-ascend and integrate it into the modeling parts.

### Does this PR introduce _any_ user-facing change?
No change at all
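For context on the "torch library binding" part of the description, below is a minimal sketch of how a compiled kernel is typically registered as a custom torch op. The library name, op name, and signature are assumptions for illustration, not this PR's exact binding code.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical host-side launcher; in the real binding this would launch
// the compiled AscendC rotary-embedding kernel on the NPU stream.
void rotary_embedding(at::Tensor positions, at::Tensor query, at::Tensor key,
                      int64_t head_size, at::Tensor cos_sin_cache, bool is_neox) {
  (void)positions; (void)query; (void)key;
  (void)head_size; (void)cos_sin_cache; (void)is_neox;
}

// Registering the op makes it callable from Python, e.g. as
// torch.ops._ascend_C.rotary_embedding(...). Names are illustrative.
TORCH_LIBRARY(_ascend_C, ops) {
  ops.def("rotary_embedding", &rotary_embedding);
}
```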