[Bug] Fix compile error for swap_blocks_batch in CUDA 13#38915
Merged
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Contributor
Code Review
This pull request introduces conditional compilation to handle an API change in cuMemcpyBatchAsync for CUDA 13.0, which removed the fail_idx parameter. The code now provides a specific implementation for CUDA 13.0 while maintaining the previous logic for older versions (12.8+). I have no feedback to provide.
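The version gate described in this summary can be sketched without a CUDA toolchain. The block below is an illustration under stated assumptions, not the code from this PR: `DEMO_CUDA_VERSION` stands in for the `CUDA_VERSION` macro from `<cuda.h>`, and `demo_memcpy_batch` and `demo_swap_blocks_batch` are hypothetical host-side stand-ins that model only the one difference the summary mentions, the presence or absence of a trailing `fail_idx` output parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for CUDA_VERSION from <cuda.h>; 13000 would correspond to CUDA 13.0.
#ifndef DEMO_CUDA_VERSION
#define DEMO_CUDA_VERSION 13000
#endif

#if DEMO_CUDA_VERSION >= 13000
// Hypothetical stand-in for the newer signature: no fail_idx parameter.
inline int demo_memcpy_batch(void** dsts, void** srcs, std::size_t* sizes,
                             std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
  return 0;  // 0 plays the role of CUDA_SUCCESS in this demo
}
#else
// Hypothetical stand-in for the older signature: takes a fail_idx output.
inline int demo_memcpy_batch(void** dsts, void** srcs, std::size_t* sizes,
                             std::size_t count, std::size_t* fail_idx) {
  for (std::size_t i = 0; i < count; ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
  *fail_idx = count;  // sentinel chosen for this demo only: nothing failed
  return 0;
}
#endif

// One call site, two signatures: only the argument list is version-gated.
inline int demo_swap_blocks_batch(std::vector<void*>& dsts,
                                  std::vector<void*>& srcs,
                                  std::vector<std::size_t>& sizes) {
  assert(dsts.size() == sizes.size() && srcs.size() == sizes.size());
#if DEMO_CUDA_VERSION >= 13000
  return demo_memcpy_batch(dsts.data(), srcs.data(), sizes.data(),
                           sizes.size());
#else
  std::size_t fail_idx = 0;
  return demo_memcpy_batch(dsts.data(), srcs.data(), sizes.data(),
                           sizes.size(), &fail_idx);
#endif
}
```

Keeping the `#if` inside a single wrapper means callers never see the signature difference. Compiling with `-DDEMO_CUDA_VERSION=12080` exercises the older branch of the sketch.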
tlrmchlsmth approved these changes on Apr 3, 2026
CUresult result = cuMemcpyBatchAsync(
    reinterpret_cast<CUdeviceptr*>(const_cast<int64_t*>(dst_data)),
    reinterpret_cast<CUdeviceptr*>(const_cast<int64_t*>(src_data)),
    reinterpret_cast<size_t*>(const_cast<int64_t*>(size_data)),
Member
I realize this was there before, but we should not need to `const_cast` these. Perhaps we should remove the constness of `dst_data` in the declaration above.
Member
Author
Nice catch, fixed, thanks!
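The suggestion in this thread can be illustrated with a small host-only sketch. Everything named here is an assumption for illustration: `DemoDevicePtr` stands in for `CUdeviceptr`, `demo_batch_api` mocks an API that takes mutable pointer arrays, and the two `submit_*` helpers are hypothetical. The point is only that declaring the data pointers non-const leaves a single `reinterpret_cast` at the call site.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for CUdeviceptr (hypothetical; the real type comes from <cuda.h>).
using DemoDevicePtr = std::uint64_t;

// Mock of an API that takes mutable pointer arrays, like the call in the diff.
// It only reads the arrays and counts entries whose source and destination differ.
inline std::size_t demo_batch_api(DemoDevicePtr* dsts, DemoDevicePtr* srcs,
                                  std::size_t count) {
  std::size_t differing = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (dsts[i] != srcs[i]) ++differing;
  }
  return differing;
}

// Before: const-qualified inputs force a const_cast on every argument.
inline std::size_t submit_with_const_cast(const std::int64_t* dst_data,
                                          const std::int64_t* src_data,
                                          std::size_t count) {
  return demo_batch_api(
      reinterpret_cast<DemoDevicePtr*>(const_cast<std::int64_t*>(dst_data)),
      reinterpret_cast<DemoDevicePtr*>(const_cast<std::int64_t*>(src_data)),
      count);
}

// After: the pointers are mutable at the declaration, so one cast suffices.
inline std::size_t submit_without_const_cast(std::int64_t* dst_data,
                                             std::int64_t* src_data,
                                             std::size_t count) {
  return demo_batch_api(reinterpret_cast<DemoDevicePtr*>(dst_data),
                        reinterpret_cast<DemoDevicePtr*>(src_data), count);
}
```

Both helpers behave the same; the second simply states in its signature what the callee requires, which is the trade-off the reviewer points at.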
Signed-off-by: yewentao256 <zhyanwentao@126.com>
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request on Apr 6, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request on Apr 7, 2026
…ect#38915) Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request on Apr 9, 2026
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request on Apr 21, 2026
### What this PR does / why we need it?

Refer to vllm-project/vllm#38460 and vllm-project/vllm#38915. CANN 8.5.0+ uses aclrtMemcpyBatchAsync, while older CANN versions use aclrtMemcpyAsync for KV cache offloading. The build detects the CANN environment and compiles in the appropriate transfer function automatically; the choice can also be made manually through an environment variable:

1. Batch memcpy (needs CANN ≥ 8.5): `export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=1`, then `pip install -e .`
2. Normal memcpy: `export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=0`, then `pip install -e .`

### How was this patch tested?

Test results:

- main: TTFT 307 ms, TPOT 49.96 ms
- this PR: TTFT 272.82 ms, TPOT 41.04 ms

Model script:

    export TP=1
    export MODEL_PATH=/nas/disk1/Qwen3-14B
    export MODEL_NAME=Qwen3-14B
    export PORT=10113
    export CUDA_VISIBLE_DEVICES=3
    export ASCEND_RT_VISIBLE_DEVICES=3
    python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.7 --no-enable-prefix-caching --max-model-len 32768 --trust-remote-code \
      --block-size 128 \
      --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec", "spec_module_path": "vllm_ascend.kv_offload.npu"}}'

Test script:

    export MODEL_NAME=/nas/disk1/Qwen3-14B
    python /model/xk/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py --url http://127.0.0.1:10113 --model $MODEL_NAME --served-model-name Qwen3-14B --seed 1234 --input-file /model/xk/vllm/benchmarks/multi_turn/generate_multi_turn.json \
      --num-clients 8 --max-active-conversations 24

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@35141a7

---------

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: kx <1670186653@qq.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request on Apr 21, 2026
…-project#7819)
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request on Apr 21, 2026
…-project#7819)
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request on Apr 24, 2026
…-project#7819) Signed-off-by: guxin108 <1252896542@qq.com>
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request on Apr 28, 2026
…-project#7819) Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request on May 6, 2026
…-project#7819)
PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request on May 7, 2026
…-project#7819) Signed-off-by: PiratePai <416932041@qq.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request on May 10, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request on May 10, 2026
…-project#7819) Signed-off-by: yangzhe-2026 <yangzhe@isrc.iscas.ac.cn>
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request on May 12, 2026
…-project#7819) Signed-off-by: ZhuQi-seu <zhuqi12@huawei.com>
nanxingMy pushed a commit to nanxingMy/vllm-ascend that referenced this pull request on May 15, 2026
…-project#7819) Signed-off-by: nanxing <1014662416@qq.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request on May 15, 2026
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request on May 15, 2026
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request on May 20, 2026
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request on Jun 4, 2026
…ect#38915) Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Purpose
Originally
Now it is fixed