[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync by Etelis · Pull Request #38460 · vllm-project/vllm

Etelis · 2026-03-29T09:58:50Z

Replace per-layer per-block swap_blocks calls in the KV cache offloading
handler with a single swap_blocks_batch call that submits all copies in
one driver invocation.

On CUDA 12.8+ this uses cuMemcpyBatchAsync; on older CUDA/ROCm it falls
back to a flat cudaMemcpyAsync loop with cudaMemcpyDefault. Zero extra
GPU memory. No behavior change.

Supersedes #38216 (rebased clean).

Benchmark Results

Hardware: 8xH100 80GB HBM3, CUDA 12.8/12.9.

Baseline runs the original files via
pip install vllm==0.18.0 --force-reinstall --no-deps. Each mode ran in a
fresh Python process to avoid stale module state.

Handler-level benchmark

Directly measures the transfer path in isolation — no model inference.

Setup: Instantiate CpuGpuOffloadingHandlers with BF16 GPU tensors shaped
by FlashAttentionBackend.get_kv_cache_shape() using each model's real
architecture.
Call handler.transfer_async(), poll get_finished() until complete. Measure
with time.perf_counter() (includes CUDA sync). 5 warmup + 100 measured
iterations. Baseline and batched ran as separate processes.
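
The measurement loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark script used for the PR: `FakeHandler` is a hypothetical stand-in for the real offloading handler, and its `transfer_async(job_id, spec)` / `get_finished()` shapes are assumptions made for the example.

```python
import statistics
import time


class FakeHandler:
    """Hypothetical stand-in for an offloading handler (not vLLM's class)."""

    def __init__(self, polls_until_done=3):
        self._polls_until_done = polls_until_done
        self._remaining = 0

    def transfer_async(self, job_id, spec):
        # Pretend to enqueue copies; completion is observed by polling.
        self._remaining = self._polls_until_done
        return True

    def get_finished(self):
        # Return a non-empty list once the pretend transfer has completed.
        self._remaining -= 1
        return [(0, True)] if self._remaining <= 0 else []


def time_transfers(handler, spec, warmup=5, iters=100):
    """Return per-iteration wall-clock times in microseconds."""
    samples = []
    for i in range(warmup + iters):
        start = time.perf_counter()
        handler.transfer_async(i, spec)
        while not handler.get_finished():
            pass  # busy-poll until the transfer reports completion
        elapsed_us = (time.perf_counter() - start) * 1e6
        if i >= warmup:  # discard warmup iterations
            samples.append(elapsed_us)
    return samples


samples = time_transfers(FakeHandler(), spec=None)
print(f"median: {statistics.median(samples):.1f} us over {len(samples)} runs")
```

Timing from submission until the poll reports completion means the measurement covers both CPU submission cost and the copy itself, which is the quantity the tables report.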

Config	Layers	KV Heads	Head Dim	Tensors	Per-tensor block size
LLaMA-8B	32	8	128	64	32 KB
LLaMA-70B	80	8	128	160	32 KB
LLaMA-70B TP=4	80	2	128	160	8 KB
Qwen2.5-0.5B	24	2	64	48	4 KB
Qwen2.5-1.5B	28	2	128	56	8 KB
Qwen2.5-3B	36	2	128	72	8 KB
Phi-3-mini	32	8	96	64	24 KB
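
The Tensors and per-tensor block size columns can be reproduced with a short calculation. This sketch assumes 16 tokens per block and 2 bytes per BF16 element; the 16-token figure is inferred from the table values and is not stated in the PR.

```python
BF16_BYTES = 2
TOKENS_PER_BLOCK = 16  # assumption: inferred from the table, not stated in the PR


def per_tensor_block_bytes(kv_heads, head_dim,
                           tokens_per_block=TOKENS_PER_BLOCK,
                           dtype_bytes=BF16_BYTES):
    """Bytes occupied by one block in a single K or V tensor."""
    return tokens_per_block * kv_heads * head_dim * dtype_bytes


def tensor_count(layers):
    """Separate K and V tensors per layer (FlashAttn K/V split)."""
    return 2 * layers


# (layers, KV heads, head dim) taken from the table above.
configs = {
    "LLaMA-8B": (32, 8, 128),
    "LLaMA-70B": (80, 8, 128),
    "LLaMA-70B TP=4": (80, 2, 128),
    "Qwen2.5-0.5B": (24, 2, 64),
    "Qwen2.5-1.5B": (28, 2, 128),
    "Qwen2.5-3B": (36, 2, 128),
    "Phi-3-mini": (32, 8, 96),
}

for name, (layers, kv_heads, head_dim) in configs.items():
    kib = per_tensor_block_bytes(kv_heads, head_dim) // 1024
    print(f"{name}: {tensor_count(layers)} tensors, {kib} KB per block")
```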

Per-layer tensors (FlashAttn K/V split, no cross-layer allocation):

Config	Tensors	Baseline	Batched	Speedup
LLaMA-8B, 64 blocks	64	16,358 us	4,519 us	3.6x
LLaMA-70B, 64 blocks	160	41,124 us	10,239 us	4.0x
LLaMA-70B TP=4, 32 blocks	160	20,427 us	3,299 us	6.2x
Qwen2.5-0.5B, 128 blocks	48	23,484 us	3,184 us	7.4x
Qwen2.5-1.5B, 64 blocks	56	14,042 us	2,272 us	6.2x
Qwen2.5-3B, 64 blocks	72	18,045 us	2,935 us	6.2x
Phi-3-mini, 64 blocks	64	15,963 us	3,888 us	4.1x

E2E vLLM serve — KV transfer bandwidth

Setup:

Start vllm serve with each real model, offloading enabled:

python -m vllm.entrypoints.openai.api_server \
    --model <model> --gpu-memory-utilization <0.4-0.5> \
    --kv-transfer-config '{"kv_connector":"OffloadingConnector",
        "kv_role":"kv_both",
        "kv_connector_extra_config":{"cpu_bytes_to_use":2147483648,
        "block_size":48}}' \
    --max-model-len 4096 --enforce-eager

Send sustained load: 8 concurrent Python threads, each sending prompts in a loop for 45 seconds.
Read the KV transfer bandwidth from vLLM's server log.
total_time is measured with CUDA events (start_event.elapsed_time(end_event)) inside SingleDirectionOffloadingHandler.get_finished().
Bandwidth = total_bytes / total_time.
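
As a small worked example of the bandwidth formula, the helpers below convert a byte count and an elapsed time into GB/s and compute a relative improvement. CUDA event `elapsed_time` reports milliseconds; treating 1 GB as 10^9 bytes is an assumption here, since the unit convention of the server log is not stated. Both function names are illustrative only.

```python
def bandwidth_gb_per_s(total_bytes, total_time_ms):
    """Bandwidth in GB/s (10**9 bytes) from bytes moved and elapsed milliseconds."""
    if total_time_ms <= 0:
        raise ValueError("total_time_ms must be positive")
    seconds = total_time_ms / 1e3
    return total_bytes / seconds / 1e9


def improvement_pct(baseline, batched):
    """Relative change of batched over baseline, in percent."""
    return (batched - baseline) / baseline * 100


# 32.4e9 bytes moved in 1000 ms of event time.
print(f"{bandwidth_gb_per_s(32_400_000_000, 1000.0):.1f} GB/s")
# Relative gain going from 27.9 GB/s to 32.4 GB/s.
print(f"{improvement_pct(27.9, 32.4):+.0f}%")
```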

Model	KV Heads	Baseline BW	Batched BW	Improvement
LLaMA-3.2-1B	8	27.9 GB/s	32.4 GB/s	+16%
Qwen2.5-1.5B	2	25.8 GB/s	31.8 GB/s	+23%
Qwen2.5-3B	2	29.5 GB/s	34.3 GB/s	+16%
Gemma-2-2B	4	40.2 GB/s	43.3 GB/s	+8%
Phi-3.5-mini	8	51.6 GB/s	53.3 GB/s	+3%
Qwen2.5-7B	4	34.2 GB/s	40.7 GB/s	+19%
LLaMA-3.1-8B	8	44.0 GB/s	47.0 GB/s	+7%
Mistral-7B	8	45.1 GB/s	48.6 GB/s	+8%

E2E serving throughput

Model	Baseline req/s	Batched req/s	Baseline p99 TTFT	Batched p99 TTFT
Qwen2.5-7B	6.0	7.1 (+18%)	6,077 ms	117 ms
LLaMA-3.1-8B	6.7	6.7	100.7 ms	105.7 ms
Mistral-7B	6.7	6.6	91.4 ms	82.5 ms

Qwen2.5-7B shows an 18% throughput improvement and a p99 TTFT reduction from
about 6 seconds to 117 ms under concurrent load. I suspect this is because the
model has 4 KV heads (vs 8 for LLaMA/Mistral), producing smaller per-block
copies where submission overhead dominates.

Replace per-layer per-block swap_blocks calls with a single batched swap_blocks_batch call that submits all copies in one driver invocation.

The existing offloading path issues L×N individual cudaMemcpyAsync calls (layers × block pairs), each incurring ~2μs of CPU submission overhead. For typical configurations (80 layers × 32 blocks = 2,560 calls), this overhead dominates the actual transfer time.

swap_blocks_batch collects all source/destination pointers and sizes into flat arrays, then:
- On CUDA 12.8+: submits them via cuMemcpyBatchAsync (one driver call)
- On older CUDA/ROCm: falls back to a loop of cudaMemcpyAsync with cudaMemcpyDefault (same behavior as before, no regression)

The Python side pre-computes base pointers and block sizes at init time, then builds the flat pointer arrays using vectorized numpy arithmetic.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
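
To make the flat-array idea concrete, here is a minimal sketch of the pointer arithmetic. It is not the PR's code: the PR describes vectorized numpy, while this version uses plain Python lists to stay self-contained. The function names, the made-up addresses, and the layout assumption that block `b` of a tensor starts at `base + b * block_bytes` are all illustrative.

```python
def build_copy_lists(src_bases, dst_bases, block_bytes, block_pairs):
    """Flatten (tensor x block pair) copies into src/dst/size lists.

    src_bases[i], dst_bases[i]: base addresses of tensor i on each side.
    block_bytes[i]: bytes per block in tensor i.
    block_pairs: list of (src_block_id, dst_block_id).
    """
    src_ptrs, dst_ptrs, sizes = [], [], []
    for src_base, dst_base, nbytes in zip(src_bases, dst_bases, block_bytes):
        for src_block, dst_block in block_pairs:
            src_ptrs.append(src_base + src_block * nbytes)
            dst_ptrs.append(dst_base + dst_block * nbytes)
            sizes.append(nbytes)
    return src_ptrs, dst_ptrs, sizes


def submission_overhead_us(num_tensors, num_block_pairs, per_call_us=2.0):
    """CPU submission cost if every copy is issued as its own call."""
    return num_tensors * num_block_pairs * per_call_us


# Two tensors with 8 KB blocks and two block pairs (addresses are made up).
src, dst, sizes = build_copy_lists(
    src_bases=[0x1000_0000, 0x2000_0000],
    dst_bases=[0x9000_0000, 0xA000_0000],
    block_bytes=[8192, 8192],
    block_pairs=[(3, 0), (7, 1)],
)
print(len(src), hex(src[0]), hex(dst[1]))

# 80 x 32 = 2,560 calls at ~2 us each, in microseconds.
print(submission_overhead_us(80, 32))
```

The three lists have one entry per copy, which is the shape a batched copy call needs; with the per-call figure quoted above, 2,560 separate submissions cost roughly 5 ms of CPU time before any data moves.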

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces swap_blocks_batch, a batched memory copy operation designed to reduce overhead during KV cache offloading. The implementation leverages cuMemcpyBatchAsync on supported CUDA versions (12.8+) and provides a fallback for older environments. The CPUGPUWorker is updated to aggregate transfer requests into a single batch call. A review comment suggests that the operation should be registered for the CUDA device in the Torch bindings instead of CPU.

gemini-code-assist · 2026-03-29T10:05:29Z

+  cache_ops.def(
+      "swap_blocks_batch(Tensor src_ptrs, Tensor dst_ptrs,"
+      "                  Tensor sizes) -> ()");
+  cache_ops.impl("swap_blocks_batch", torch::kCPU, &swap_blocks_batch);


The swap_blocks_batch operation is implemented in csrc/cache_kernels.cu and involves CUDA operations. However, it's being registered here for the torch::kCPU device. This is likely incorrect as the implementation relies on CUDA streams and memory copies. It should be registered for torch::kCUDA to ensure it's dispatched correctly when called on CUDA tensors.

Suggested change

-  cache_ops.impl("swap_blocks_batch", torch::kCPU, &swap_blocks_batch);
+  cache_ops.impl("swap_blocks_batch", torch::kCUDA, &swap_blocks_batch);

The input tensors (src_ptrs, dst_ptrs, sizes) are CPU tensors — they're numpy arrays of raw pointers/sizes converted via torch.from_numpy(). PyTorch dispatches based on the input tensor device, so kCPU is correct here. The existing swap_blocks uses kCUDA because its inputs are the actual GPU KV cache tensors. Registering with kCUDA would actually break dispatch since no input tensor lives on GPU.
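
The dispatch argument can be illustrated with a deliberately simplified toy model. This is not PyTorch's dispatcher (which works on dispatch key sets); `ToyTensor` and `ToyDispatcher` are hypothetical classes that only capture the point made above: the kernel is looked up by the device of the input tensors, so an op whose inputs are all host-side arrays needs a CPU registration.

```python
class ToyTensor:
    """Stand-in for a tensor; only its device matters here."""

    def __init__(self, device):
        self.device = device


class ToyDispatcher:
    """Toy registry keyed by (op name, device); not PyTorch's dispatcher."""

    def __init__(self):
        self._impls = {}

    def impl(self, op, device, fn):
        self._impls[(op, device)] = fn

    def call(self, op, *tensors):
        # Toy rule: any CUDA input selects the CUDA kernel, otherwise CPU.
        device = "cuda" if any(t.device == "cuda" for t in tensors) else "cpu"
        if (op, device) not in self._impls:
            raise NotImplementedError(f"{op} has no {device} implementation")
        return self._impls[(op, device)](*tensors)


def swap_blocks_batch(src_ptrs, dst_ptrs, sizes):
    return "copies submitted"


# Pointer and size arrays live in host memory, like torch.from_numpy() output.
args = [ToyTensor("cpu") for _ in range(3)]

cpu_registered = ToyDispatcher()
cpu_registered.impl("swap_blocks_batch", "cpu", swap_blocks_batch)
print(cpu_registered.call("swap_blocks_batch", *args))

cuda_registered = ToyDispatcher()
cuda_registered.impl("swap_blocks_batch", "cuda", swap_blocks_batch)
try:
    cuda_registered.call("swap_blocks_batch", *args)
except NotImplementedError as exc:
    print("dispatch failed:", exc)
```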

ivanium · 2026-04-01T01:20:53Z

+  static_assert(sizeof(size_t) == sizeof(int64_t));
+#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12080
+  CUmemcpyAttributes attr = {};
+  attr.srcAccessOrder = CU_MEMCPY_SRC_ACCESS_ORDER_STREAM;


Minor comment: have you tried CU_MEMCPY_SRC_ACCESS_ORDER_ANY (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g6f1ff58e3065df3eb4b573dba77ad31f)? I found it gives better CPU->GPU bandwidth on Grace Blackwell nodes.

I see you also applied this parameter to GPU srcs.
According to the documentation this means access to srcs can be out of stream, so potentially not waiting for the compute (default) stream to complete?

@Etelis Anyhow, for CPU->GPU this seems safe. Let's test it in a follow-up.

Thanks @ivanium for this suggestion!

Right, it won't wait for previous ops in the stream. Since we typically call this API in a separate copy stream, I guess we cannot rely on this AccessOrder param anyway. If we want to stay safe as a general purpose API, maybe we can expose a configurable param to users.

I'll test it as a followup

…t#38460) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Rishi Puri <riship@nvidia.com>

…t#38460) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com>

### What this PR does / why we need it?

Following vllm-project/vllm#38460 and vllm-project/vllm#38915, KV cache offloading uses aclrtMemcpyBatchAsync on CANN 8.5.0+ and aclrtMemcpyAsync on older CANN versions. The build automatically selects the appropriate transfer function based on the CANN environment, and the choice can also be set manually:

1. Batch memcpy (needs CANN ≥ 8.5):
    export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=1
    pip install -e .
2. Normal memcpy:
    export VLLM_ASCEND_ENABLE_BATCH_MEMCPY=0
    pip install -e .

### How was this patch tested?

Test results:

main: TTFT 307 ms, TPOT 49.96 ms
this PR: TTFT 272.82 ms, TPOT 41.04 ms

Model script:

export TP=1
export MODEL_PATH=/nas/disk1/Qwen3-14B
export MODEL_NAME=Qwen3-14B
export PORT=10113
export CUDA_VISIBLE_DEVICES=3
export ASCEND_RT_VISIBLE_DEVICES=3
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} \
    --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} \
    --tensor-parallel-size ${TP} --gpu-memory-utilization 0.7 \
    --no-enable-prefix-caching --max-model-len 32768 --trust-remote-code \
    --block-size 128 \
    --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec", "spec_module_path": "vllm_ascend.kv_offload.npu"}}'

Test script:

export MODEL_NAME=/nas/disk1/Qwen3-14B
python /model/xk/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --url http://127.0.0.1:10113 --model $MODEL_NAME --served-model-name Qwen3-14B \
    --seed 1234 --input-file /model/xk/vllm/benchmarks/multi_turn/generate_multi_turn.json \
    --num-clients 8 --max-active-conversations 24

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@35141a7

---------

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: kx <1670186653@qq.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>


Etelis requested review from ApostaC and orozery as code owners March 29, 2026 09:58

claude Bot reviewed Mar 29, 2026

Etelis mentioned this pull request Mar 29, 2026

[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync #38216

Closed

mergify Bot added the v1 label Mar 29, 2026

Merge branch 'main' into swap-blocks-batch-v2

bb8c84c

gemini-code-assist Bot reviewed Mar 29, 2026

orozery added the ready label Mar 29, 2026

orozery approved these changes Mar 29, 2026

Merge branch 'main' into swap-blocks-batch-v2

48e1981

HF-001 mentioned this pull request Mar 30, 2026

[Performance]Batch kvcache offloading via aclrtMemcpyBatchAsync vllm-project/vllm-ascend#7819

Merged

Merge branch 'main' into swap-blocks-batch-v2

43371f7

orozery enabled auto-merge (squash) March 30, 2026 09:58

Etelis and others added 10 commits March 30, 2026 15:24

Merge branch 'main' into swap-blocks-batch-v2

cfd1990

Merge branch 'main' into swap-blocks-batch-v2

f4e6fd3

Merge branch 'main' into swap-blocks-batch-v2

01cf9a8

Merge branch 'main' into swap-blocks-batch-v2

ce30ed0

Merge branch 'main' into swap-blocks-batch-v2

0733407

Merge branch 'main' into swap-blocks-batch-v2

6926c12

Merge branch 'main' into swap-blocks-batch-v2

1ac17a2

Merge branch 'main' into swap-blocks-batch-v2

f238c88

Merge branch 'main' into swap-blocks-batch-v2

d42ace5

Merge branch 'main' into swap-blocks-batch-v2

24f9716

ivanium reviewed Apr 1, 2026

orozery added 5 commits April 1, 2026 08:43

Merge branch 'main' into swap-blocks-batch-v2

47b1d6c

Merge branch 'main' into swap-blocks-batch-v2

b161b1a

Merge branch 'main' into swap-blocks-batch-v2

2712093

Merge branch 'main' into swap-blocks-batch-v2

602cdc1

Merge branch 'main' into swap-blocks-batch-v2

a78f1e2

Etelis mentioned this pull request Apr 8, 2026

Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY for batch KV cache swaps #39306

Merged

xiaobao520123 mentioned this pull request May 3, 2026

[Perf] Batch Weight Prefetching via cuMemcpyBatchAsync to Reduce Latency #41474

Open

4 tasks

Etelis mentioned this pull request May 5, 2026

[Perf][ROCm] Use hipMemcpyBatchAsync in swap_blocks_batch #41737

Closed

3 tasks

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

Revert "[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (vll…

7ab8f43

…m-project#38460)" This reverts commit e4a13ce.

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

Revert "[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (vll…

f4471ff

…m-project#38460)" This reverts commit 7b1c5e0.

[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync #38460
orozery merged 32 commits into vllm-project:main from Etelis:swap-blocks-batch-v2
