[Bugfix] Runtime driver check for cuMemcpyBatchAsync in swap_blocks_batch by Etelis · Pull Request #38919 · vllm-project/vllm

Etelis · 2026-04-03T15:18:33Z

Fixes two issues introduced by swap_blocks_batch (#38460):

1. undefined symbol: cuMemcpyBatchAsync on CUDA drivers < 12.8 (@JaheimLee): pre-built wheels hard-link the symbol, so importing vllm._C crashes on older drivers.
2. Compile error on CUDA 13.0 (@bbrowning, @eugr): the CUDA 13.0 headers #define cuMemcpyBatchAsync as cuMemcpyBatchAsync_v2 (8 params), which breaks the original 9-param call.

#38915 fixed problem 2 with compile-time #ifdef branching but left problem 1 open. This PR supersedes that approach by resolving the function at runtime via cuGetProcAddress("cuMemcpyBatchAsync", ..., 12080):

No direct symbol in the binary -> no crash on old drivers
String-based lookup -> immune to CUDA 13.0 #define remapping

…ks_batch Replace the compile-time-only #ifdef guard for cuMemcpyBatchAsync with a runtime resolution via cuGetProcAddress. Pre-built wheels compiled with CUDA 12.8+ would fail with "undefined symbol: cuMemcpyBatchAsync" on systems with older CUDA drivers (e.g. driver 12.1). The function pointer is now resolved lazily and cached, falling back to individual cudaMemcpyAsync calls when the driver lacks support. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
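
The resolve-once, cache, and fall-back flow described in this commit message can be sketched without any CUDA dependency. This is a minimal model under stated assumptions, not the code from csrc/cache_kernels.cu: `BatchCopyFn`, `resolve_batch_copy`, `g_driver_has_batch`, and `swap_blocks_sketch` are hypothetical names, and `std::memcpy` stands in for both the batched driver call and the individual cudaMemcpyAsync fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical batched entry point: copies `count` buffers in one call.
using BatchCopyFn = void (*)(void* const* dsts, const void* const* srcs,
                             const std::size_t* sizes, std::size_t count);

static bool g_driver_has_batch = false;  // simulates a driver without support
static int g_resolve_calls = 0;          // counts lookups, to show caching

// Stand-in for the batched driver function.
static void batch_copy_impl(void* const* dsts, const void* const* srcs,
                            const std::size_t* sizes, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
}

// Stand-in for the driver lookup; nullptr means "not supported".
static BatchCopyFn resolve_batch_copy() {
  ++g_resolve_calls;
  return g_driver_has_batch ? &batch_copy_impl : nullptr;
}

// Lazy resolution: the lookup runs on first use only, and the
// function-local static keeps the result for all later calls.
static BatchCopyFn get_batch_copy() {
  static const BatchCopyFn fn = resolve_batch_copy();
  return fn;
}

enum class CopyPath { Batched, PerBlock };

// Uses the batched call when it was resolved, otherwise copies one
// block at a time. Returns which path was taken.
static CopyPath swap_blocks_sketch(void* const* dsts, const void* const* srcs,
                                   const std::size_t* sizes,
                                   std::size_t count) {
  assert(count == 0 || (dsts && srcs && sizes));
  if (BatchCopyFn fn = get_batch_copy()) {
    fn(dsts, srcs, sizes, count);
    return CopyPath::Batched;
  }
  for (std::size_t i = 0; i < count; ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
  return CopyPath::PerBlock;
}
```

With `g_driver_has_batch` left at `false`, every call takes the per-block path and `g_resolve_calls` stays at 1, because the function-local static runs the lookup only once.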

gemini-code-assist

Code Review

This pull request updates the swap_blocks_batch function in csrc/cache_kernels.cu to resolve cuMemcpyBatchAsync at runtime using cuGetProcAddress. This change ensures that binaries compiled with CUDA 12.8+ remain compatible with older drivers by falling back to individual async copies if the batch function is unavailable. I have no feedback to provide as the implementation correctly handles the dynamic loading and fallback logic.

johnnynunez · 2026-04-03T17:18:57Z

cc @mgoin

eugr · 2026-04-03T17:23:47Z

Thanks, building with this PR now

bbrowning · 2026-04-03T18:08:20Z

After applying this patch on top of latest main, I was able to build vLLM from source again with CUDA 13 on my DGX Spark. So I'm hopeful that eugr will report success as well.

eugr · 2026-04-03T18:23:34Z

The rebuild was successful. The regression test pipeline is halfway through now; so far so good.

Etelis · 2026-04-03T19:39:51Z

I will also rerun it.
But please do update.

eugr · 2026-04-03T20:44:54Z

All checks passed, everything is good! Thanks for a quick turnaround!

johnnynunez · 2026-04-04T08:36:37Z

Resolve the conflicts

Etelis · 2026-04-04T11:37:21Z

@orozery cc

…ch-runtime-driver-check # Conflicts: # csrc/cache_kernels.cu Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

mgoin

Seems reasonable to have the fallback, and thanks folks for confirming the fix

orozery · 2026-04-05T07:04:27Z

@Etelis How does this work for CUDA 13 if it expects 8 arguments instead of 9?

Etelis · 2026-04-05T13:45:21Z

> @Etelis How does this work for CUDA 13 if it expects 8 arguments instead of 9?

cuGetProcAddress with version 12080 always returns the 9-param function pointer, regardless of the driver version. CUDA drivers maintain all old function versions — the 13.0 change was only a header-level #define remapping to cuMemcpyBatchAsync_v2. Since we resolve by string name + explicit version at runtime, the header macro doesn't apply. The BatchFn typedef matches the v1 signature exactly.

Tested with both.
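
The answer above rests on two properties: the lookup key is a string, which the preprocessor never rewrites, and the requested version selects which generation of the entry point comes back. A toy model of that behaviour, with no CUDA dependency, might look like the following. Everything in it is hypothetical: `batch_copy_v1`, `batch_copy_v2`, the `kExports` table, and `get_proc_address` are illustrative stand-ins, and the rule "newest export that is not newer than the requested version" is an assumed simplification of how cuGetProcAddress picks a version.

```cpp
#include <cassert>
#include <cstring>

// Two generations of a hypothetical entry point. The newer one drops a
// parameter, mirroring the 9-param to 8-param change discussed above.
static int batch_copy_v1(int count, int* fail_idx) {
  (void)count;
  if (fail_idx) *fail_idx = -1;
  return 1;  // identifies the v1 implementation
}
static int batch_copy_v2(int count) {
  (void)count;
  return 2;  // identifies the v2 implementation
}

// What a newer toolkit header does: direct calls to `batch_copy` now
// compile against the v2 signature.
#define batch_copy batch_copy_v2

using AnyFn = void (*)();

struct Export {
  const char* name;  // string literals are not touched by the macro above
  int min_version;
  AnyFn fn;
};

static const Export kExports[] = {
    {"batch_copy", 12080, reinterpret_cast<AnyFn>(&batch_copy_v1)},
    {"batch_copy", 13000, reinterpret_cast<AnyFn>(&batch_copy_v2)},
};

// Toy versioned lookup (assumed semantics): return the newest export whose
// min_version does not exceed the requested version, or nullptr if none.
static AnyFn get_proc_address(const char* name, int version) {
  assert(name != nullptr);
  AnyFn best = nullptr;
  int best_version = -1;
  for (const Export& e : kExports) {
    if (std::strcmp(e.name, name) == 0 && e.min_version <= version &&
        e.min_version > best_version) {
      best = e.fn;
      best_version = e.min_version;
    }
  }
  return best;
}

// The typedef must match the signature of the version being requested.
using BatchCopyV1Fn = int (*)(int, int*);

static BatchCopyV1Fn resolve_v1() {
  return reinterpret_cast<BatchCopyV1Fn>(
      get_proc_address("batch_copy", 12080));
}
```

In this model a direct call `batch_copy(3)` is rewritten by the macro and reaches the v2 function, while `resolve_v1()` still returns the two-parameter v1 function because it asks for version 12080 by name. A request below 12080 returns `nullptr`, which is the case the fallback path has to handle.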

Etelis · 2026-04-06T15:27:31Z

@mgoin Can we merge this?

markmc · 2026-04-08T07:43:56Z

Needed this on RHEL 9 with nvidia-driver-550.163.01 and CUDA 13, seems to work fine

JaheimLee · 2026-04-11T09:42:56Z

Hi, any update? @mgoin

mgoin · 2026-04-11T17:02:30Z

Thanks for the ping!

…atch (vllm-project#38919) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>

…atch (vllm-project#38919) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

…atch (vllm-project#38919) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

mergify Bot added the bug Something isn't working label Apr 3, 2026

gemini-code-assist Bot reviewed Apr 3, 2026

Etelis mentioned this pull request Apr 3, 2026

[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync #38460

Merged

Merge remote-tracking branch 'upstream/main' into fix/swap-blocks-bat…

dce438a

mgoin approved these changes Apr 4, 2026

mgoin added ready ONLY add when PR is ready to merge/full CI is needed nvidia labels Apr 4, 2026

github-project-automation Bot moved this to Ready in NVIDIA Apr 4, 2026

github-project-automation Bot added this to NVIDIA Apr 4, 2026

Etelis added 2 commits April 4, 2026 17:57

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

c28f27f

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

1ac370a

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

a02d256

Etelis added 5 commits April 8, 2026 17:19

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

ccf54fd

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

fa2deef

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

3d1d34c

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

18c2f4d

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

3fdfed6

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

5a0ebfa

Merge branch 'main' into fix/swap-blocks-batch-runtime-driver-check

1f17c9b

mgoin merged commit bd8bd52 into vllm-project:main Apr 11, 2026
142 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 11, 2026

jikunshang mentioned this pull request Apr 12, 2026

Add swap_blocks_batch op with batched async memcpy vllm-project/vllm-xpu-kernels#265

Merged

4 tasks


9 participants
