[Perf] Triton fast path for small CPU→GPU swap_blocks_batch in the offloading connector#42212
Conversation
cuMemcpyBatchAsync is descriptor-overhead-bound for small blocks. For uniformly-sized batches under 28 KiB per block, route the copy through a single Triton-kernel launch instead. Falls back to the existing C++ path above the threshold or for non-uniform sizes. Empirically tuned on H100 (PCIe Gen5): * num_sms = 12 — knee within 5% of peak; 9% of compute taken * threshold = 28 KiB — exact crossover identical across pair counts * 1.13-3.16x speedup below threshold; 1.00x above (no regression) Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a Triton-based fast path for the swap_blocks_batch operation, optimizing small uniform batches during KV offloading. The implementation includes a fallback to the original custom ops for large or non-uniform batches. Feedback indicates that the address tensors are currently allocated in unpinned memory, which causes the .to('cuda', non_blocking=True) calls to perform synchronous copies and block the CPU thread. It is recommended to use pinned memory for these tensors to ensure true asynchronous execution and avoid blocking the worker thread.
I am having trouble creating individual review comments. Click here to see my feedback.
vllm/v1/kv_offload/cpu/gpu_worker.py (319)
The batch_src and batch_dst tensors are created from unpinned numpy arrays (lines 292-293). Consequently, the .to('cuda', non_blocking=True) calls inside swap_blocks_batch (lines 56-57 of triton_swap.py) will result in synchronous host-to-device copies, blocking the CPU thread. To achieve true asynchronous execution and avoid blocking the scheduler/worker thread, consider using pinned memory for these address tensors (e.g., by allocating them with torch.empty(..., pin_memory=True) and using their numpy views).
|
Hi @Etelis, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
| job += num_progs | ||
|
|
||
|
|
||
| def swap_blocks_batch( |
There was a problem hiding this comment.
I think this function should move to gpu_worker.py.
| n = src_addrs.numel() | ||
| if n == 0: | ||
| return | ||
| bpj = int(sizes[0].item()) |
There was a problem hiding this comment.
Can we choose a more meaningful variable name?
| if n == 0: | ||
| return | ||
| bpj = int(sizes[0].item()) | ||
| if bpj >= _THRESHOLD_BYTES or bpj % 8 != 0 or not bool((sizes == bpj).all()): |
There was a problem hiding this comment.
Can we add a comment explaining this criteria for choosing between cudamemcpybatch/triton?
| _NUM_SMS = 12 | ||
| _THRESHOLD_BYTES = 28 * 1024 |
There was a problem hiding this comment.
Let's add a comment on why did we choose these default values.
| @triton.jit | ||
| def _kernel( |
There was a problem hiding this comment.
will this work for other architectures?
e.g. AMD, XPU, HPU, TPU?
There was a problem hiding this comment.
AMD — the kernel is plain Triton, so it should run on ROCm, but the premise doesn't carry over: So _THRESHOLD_BYTES and _NUM_SMS (SMs vs CUs, different PCIe gen) would need re-measuring on AMD before it'd be worth enabling there.
XPU / HPU / TPU — these can't use the OffloadingConnector CPU-offload path at all today? or am I missing something there?
There was a problem hiding this comment.
XPU / HPU / TPU — these can't use the OffloadingConnector CPU-offload path at all today? or am I missing something there?
Right. I'm just wondering if this can lead to an easy path to support offloading on these platforms using this triton kernel.
There was a problem hiding this comment.
Ah I get it.
The kernel itself would run on ROCm, but not on HPU/TPU (no Triton)
XPU might be but hacky.
|
Thanks @Etelis ! |
|
Also, I think that we can select which swap_blocks function to use on init (instead of per-transfer) after examining the minimum size of Refs' |
…eneous sizes Adds an explicit gpu_to_cpu kwarg used to gate the Triton fast path so only CPU->GPU reads take it; GPU->CPU writes always defer to the C++ DMA path. Generalizes the kernel to handle non-uniform per-job sizes (each job loads its own size from the sizes tensor) and adds an n >= 16 batch-size floor so n=1 calls don't take the fast path. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
…constants Wrapper now lives next to its only caller. triton_swap keeps the kernel and the empirically-tuned constants (_NUM_SMS, _THRESHOLD_BYTES, _MIN_N) with a comment explaining how each was derived. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Move the gate (direction, page-size threshold, 8-byte alignment) and the chunk-size computation out of the per-call path. The handler binds either ops.swap_blocks_batch or a Triton closure with chunk pre-baked, so per-call work shrinks to a single n>=_MIN_N check. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
|
What's the e2e performance for the tested models? I deprecated the SM-based approach because I found that copy kernels will contend SMs and L1/L2 caches with concurrent GPU kernels and slow down the computation. |
swap_blocks_batch on small uniform batchesswap_blocks_batch in the offloading connector
Thanks @ivanium — full detail (cliff charts, threshold/SM sweeps, complete tables) is in the updated PR description; e2e — gpt-oss-120B / TP=4,
The onload transfer goes ~3.8 → ~18–19 GB/s (4 KiB descriptors, right on the wdyt? it looks decent? do you want other benchmarking or models? |
…atch Squashed cherry-pick of vllm-project#42212 (Itay Etelis). Adds Triton fast path for small uniform CPU->GPU swap_blocks_batch, gated at handler init. Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
|
|
||
| from vllm import _custom_ops as ops | ||
| from vllm.logger import init_logger | ||
| from vllm.triton_utils import triton |
There was a problem hiding this comment.
Claude comment:
The import from vllm.triton_utils import triton is unconditional. If running on a platform where Triton isn't
available (e.g., ROCm older builds, CPU-only), this will fail at import time even for GPU→CPU handlers that wouldn't
use the Triton path. Consider guarding with a lazy import or checking triton is not None in _select_swap_blocks_fn.
Nice results! I am actually also curious about the overhead in the no-cache-hit settings (e.g., random workload), where the SM interference can be more visible. Also for results here, is |
…ocks_fn Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
The (src, dst, sizes) descriptor tensors were built with torch.from_numpy
(pageable), so `.to("cuda", non_blocking=True)` on the Triton swap path
silently fell back to a synchronous copy. Back them with pinned host memory
and fill via numpy views, so the per-swap H2D actually overlaps. No extra
copy: the numpy views share the pinned buffers.
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
The test compared against a device->device ops.swap_blocks_batch call, which cuMemcpyBatchAsync rejects with CUDA_ERROR_INVALID_VALUE. Validate the Triton kernel's output against the source bytes directly instead. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
|
Sorry all passing now |
…offloading connector (vllm-project#42212) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Root cause: upstream PR #42212 added a pinned descriptor-buffer pool (self._buffer_pool) and Transfer.batch_* fields to SingleDirectionOffloadingHandler. The HPU plugin only overrides __init__ and transfer_async, so the inherited upstream get_finished/shutdown access self._buffer_pool (never initialized in the HPU __init__), raising AttributeError and an EngineDeadError in the CPU offloading test. Upstream: vllm-project/vllm#42212 Fix: add HPU-specific get_finished/shutdown overrides consistent with the HPU Transfer dataclass and __init__ (no buffer-pool/batch_* usage). Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…offloading connector (vllm-project#42212) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: JisoLya <523420504@qq.com>
…offloading connector (vllm-project#42212) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…de HPU offloading handler get_finished/shutdown (+2 more) (#1525) This PR is the rolling hourly-CI fix PR. It consolidates 3 fixes against vllm@`4efd6ffde09477800294a8ed9cc752017812c3b1` per the single-rolling-PR rule (invariant I9). ## Bug 1: Override HPU offloading handler get_finished/shutdown - **State machine id**: offloading_handler_buffer_pool_attrerror - **Commit**: e05b5e9 ### Root cause Upstream vLLM PR #42212 added a pinned descriptor-buffer pool (`self._buffer_pool`) and `Transfer.batch_*` fields to `SingleDirectionOffloadingHandler`. The HPU plugin only overrides `__init__` and `transfer_async`, so the inherited upstream `get_finished`/`shutdown` reach for `self._buffer_pool`, which the HPU `__init__` never initializes. This raised an `AttributeError` (surfacing as `EngineDeadError`) in the CPU offloading test. ### Upstream PR vllm-project/vllm#42212 ### Fix Add HPU-specific `get_finished` and `shutdown` overrides that are consistent with the HPU `Transfer` dataclass and `__init__` — recycling streams/events from the HPU pools and clearing HPU-local state, without touching the upstream buffer-pool / `batch_*` machinery. ## Bug 2: Fix minimax_m2 import after mamba LINEAR refactor - **State machine id**: mamba_linear_attn_import_missing - **Commit**: e0a7774 ### Root cause Upstream vLLM moved `MiniMaxText01RMSNormTP` out of `vllm.model_executor.layers.mamba.linear_attn` into a dedicated module `vllm.model_executor.layers.minimax_rms_norm`. The HPU `minimax_m2` model is eagerly imported by `register_model()`, so the stale import broke every CI test at import time. ### Upstream PR vllm-project/vllm#43556 ### Fix Update the import of `MiniMaxText01RMSNormTP` to `vllm.model_executor.layers.minimax_rms_norm`. ## Bug 3: Fix multi_model_api_server imports after serving-utils consolidation - **State machine id**: multi_model_entrypoints_logger_missing - **Commit**: a9ba162 ### Root cause Upstream vLLM consolidated the online serving utilities, removing `entrypoints/logger.py`, `entrypoints/openai/server_utils.py` and `entrypoints/utils.py`. The HPU multi-model API server imported three symbols from those removed modules, breaking the unit-test import path. ### Upstream PR vllm-project/vllm#44479 ### Fix Repoint three imports in `multi_model_api_server.py` to the consolidated locations: `serve.utils.request_logger` (RequestLogger), `serve.utils.server_utils` (get_uvicorn_log_config) and `serve.utils.api_utils` (cli_env_setup, process_lora_modules). ## HPU verification - Pod: Gaudi g3 - Full commit stack (`origin/main..HEAD`) re-verified against vllm@`4efd6ffde09477800294a8ed9cc752017812c3b1`: import-clean with the HPU platform plugin active (minimax_m2, multi_model_api_server, kv_offload cpu_hpu, and register_model all load). ## Related PRs None --------- Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
…offloading connector (vllm-project#42212) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
…offloading connector (vllm-project#42212) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Summary
OffloadingConnectorcopies KV between host and device viacuMemcpyBatchAsync. That call saturates PCIe for large contiguous copies,but on the CPU→GPU (onload / "read") direction it collapses for small
per-descriptor payloads — the regime KV offload actually runs in. This PR adds
a small Triton kernel (
_swap_blocks_kernel) that takes over the CPU→GPUdirection, gated on batch size (
n ≥ 16) and small payloads(
max(sizes) < 28 KiB).Inspired by @ivanium's prototype, adapted to the connector's flat
(src_addr, dst_addr, size)interface.Motivation — the DMA small-page cliff (onload)
A single
cuMemcpyBatchAsyncof N small descriptors (one batched copy = oneproduction swap) is flat at ~5–7 GB/s for every page from 4 KiB up to 24 KiB,
then jumps ~6× at 28 KiB to the copy-engine ceiling. The cliff deepens with
descriptor count N (i.e. with prefix length) — exactly the wrong direction for
KV offload. Per-call DMA read throughput vs page (H100, fast GPU):
The GPU→CPU (offload) direction has no such cliff in the aggregate (posted
writes degrade gradually), so this PR leaves offload on
cuMemcpyBatchAsync.Threshold choice (
_THRESHOLD_BYTES = 28 KiB)DMA vs Triton, onload, matched N=10000, fast GPU. DMA shown back-to-back
(sustained, reaches the ~55 ceiling):
The crossover is 28 KiB across every SM count ≥ 12.
SM count (
_NUM_SMS = 12)Onload bandwidth vs SM count, median over N ∈ {5K…200K}, fast GPU. Ceiling ≈ 51 GB/s:
E2E req/s is identical for sm12 / sm16 / sm20 despite their 38 / 45 / 51 GB/s
kernel ladder, so the smallest slice wins (least decode contention).
End-to-end — gpt-oss-120b, TP=2, OffloadingConnector (24 GB), repeated measures
3 separate-boot reps per cell, median [min,max]; fast-path GPUs; per-layer KV
(8 KiB onload descriptors → kernel engages). B1 = vanilla DMA (HMA-on),
B2 = crosslayers (HMA-off, bundled DMA), PR = Triton onload.
PR beats vanilla DMA (B1) +105–154% req/s at every prefix and lifts onload
delivered bandwidth 4.8 → 37 GB/s.