[Bugfix] Fix CPU backend crash in KV cache block zeroing #37550
bigPYJ1151 merged 3 commits into vllm-project:main
Conversation
Override _zero_block_ids in CPUModelRunner with a pure PyTorch implementation to avoid calling the Triton kernel that fails when Triton has no active GPU driver. Closes vllm-project#37546 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
Code Review
The changes effectively address the reported bug by providing a CPU-specific implementation for zeroing KV cache blocks. This prevents the TypeError that occurred when GPU-specific kernels were invoked on CPU-only environments. The solution is straightforward and uses standard PyTorch operations, ensuring compatibility and correctness for the CPU backend.
vllm/v1/worker/cpu_model_runner.py
Outdated
```python
if not block_ids:
    return
for kv_cache in self.kv_caches:
    # CPU attention backend shape: (2, num_blocks, heads, block_sz, head_sz)
    # block_dim = 1
    kv_cache[:, block_ids].zero_()
```
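The original proposed override zeroes the selected blocks along dim 1 in place. A minimal pure-Python analogue of that indexing (illustration only; the real code operates on a torch tensor via `kv_cache[:, block_ids].zero_()`, and the nested-list cache here is a stand-in):

```python
# Pure-Python sketch of zeroing KV cache blocks along dim 1.
# kv_cache is modeled as (2, num_blocks, block_size) nested lists,
# standing in for the torch tensor used by the CPU backend.

def zero_block_ids(kv_cache, block_ids):
    """Zero the entries of the given block ids (dim 1) in place."""
    if not block_ids:
        return
    for plane in kv_cache:       # key/value planes (dim 0, size 2)
        for b in block_ids:      # selected blocks (dim 1)
            plane[b] = [0.0] * len(plane[b])

cache = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]] for _ in range(2)]
zero_block_ids(cache, [1])
print(cache[0][1])  # block 1 zeroed in both planes; other blocks untouched
```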
For the CPU attention backend the zeroing is not required. Unlike FlashAttention, the CPU attention kernel assigns -INF to the logits of invalid positions, so invalid KV cache entries do not affect the computation. Therefore `_zero_block_ids` can simply be a no-op (`pass`).
vllm/csrc/cpu/cpu_attn_impl.hpp
Lines 1129 to 1134 in 35141a7
@bigPYJ1151 you're right. Updated `_zero_block_ids` to be a no-op, since the CPU attention backend already handles invalid positions by assigning -inf to their logits.
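The final fix agreed on above amounts to a method override. A minimal sketch (the stub class names here are hypothetical; the real classes live under `vllm/v1/worker/` and the GPU path dispatches to the Triton kernel `_zero_kv_blocks_kernel`):

```python
# Sketch: CPUModelRunner overrides the GPU runner's _zero_block_ids
# with a no-op, because the CPU attention kernel masks invalid
# positions with -inf, so stale KV blocks never affect outputs.

class GPUModelRunnerStub:
    def _zero_block_ids(self, block_ids):
        # In the real GPU runner this launches a Triton kernel,
        # which raises on hosts without an active GPU driver.
        raise RuntimeError("Triton kernel path requires a GPU driver")

class CPUModelRunnerStub(GPUModelRunnerStub):
    def _zero_block_ids(self, block_ids):
        # No-op: invalid KV cache positions get -inf logits on CPU.
        pass

runner = CPUModelRunnerStub()
runner._zero_block_ids([0, 1, 2])  # safe on CPU-only hosts, returns None
```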
…tions CPU attention backend assigns -INF to logits at invalid KV cache positions, so zeroing is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
…t#37550) Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
Override `_zero_block_ids` in `CPUModelRunner` with a pure PyTorch implementation to avoid calling the Triton GPU kernel (`_zero_kv_blocks_kernel`), which crashes on CPU nodes without an active GPU driver. `CPUModelRunner` lacked a CPU-safe fallback, which caused a `TypeError: 'function' object is not subscriptable` on the first inference request for all models using the CPU backend. Closes #37546
Test plan
Use pure PyTorch (`tensor.zero_()`) to replace the Triton kernel path only for the CPU backend.