
[Bugfix] Fix CPU backend crash in KV cache block zeroing#37550

Merged
bigPYJ1151 merged 3 commits into vllm-project:main from
DorBernsohn:fix/cpu-kv-block-zero-triton-crash
Mar 23, 2026

Conversation

@DorBernsohn
Contributor

@DorBernsohn DorBernsohn commented Mar 19, 2026

  • Override _zero_block_ids in CPUModelRunner with a pure PyTorch implementation to avoid calling the Triton GPU kernel (_zero_kv_blocks_kernel), which crashes on CPU nodes without an active GPU driver.

Closes #37546

Test plan

  • Verified syntax and pre-commit hooks pass
  • Implemented a minimal override using PyTorch (tensor.zero_()) to replace the Triton kernel path only for CPU
  • Existing CPU CI tests cover the integration path

Override _zero_block_ids in CPUModelRunner with a pure PyTorch
implementation to avoid calling the Triton kernel that fails when
Triton has no active GPU driver.

Closes vllm-project#37546

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
@DorBernsohn DorBernsohn requested a review from bigPYJ1151 as a code owner March 19, 2026 10:48
@mergify mergify bot added the cpu (Related to CPU backends), v1, and bug (Something isn't working) labels Mar 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The changes effectively address the reported bug by providing a CPU-specific implementation for zeroing KV cache blocks. This prevents the TypeError that occurred when GPU-specific kernels were invoked on CPU-only environments. The solution is straightforward and uses standard PyTorch operations, ensuring compatibility and correctness for the CPU backend.

Comment on lines +93 to +98
if not block_ids:
    return
for kv_cache in self.kv_caches:
    # CPU attention backend shape: (2, num_blocks, heads, block_sz, head_sz)
    # block_dim = 1
    kv_cache[:, block_ids].zero_()
Member


For the CPU attention backend the zeroing is not required. Unlike FlashAttention, the CPU attention backend assigns -INF to the logits of invalid positions, so invalid KV cache entries do not affect the computation. Therefore _zero_block_ids can be a pass.

for (int32_t i = 0; i < left_invalid_token_num; ++i) {
  curr_logits_buffer[i] = neg_inf;
}
for (int32_t i = 0; i < right_invalid_token_num; ++i) {
  curr_logits_buffer_tail[i] = neg_inf;
}

Contributor Author


@bigPYJ1151 you're right - updated _zero_block_ids to be a no-op since the CPU attention backend already handles invalid positions by assigning -inf to their logits.

…tions

CPU attention backend assigns -INF to logits at invalid KV cache positions,
so zeroing is unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
Member

@bigPYJ1151 bigPYJ1151 left a comment


Thanks! LGTM

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) March 23, 2026 09:28
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 23, 2026
@bigPYJ1151 bigPYJ1151 merged commit 7938d12 into vllm-project:main Mar 23, 2026
51 checks passed
yzong-rh pushed a commit to yzong-rh/vllm that referenced this pull request Mar 23, 2026
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Mar 27, 2026
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026

Labels

bug (Something isn't working), cpu (Related to CPU backends), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: CPU backend crashes with TypeError: 'function' object is not subscriptable on first inference request

2 participants