
Conversation

@fadara01 (Contributor) commented Oct 16, 2025

[fix][cpu] fix prefill attention in CPU attention backend

  • Disables prefix caching because prefill attention can't handle a paged KV cache
  • Fixes the Q/K/V slices used during prefill on mixed prefill/decode requests (see the sketch below)
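
A minimal sketch of the indexing issue the second bullet fixes, assuming a flattened token layout in which decode tokens precede prefill tokens; the variable names are illustrative, not vLLM's:

```python
import torch

# Hypothetical mixed batch: 2 decode requests (1 token each) followed by
# 1 prefill request of 5 tokens, flattened along the token dimension.
num_decode_tokens = 2
num_prefill_tokens = 5
num_heads, head_dim = 4, 8

q = torch.randn(num_decode_tokens + num_prefill_tokens, num_heads, head_dim)

# Buggy: slicing from the start of the batch hands prefill attention the
# decode tokens as if they were prompt tokens.
q_prefill_buggy = q[:num_prefill_tokens]

# Fixed: the prefill Q/K/V slices must start after the decode tokens.
q_prefill = q[num_decode_tokens : num_decode_tokens + num_prefill_tokens]

assert q_prefill.shape[0] == num_prefill_tokens
```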

Purpose

Fixes #27034

Test Plan

Test script attached to #27034.

Test Result

The output of the test script attached to #27034 is the same when prompts are batched as when prompts are run one at a time.
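
For context, a sketch of the kind of comparison the test performs; the actual script lives in #27034, and the model and prompts here are placeholders. Greedy sampling makes the outputs deterministic, so batched and per-prompt runs should match exactly:

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "1 + 1 ="]  # placeholder prompts
params = SamplingParams(temperature=0.0, max_tokens=16)  # greedy decoding

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# Batched: all prompts submitted in one call.
batched = [o.outputs[0].text for o in llm.generate(prompts, params)]

# One at a time: each prompt in its own call.
single = [llm.generate([p], params)[0].outputs[0].text for p in prompts]

assert batched == single, "batched and per-prompt outputs should match"
```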


Essential Elements of an Effective PR Description Checklist
  • [Y] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [Y] The test plan, such as providing test command.
  • [Y] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@fadara01 (Contributor Author)

Hi @bigPYJ1151 - would you be able to review this please?

@mergify mergify bot added the v1 label Oct 16, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces fixes for prefill attention in the CPU attention backend. The changes correctly handle mixed prefill/decode requests by adjusting the starting indices for Q/K/V tensors and correctly slicing sequence lengths for prefill requests. Additionally, it disables prefix caching on certain CPU architectures where it is not supported. The changes are logical, well-implemented, and address the described issues effectively. I have no major concerns.
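
To illustrate the sequence-length slicing mentioned above, a hedged sketch assuming the batch metadata orders decode requests before prefill requests; the names here are illustrative, not the backend's actual fields:

```python
# Per-request sequence lengths for a hypothetical mixed batch,
# ordered decodes first: 2 decode requests, then 2 prefill requests.
seq_lens = [17, 33, 5, 9]
num_decode_reqs = 2

# Prefill attention should only see the prefill requests' lengths;
# slicing from index 0 would hand it the decode lengths instead.
prefill_seq_lens = seq_lens[num_decode_reqs:]
assert prefill_seq_lens == [5, 9]
```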

@LucasWilkinson (Collaborator)

@bigPYJ1151 do you think you can help look at this? I'm not that well versed in the CPU backend

@bigPYJ1151 (Member)

After some tests I found that even with enable_chunked_prefill=False, the attention backend still receives mixed batches; this differs from V0.
This PR looks reasonable and fixes the bug. Please fix the failing pre-commit check, thanks :)
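
For reference, a minimal sketch of the configuration behind this observation; enable_chunked_prefill is a real vLLM engine argument, while the model name is a placeholder:

```python
from vllm import LLM

# Even with chunked prefill disabled, the V1 scheduler can still place
# prefill and decode requests in the same batch on the CPU backend,
# unlike V0. The attention backend therefore has to handle mixed batches.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    enable_chunked_prefill=False,
)
```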

@fadara01 (Contributor Author)

Thanks for your review, @bigPYJ1151.
Pre-commit is passing now.

@bigPYJ1151 (Member) left a comment

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) October 18, 2025 11:29
@github-actions github-actions bot added the ready label Oct 18, 2025
@bigPYJ1151 bigPYJ1151 merged commit ab4be40 into vllm-project:main Oct 18, 2025
50 checks passed
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
adabeyta pushed a commit to adabeyta/vllm that referenced this pull request Oct 20, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Nov 12, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ready, v1


Development

Successfully merging this pull request may close these issues.

[Bug]: Incorrect outputs with batch size > 1 on AArch64 CPU
