[CPU] Add head sizes 80 and 112 with vec16 fallback#31968
bigPYJ1151 merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request reintroduces support for head dimensions 80 and 112 in the CPU attention backend, with a fallback to vec16 since these sizes are not optimal for AMX instructions. The changes look good overall, correctly implementing the fallback logic in the C++ backend.
My review includes two main points:
- There is redundant logic in the Python code for determining the ISA, which is already handled robustly in the C++ backend. I've suggested removing the Python-side logic to maintain a single source of truth.
- The automated tests have not been updated to include the new head sizes (80 and 112). I've recommended adding them to the test suite to ensure the new functionality is properly verified.
Addressing these points will improve the maintainability and robustness of the code.
 def get_supported_head_sizes(cls) -> list[int]:
-    return [32, 64, 96, 128, 160, 192, 224, 256]
+    return [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]
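The review points out that no tests were added for the new sizes. As a minimal, illustrative sketch of that missing coverage (the constant and function names here are hypothetical; the values are copied from the diff above):

```python
# Hypothetical standalone check; values mirror the updated return list of
# CPUAttentionBackend.get_supported_head_sizes in the diff above.
SUPPORTED_HEAD_SIZES = [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]

def is_supported(head_size: int) -> bool:
    """Return True if the CPU attention backend supports this head size."""
    return head_size in SUPPORTED_HEAD_SIZES

# The two sizes reintroduced by this PR are now reported as supported:
assert is_supported(80) and is_supported(112)
```

A real test would instead parametrize over the backend's reported sizes and run the attention kernel end to end, but even a membership check like this would have caught a regression in the supported list.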
 def _get_attn_isa(
     dtype: torch.dtype, block_size: int, head_size: int | None = None
 ) -> str:
     if head_size in (80, 112):
         return "vec16"
The logic to force vec16 for specific head sizes is duplicated here and in the C++ backend. The C++ code in csrc/cpu/cpu_attn.cpp already handles this fallback in get_scheduler_metadata, cpu_attn_reshape_and_cache, and cpu_attention_with_kv_cache using the requires_vec16_fallback helper.
To maintain a single source of truth and avoid redundancy, this logic should only reside in the C++ backend. Please remove the head_size check from this function and revert its signature.
Also, the call to this function in CPUAttentionMetadataBuilder.__init__ (line 140) should be reverted to self.isa = _get_attn_isa(self.dtype, self.block_size).
Suggested change:

def _get_attn_isa(dtype: torch.dtype, block_size: int) -> str:

@@ -15,6 +15,7 @@
 #ifdef __aarch64__
 #include "cpu_attn_neon.hpp"
 // NEON requires head_dim to be a multiple of 32
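The C++ helper requires_vec16_fallback referenced in the comment above is not shown in this thread. As an illustration only, the condition described in the PR summaries (head sizes that are 16-aligned but not 32-aligned) can be sketched in Python as:

```python
def requires_vec16_fallback(head_size: int) -> bool:
    """Sketch of the fallback condition described in the PR summaries:
    head sizes that are multiples of 16 but not of 32 (e.g. 80, 112)
    skip the AMX/NEON paths and use the 16-wide vector kernel."""
    return head_size % 32 != 0 and head_size % 16 == 0

# Among the supported sizes, only 80 and 112 take the fallback path:
fallback = [h for h in (32, 64, 80, 96, 112, 128) if requires_vec16_fallback(h)]
```

Keeping this predicate in one place (the C++ backend) is exactly the single-source-of-truth point the reviewer raises: duplicating it in Python risks the two copies drifting apart.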
OK... now we have created too many redundant template instantiations; the dispatch procedure needs to be reorganized.
Maybe a future PR?
@bigPYJ1151 it looks like the static assertion is causing this failure; is it OK to remove this assertion for now?
Yes, there are some static checks related to head size in the AMX and Vec implementations that need to be disabled temporarily.
Force-pushed from 3d38760 to 9b9584d
Reintroduce support for head dimensions 80 and 112 in the CPU attention backend. These sizes bypass the Intel AMX, NEON, and wide-vector (vec) optimizations, forcing the vec16 path. Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
@bigPYJ1151 @R3hankhan123 could you please create issues for the items that need to be addressed in future PRs, so that we do not forget about them? Why were no tests added for the newly supported head dimensions?
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Reintroduce support for head dimensions 80 and 112 in the CPU attention backend. These were previously removed in #27954, but they are commonly used by Granite models deployed on Z architectures. Since these head sizes are not friendly to the Intel AMX instruction set, the implementation now falls back to vec16.
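To illustrate why exactly these two sizes need special handling, a quick classification of the backend's supported head sizes by 32-element alignment (illustrative only; the real dispatch lives in csrc/cpu/cpu_attn.cpp):

```python
# Illustrative: split the supported head sizes by whether they are
# 32-aligned (usable by the AMX/NEON paths, per the PR discussion) or
# only 16-aligned (routed to the vec16 fallback by this PR).
sizes = [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]
vec16_only = [s for s in sizes if s % 32 != 0]      # [80, 112]
amx_friendly = [s for s in sizes if s % 32 == 0]
```

Every size in the list is a multiple of 16, so the only outliers are 80 and 112, which is why the fallback condition singles them out.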
Test Plan
Build the Docker image and test using the ibm-granite/granite-3b-code-base-2k model, which has a head size of 80.
Test Result
Server Logs
curl request:
Note

CPU attention: add 80/112 head sizes with vec16 fallback

- Adds 80 and 112 to head-dim dispatch in cpu_attn.cpp and to CPUAttentionBackend.get_supported_head_sizes.
- _get_attn_isa now returns "vec16" when head_size % 32 != 0 and head_size % 16 == 0; the builder passes head_dim to _get_attn_isa.
- Relaxes the static check in cpu_attn_neon.hpp to avoid a hard failure.

Written by Cursor Bugbot for commit 66ee4fc1271157cd45665968cb4a5a978bdef2f6.