[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON#29193

Merged
vllm-bot merged 1 commit intovllm-project:mainfrom
fadara01:accelerate_arm_attention
Nov 22, 2025

Conversation

@fadara01
Contributor

@fadara01 fadara01 commented Nov 21, 2025

[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON

NEON is an Arm SIMD instruction set extension, mandatory since Armv8-A.

Purpose

Fixes #28981 for Arm CPUs
Related to #23934 (since it enables attention sinks for Arm)

PR #27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks.

However, it is currently disabled for the prefill phase on Arm CPUs because it is slower than torch.sdpa for relatively long prefills. Hence chunked prefill, prefix caching, sinks, etc. remained disabled on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs (enabled with ISA::NEON) for QK and PV. With the new GEMMs, cpu_attention_with_kv_cache performs on par with torch.sdpa for long prefills, which allows us to enable cpu_attention_with_kv_cache for the prefill path on Arm and therefore enable chunked prefill, prefix caching, sinks, alibi, softcap, etc.

Performance

Uplift with ISA::NEON vs ISA::VEC:
For batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128: using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current state with ISA::VEC.
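For a sense of scale, the GEMM work at that shape can be estimated with a few lines of back-of-envelope arithmetic (illustrative only; not a measurement from this PR, and it ignores softmax cost and the GQA sharing of K/V across query heads):

```python
# Back-of-envelope FLOP count for the QK and PV GEMMs at the shape above.
batch, q_tokens, kv_tokens, q_heads, head_size = 64, 512, 512, 32, 128

# Each GEMM does 2*M*N*K FLOPs (multiply + add) per head.
qk = 2 * q_tokens * kv_tokens * head_size   # S = Q @ K^T
pv = 2 * q_tokens * kv_tokens * head_size   # O = P @ V
total = batch * q_heads * (qk + pv)
print(f"~{total / 1e12:.2f} TFLOPs of attention GEMM work per layer")  # ~0.27
```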

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes: ISA::NEON yields ~13% higher throughput than ISA::VEC and similar throughput to using torch.sdpa for prefill.

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
  --num-prompts 128 \
  --seed 0 \
  --dataset-name sharegpt \
  --input-len 1024 \
  --output-len 128 \
  --max-model-len 2048 \
  --max-num-batched-tokens 8192 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --load-format dummy
```

Test Plan

./run-cpu-test-arm.sh
which includes tests/kernels/attention/test_cpu_attn.py

Test Result

All tests pass

Future Work

This PR enables a solid reference path for attention GEMMs on Arm CPUs.
Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and leveraging bfmmla/bfdot for QK, PV on Arm CPUs with bf16 support, which should yield another 2x speedup for attention.


Essential Elements of an Effective PR Description Checklist
  • [Y] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [Y] The test plan, such as providing test command.
  • [Y] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces NEON acceleration for Arm CPU Attention GEMMs, which is a significant performance improvement. The changes are well-structured, with a new cpu_attn_neon.hpp file containing the optimized kernels and modifications in other files to integrate the new ISA path. The NEON implementation itself is solid, using intrinsics and unrolling to achieve better performance. I've found an important issue regarding naming clarity in the new NEON implementation that should be addressed to improve maintainability.

@fadara01 fadara01 force-pushed the accelerate_arm_attention branch 2 times, most recently from 1846ffd to 28a7367 Compare November 21, 2025 18:52
@fadara01 fadara01 changed the title Accelerate Arm CPU Attention GEMMs with NEON Accelerate CPU Attention GEMMs on Arm with NEON Nov 21, 2025
@fadara01
Contributor Author

fadara01 commented Nov 21, 2025

@mgoin @bigPYJ1151

Can you guys have a look? This deprecates the torch.sdpa prefill path for Arm CPUs and enables chunked prefill, prefix caching, sinks, etc.

@fadara01 fadara01 changed the title Accelerate CPU Attention GEMMs on Arm with NEON [perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON Nov 21, 2025
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed aarch64-cpu labels Nov 21, 2025
@fadara01 fadara01 changed the title [perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON [perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON Nov 22, 2025
@fadara01 fadara01 force-pushed the accelerate_arm_attention branch from 28a7367 to 30e1900 Compare November 22, 2025 09:59
@fadara01
Contributor Author

Hmmm, CI used to pass; the current failures are unrelated.

Member

@mgoin mgoin left a comment

This works great on M1 Mac, validated with a Qwen3-0.6B eval on GSM8k. LGTM!

@vllm-bot vllm-bot merged commit 730bd35 into vllm-project:main Nov 22, 2025
53 of 55 checks passed
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
fadara01 added a commit to fadara01/vllm that referenced this pull request Dec 9, 2025
Should really have been part of vllm-project#29193 but I missed it

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>

Labels

performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: chunked prefill disabled & max batched tokens not compatible with max model length on non-X86 CPU Backend

3 participants