[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON#29193

Merged
vllm-bot merged 1 commit intovllm-project:mainfrom
fadara01:accelerate_arm_attention
Nov 22, 2025

Conversation

@fadara01
Contributor

@fadara01 fadara01 commented Nov 21, 2025

[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON

NEON is an Arm SIMD instruction set extension, mandatory since Armv8-A.

Purpose

Fixes #28981 for Arm CPUs
Related to #23934 (since it enables attention sinks for Arm)

PR #27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks.

However, it is currently disabled for the prefill phase on Arm CPUs because it is slower than torch.sdpa for relatively long prefills. Hence chunked prefill, prefix caching, sinks, etc. remained disabled on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs (enabled with ISA::NEON) for QK and PV. With the new GEMMs, cpu_attention_with_kv_cache performs on par with torch.sdpa for long prefills, which allows us to enable cpu_attention_with_kv_cache for the prefill path on Arm and therefore enable chunked prefill, prefix caching, sinks, alibi, softcap, etc.

Performance

Uplift with ISA::NEON vs ISA::VEC:
For batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128: using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current state with ISA::VEC.
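For a sense of scale, the GEMM work at that shape can be estimated with a few lines of back-of-envelope arithmetic (illustrative only; not a measurement from this PR, and it ignores softmax cost and the GQA sharing of K/V across query heads):

```python
# Back-of-envelope FLOP count for the QK and PV GEMMs at the shape above.
batch, q_tokens, kv_tokens, q_heads, head_size = 64, 512, 512, 32, 128

# Each GEMM does 2*M*N*K FLOPs (multiply + add) per head.
qk = 2 * q_tokens * kv_tokens * head_size   # S = Q @ K^T
pv = 2 * q_tokens * kv_tokens * head_size   # O = P @ V
total = batch * q_heads * (qk + pv)
print(f"~{total / 1e12:.2f} TFLOPs of attention GEMM work per layer")  # ~0.27
```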

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes: ISA::NEON yields ~13% higher throughput than ISA::VEC and similar throughput to using torch.sdpa for prefill.

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
  --num-prompts 128 \
  --seed 0 \
  --dataset-name sharegpt \
  --input-len 1024 \
  --output-len 128 \
  --max-model-len 2048 \
  --max-num-batched-tokens 8192 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --load-format dummy
```

Test Plan

./run-cpu-test-arm.sh
which includes tests/kernels/attention/test_cpu_attn.py

Test Result

All tests pass

Future Work

This PR enables a solid reference path for attention GEMMs on Arm CPUs.
Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and leveraging bfmmla/bfdot for QK, PV on Arm CPUs with bf16 support, which should yield another 2x speedup for attention.


Essential Elements of an Effective PR Description Checklist
  • [Y] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [Y] The test plan, such as providing test command.
  • [Y] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces NEON acceleration for Arm CPU Attention GEMMs, which is a significant performance improvement. The changes are well-structured, with a new cpu_attn_neon.hpp file containing the optimized kernels and modifications in other files to integrate the new ISA path. The NEON implementation itself is solid, using intrinsics and unrolling to achieve better performance. I've found an important issue regarding naming clarity in the new NEON implementation that should be addressed to improve maintainability.

@fadara01 fadara01 force-pushed the accelerate_arm_attention branch 2 times, most recently from 1846ffd to 28a7367 Compare November 21, 2025 18:52
@fadara01 fadara01 changed the title Accelerate Arm CPU Attention GEMMs with NEON Accelerate CPU Attention GEMMs on Arm with NEON Nov 21, 2025
@fadara01
Contributor Author

fadara01 commented Nov 21, 2025

@mgoin @bigPYJ1151

Can you guys have a look? This deprecates the torch.sdpa prefill path for Arm CPUs and enables chunked prefill, prefix caching, sinks, etc.

@fadara01 fadara01 changed the title Accelerate CPU Attention GEMMs on Arm with NEON [perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON Nov 21, 2025
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed aarch64-cpu labels Nov 21, 2025
@fadara01 fadara01 changed the title [perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON [perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON Nov 22, 2025
@fadara01 fadara01 force-pushed the accelerate_arm_attention branch from 28a7367 to 30e1900 Compare November 22, 2025 09:59
@fadara01
Contributor Author

Hmmm, CI used to pass; the current failures are unrelated.

Member

@mgoin mgoin left a comment

This works great on M1 Mac, validated with a Qwen3-0.6B eval on GSM8k. LGTM!

@vllm-bot vllm-bot merged commit 730bd35 into vllm-project:main Nov 22, 2025
53 of 55 checks passed
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…h NEON (vllm-project#29193)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
fadara01 added a commit to fadara01/vllm that referenced this pull request Dec 9, 2025
Should really have been part of vllm-project#29193 but I missed it

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>

Labels

performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: chunked prefill disabled & max batched tokens not compatible with max model length on non-X86 CPU Backend

3 participants