[cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation #32263
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of CI tests runs unless you ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a new CPU attention backend for ARM NEON with BFMMLA instruction support, specifically targeting bfloat16 data types. The implementation promises significant performance improvements for both prefill and decode stages. The changes include a new ISA dispatch path, a highly optimized BFMMLA GEMM kernel, and custom data layouts for key and value caches to leverage the hardware capabilities. The code is well-structured, using if constexpr for compile-time specialization and providing new tests for the added functionality. My review found the implementation to be solid, with one suggestion to improve the robustness of a kernel function by adding an explicit check for an implicit assumption.
```python
elif current_platform.get_cpu_architecture() == CpuArchEnum.ARM:
    if block_size % 128 == 0 and dtype == torch.bfloat16:
        return "neon_bfmmla"
```
Missing BFMMLA hardware capability check may cause crashes
Medium Severity
The _get_attn_isa function selects "neon_bfmmla" based solely on ARM architecture, block size alignment, and bfloat16 dtype, without checking if the CPU actually supports the BFMMLA instruction (ARMv8.6-A FEAT_BF16MM extension). This is inconsistent with how AMX is handled, which has an explicit torch._C._cpu._is_amx_tile_supported() check. On ARM64 CPUs that support BF16 conversions but lack BFMMLA (such as Apple Silicon M1/M2/M3 or AWS Graviton2), selecting this ISA would cause an illegal instruction crash at runtime.
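To illustrate the kind of gate the reviewer is asking for, here is a hedged sketch of a runtime BFMMLA capability check, analogous to the existing `torch._C._cpu._is_amx_tile_supported()` check for AMX. The function name `supports_bfmmla` and the approach (treating the Linux arm64 `bf16` hwcap flag in `/proc/cpuinfo` as a proxy for FEAT_BF16/BFMMLA) are illustrative assumptions, not vLLM API:

```python
# Hypothetical capability probe; not part of vLLM. On Linux/aarch64 the
# kernel advertises FEAT_BF16 (which includes BFMMLA) via the "bf16" flag
# in /proc/cpuinfo (HWCAP2_BF16). Elsewhere we conservatively report False
# rather than risk an illegal-instruction crash at runtime.
import platform


def supports_bfmmla() -> bool:
    """Best-effort check for the ARMv8.6-A BF16 matrix-multiply extension."""
    if platform.machine() not in ("aarch64", "arm64"):
        return False
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        return False
    for line in cpuinfo.splitlines():
        if line.startswith("Features"):
            # e.g. "Features : fp asimd ... bf16 ..."
            return "bf16" in line.split(":", 1)[1].split()
    return False
```

With such a check in place, `_get_attn_isa` could fall back to the plain NEON path on parts like Graviton2 that lack BFMMLA, mirroring how the AMX path is gated.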
Great work! Thank you :)
I put in some initial comments, around:
- do we need to introduce NEON_BFMMLA as a new ISA - can't your BFMMLA implementation just live under the existing NEON ISA?
- with the BFMMLA implementation we don't need to (and shouldn't) worry about any types other than BF16 - so please specialize for `c10::BFloat16` and not `scalar_t`. Once this is done, the code will be much clearer/simpler and we'll go through another round of reviews.
Hi @gassan-arm, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
csrc/cpu/cpu_attn_neon_bfmmla.hpp (outdated)

```cpp
float* C_blk = C + m * ldc;
```

```cpp
#define DISPATCH_MB(mb)                                             \
  gemm_packA_compute_MB_xN<mb, N, K, BFMMLABLayout::TokenColumn>(   \
```
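For context on what the kernel is built around (background, not part of the PR): a single BFMMLA instruction multiplies a 2x4 BF16 tile by the transpose of another 2x4 BF16 tile and accumulates into a 2x2 FP32 tile. A hedged NumPy emulation of that semantics, modeling BF16 as FP32 with the low 16 mantissa bits truncated (the real instruction performs widening 2-way dot products, which agrees with this up to intermediate rounding):

```python
import numpy as np


def to_bf16(x: np.ndarray) -> np.ndarray:
    """Model BF16 by truncating the low 16 bits of the FP32 encoding."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)


def bfmmla(acc: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Emulate one BFMMLA: acc(2x2, fp32) += A(2x4, bf16) @ B(2x4, bf16)^T."""
    a16 = to_bf16(a)
    b16 = to_bf16(b)
    return acc + a16 @ b16.T


acc = np.zeros((2, 2), dtype=np.float32)
a = np.arange(8, dtype=np.float32).reshape(2, 4)   # rows: 0..3 and 4..7
b = np.ones((2, 4), dtype=np.float32)
out = bfmmla(acc, a, b)
# out[0, :] == 0+1+2+3 == 6.0; out[1, :] == 4+5+6+7 == 22.0
```

The `DISPATCH_MB` macro above then stamps out the GEMM micro-kernel for each supported M-block size at compile time, with each micro-kernel issuing a grid of these 2x2 accumulations.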
For QK phase, why don't we just pack the query in copy_q_heads_tile while it's hot in cache, similar to what we do for AMX?
For PV phase, is the cost to pack P amortized?
we can revisit this later in a future PR
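To make the packing question above concrete: BFMMLA consumes its A operand as 2-row strips of 2x4 tiles rather than plain row-major data, so the operand must be repacked before the GEMM. A hedged sketch of such a pack step; the exact tile ordering here (K-major within a 2-row strip) is an illustrative assumption, not necessarily the layout this PR uses:

```python
import numpy as np


def pack_a_2x4(a: np.ndarray) -> np.ndarray:
    """Pack a row-major (M x K) fp32 operand into 2x4 tiles.

    Illustrative element order per 2-row strip and 4-wide K tile:
    m0k0 m0k1 m0k2 m0k3 m1k0 m1k1 m1k2 m1k3, then the next K tile.
    Requires M even and K a multiple of 4.
    """
    m, k = a.shape
    assert m % 2 == 0 and k % 4 == 0
    return (
        a.reshape(m // 2, 2, k // 4, 4)  # (row pair, row, K tile, k)
        .transpose(0, 2, 1, 3)           # (row pair, K tile, row, k)
        .reshape(-1)
    )


a = np.arange(16, dtype=np.float32).reshape(2, 8)
packed = pack_a_2x4(a)
# First tile holds k=0..3 of both rows, second tile k=4..7:
# [0,1,2,3, 8,9,10,11, 4,5,6,7, 12,13,14,15]
```

Doing this inside `copy_q_heads_tile` (as the reviewer suggests for the QK phase) would fold the repack into a pass that already touches the data while it is hot in cache.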
fadara01
left a comment
LGTM, could you please just test on a machine with no BF16 HW (e.g. c6g) to make sure this works as expected?
@bigPYJ1151 could you please take a look at this?
please see: #32932
aditew01
left a comment
Overall, neat changes. Thanks.
Nit: it'd be great if you could add `TODO:` comments for anything that needs to be addressed but is out of scope for this PR.
fadara01
left a comment
Actually, let's hold off merging this. I'm seeing regressions in end-to-end throughput benchmarks.
I'll add more details in a bit.
@bigPYJ1151 could you please review and hopefully merge this :)
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @gassan-arm, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Implementation of paged attention using BFMMLA for increased BF16 performance
Co-authored-by: GitHub Copilot
Signed-off-by: Gassan <gassan.salama@arm.com>
Looks like ARM testing is broken, but it should not be related to this PR.
Purpose
CPU Paged Attention NEON BFMMLA BF16 Implementation
Co-authored-by: GitHub Copilot
Test Results
Using the #31720 benchmark suite, against the current NEON implementation:
Prefill: 2.32x speedup
Decode: 2.07x speedup
cc @aditew01 @fadara01