
[cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation#32263

Merged
bigPYJ1151 merged 2 commits into vllm-project:main from gassan-arm:neon-bfmmla-04-01-upstremable
Feb 6, 2026

Conversation

@gassan-arm
Contributor

@gassan-arm gassan-arm commented Jan 13, 2026

Purpose

CPU Paged Attention NEON BFMMLA BF16 Implementation

Co-authored-by: GitHub Copilot

Test Results

Using the #31720 benchmark suite, speedup against the current NEON implementation:
Prefill: 2.32x
Decode: 2.07x

cc. @aditew01 @fadara01

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the cpu (Related to CPU backends) and v1 labels Jan 13, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new CPU attention backend for ARM NEON with BFMMLA instruction support, specifically targeting bfloat16 data types. The implementation promises significant performance improvements for both prefill and decode stages. The changes include a new ISA dispatch path, a highly optimized BFMMLA GEMM kernel, and custom data layouts for key and value caches to leverage the hardware capabilities. The code is well-structured, using if constexpr for compile-time specialization and providing new tests for the added functionality. My review found the implementation to be solid, with one suggestion to improve the robustness of a kernel function by adding an explicit check for an implicit assumption.

```diff
-if current_platform.get_cpu_architecture() == CpuArchEnum.ARM:
+elif current_platform.get_cpu_architecture() == CpuArchEnum.ARM:
+    if block_size % 128 == 0 and dtype == torch.bfloat16:
+        return "neon_bfmmla"
```

Missing BFMMLA hardware capability check may cause crashes

Medium Severity

The _get_attn_isa function selects "neon_bfmmla" based solely on ARM architecture, block size alignment, and bfloat16 dtype, without checking whether the CPU actually supports the BFMMLA instruction (Armv8.6-A FEAT_BF16 extension). This is inconsistent with how AMX is handled, which has an explicit torch._C._cpu._is_amx_tile_supported() check. On arm64 CPUs that support BF16 conversions but lack BFMMLA (such as Apple Silicon M1/M2/M3 or AWS Graviton2), selecting this ISA would cause an illegal-instruction crash at runtime.

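A runtime guard along these lines would close the gap (a hypothetical sketch, not code from this PR: the helper name `supports_bfmmla` and the `AT_HWCAP2`/`HWCAP2_BF16` constants are assumptions taken from the Linux arm64 hwcap ABI, mirroring the spirit of the existing `torch._C._cpu._is_amx_tile_supported()` check):

```python
import ctypes
import platform

# Assumed constants from the Linux arm64 hwcap ABI:
# AT_HWCAP2 = 26, HWCAP2_BF16 = 1 << 14 (FEAT_BF16, which includes BFMMLA).
AT_HWCAP2 = 26
HWCAP2_BF16 = 1 << 14


def supports_bfmmla() -> bool:
    # BFMMLA is only available on Linux/aarch64 CPUs advertising FEAT_BF16.
    if platform.system() != "Linux" or platform.machine() != "aarch64":
        return False
    try:
        libc = ctypes.CDLL(None)
        libc.getauxval.restype = ctypes.c_ulong
        libc.getauxval.argtypes = [ctypes.c_ulong]
        return bool(libc.getauxval(AT_HWCAP2) & HWCAP2_BF16)
    except (OSError, AttributeError):
        return False
```

With such a helper, the ISA selection could fall back to the generic NEON path on Graviton2-class hardware instead of crashing at runtime.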

@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from 8efd7e7 to a15fb20 Compare January 13, 2026 15:04
Contributor

@fadara01 fadara01 left a comment


Great work! Thank you :)

I put in some initial comments, around:

  • do we need to introduce NEON_BFMMLA as a new ISA, or can your BFMMLA implementation just live under the existing NEON ISA?
  • with the BFMMLA implementation we don't need to (and shouldn't) worry about any types other than BF16, so please specialize for c10::BFloat16 rather than scalar_t. Once this is done, the code will be much clearer and simpler, and we'll go through another round of reviews.

@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from a15fb20 to ea256d0 Compare January 15, 2026 23:16
@mergify

mergify bot commented Jan 15, 2026

Hi @gassan-arm, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from ea256d0 to 5008f62 Compare January 15, 2026 23:26
```cpp
float* C_blk = C + m * ldc;

#define DISPATCH_MB(mb) \
  gemm_packA_compute_MB_xN<mb, N, K, BFMMLABLayout::TokenColumn>( \
```
Contributor


For QK phase, why don't we just pack the query in copy_q_heads_tile while it's hot in cache, similar to what we do for AMX?
For PV phase, is the cost to pack P amortized?
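For context on the kernel these packing questions concern: the BFMMLA instruction (vbfmmlaq_f32 in ACLE) accumulates a 2x2 float32 block with the product of a 2x4 BF16 tile and the transpose of a second 2x4 BF16 tile, which is why both operands must be packed into 2x4 micro-tiles. A rough numpy model of those semantics (the `to_bf16` rounding helper is our own sketch, not vLLM code):

```python
import numpy as np


def to_bf16(x):
    # Truncate float32 to bfloat16 (round-to-nearest-even), kept as float32.
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF0000
    return u.view(np.float32)


def bfmmla(acc, a, b):
    # Model of vbfmmlaq_f32: acc (2x2 f32) += a (2x4 bf16) @ b (2x4 bf16).T
    return acc + to_bf16(a) @ to_bf16(b).T
```

For example, with acc zeroed, a = [[1,2,3,4],[5,6,7,8]], and b = [[1,0,0,0],[0,1,0,0]], the result is [[1,2],[5,6]]: each output element is a dot product of one row of a with one row of b.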

Contributor

@fadara01 fadara01 Jan 23, 2026


we can revisit this later in a future PR

@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from 5008f62 to a1d3bd6 Compare January 16, 2026 15:59
Contributor

@fadara01 fadara01 left a comment


LGTM, could you please just test on a machine with no BF16 HW (e.g. c6g) to make sure this works as expected?

@fadara01
Contributor

@bigPYJ1151 could you please take a look at this?

@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from a1d3bd6 to 75339ab Compare January 23, 2026 09:52
@gassan-arm
Contributor Author

> LGTM, could you please just test on a machine with no BF16 HW (e.g. c6g) to make sure this works as expected?

please see: #32932

Contributor

@aditew01 aditew01 left a comment


Overall, neat changes. Thanks.
Nit: it'd be great if you could add TODO: comments for anything that needs to be addressed but is out of scope for this PR.

Contributor

@fadara01 fadara01 left a comment


Actually, let's hold off merging this. I'm seeing regressions in end-to-end throughput benchmarks.
I'll add more details in a bit.

@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from 75339ab to 6c65fa7 Compare January 29, 2026 14:14
@fadara01
Contributor

@bigPYJ1151 could you please review and hopefully merge this :)

@bigPYJ1151 bigPYJ1151 self-assigned this Jan 30, 2026
@mergify

mergify bot commented Feb 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gassan-arm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 4, 2026
@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from 6c65fa7 to a9e44c6 Compare February 4, 2026 14:15
@mergify mergify bot removed the needs-rebase label Feb 4, 2026
@gassan-arm
Contributor Author

Hi,
@bigPYJ1151 thanks for your comments! I've incorporated them and rebased.

@mergify

mergify bot commented Feb 4, 2026

Hi @gassan-arm, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

Implementation of paged attention using BFMMLA for increased BF16 performance
Co-authored-by: GitHub Copilot

Signed-off-by: Gassan <gassan.salama@arm.com>
@gassan-arm gassan-arm force-pushed the neon-bfmmla-04-01-upstremable branch from a9e44c6 to eb17631 Compare February 4, 2026 14:23
@bigPYJ1151 bigPYJ1151 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 6, 2026
@bigPYJ1151
Member

Looks like ARM testing is broken, but it should not be related to this PR.

@bigPYJ1151 bigPYJ1151 merged commit 1363e3d into vllm-project:main Feb 6, 2026
108 checks passed
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026