
[Spyre-Next] Pytorch Native Attention on Spyre: 4D Attention Kernel#914

Open
jvlunteren wants to merge 54 commits into torch-spyre:main from jvlunteren:pytorch_native_attention_v2

Conversation

@jvlunteren
Collaborator

Description

This PR extends PR #853 by replacing the 2D transposed attention kernel with a 4D broadcast matmul kernel, eliminating per‑sequence and per‑chunk loops, GQA head duplication, and block‑diagonal masking.
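For readers following the description, the 4D broadcast-matmul idea can be sketched roughly as below. This is an illustrative reimplementation, not the kernel in this PR: the helper name `broadcast_attention`, the head-major layouts, and the additive-mask convention are assumptions chosen to show how broadcasting removes GQA head duplication.

```python
import torch

def broadcast_attention(query, key, value, scale, mask):
    """Illustrative 4D attention with GQA handled by broadcasting.

    query:      [num_seqs, num_heads, max_query_len, head_size]
    key/value:  [num_seqs, num_kv_heads, aligned_max_seq_len, head_size]
    mask:       additive, broadcastable to the score tensor.
    """
    num_seqs, num_heads, q_len, head_size = query.shape
    num_kv_heads = key.shape[1]
    group = num_heads // num_kv_heads
    # Split the head dim into [num_kv_heads, group] so each K/V head
    # broadcasts across its query-head group instead of being physically
    # duplicated (no repeat_interleave of K/V).
    q = query.reshape(num_seqs, num_kv_heads, group, q_len, head_size)
    k = key.unsqueeze(2)    # [num_seqs, num_kv_heads, 1, kv_len, head_size]
    v = value.unsqueeze(2)
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale + mask
    out = torch.matmul(torch.softmax(scores, dim=-1), v)
    return out.reshape(num_seqs, num_heads, q_len, head_size)
```

Because `torch.matmul` broadcasts over leading batch dimensions, the size-1 KV-group axis of `k` and `v` is expanded virtually during the matmul, which is what lets a single 4D/5D kernel replace the per-sequence and per-chunk loops.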

Related Issues

Relates to #647

Test Plan

Same approach as in PR #853.

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

bohnstingl and others added 30 commits March 23, 2026 09:30
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
Signed-off-by: Joe Runde <joe@joerun.de>
…Spyre

Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
jvlunteren and others added 19 commits March 25, 2026 10:56
Signed-off-by: Jan van Lunteren <161835099+jvlunteren@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR cannot be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Signed-off-by: Jan van Lunteren <161835099+jvlunteren@users.noreply.github.com>

Prepares tensors on CPU (reshape, stickify, build mask), transfers to
Spyre for the compiled matmul kernel, then transfers the result back.
# Q: [B, padQ, num_heads, D] -> [B, num_heads, padQ, D]
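The permutation in that comment maps directly onto `torch.Tensor.permute`; the concrete sizes below are placeholders for illustration (B = num_seqs, padQ = padded query length, D = head_size), not values from the PR:

```python
import torch

# Placeholder sizes for illustration only.
B, padQ, num_heads, D = 2, 16, 8, 64
q = torch.randn(B, padQ, num_heads, D)
# [B, padQ, num_heads, D] -> [B, num_heads, padQ, D]
q_t = q.permute(0, 2, 1, 3).contiguous()
assert q_t.shape == (B, num_heads, padQ, D)
```

`.contiguous()` materializes the transposed layout, which a stickify/transfer step would typically require.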
Collaborator

what is B here?

Collaborator Author

Batch size (number of sequences)

Collaborator Author

The shapes listed in the comments were originally based on shortened variable names to keep the comments brief and within the line-width limit. For clarity, I have now replaced these abbreviated names with the full variable names used in the code.

Collaborator

OK, but in vLLM we should never have a batch size dimension like that, right? Everything should be "flat"?

Collaborator Author

The query argument in the forward method in line 240 has a "flat" vLLM v1 shape [num_tokens, num_heads, head_size].

This gets converted in line 283 to [num_seqs, max_query_len, num_heads, head_size] in order to be able to use torch.matmul.

In line 310 the output is converted back into the "flat" vLLM v1 shape [num_actual_tokens, num_heads, head_size].
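The flat-to-batched round trip described above can be sketched as follows; the sequence lengths and tensor sizes are made-up example values, and the padding loop is a simple stand-in for whatever the actual conversion in the code does:

```python
import torch

num_heads, head_size = 4, 8
seq_lens = [3, 5, 2]                    # per-sequence query lengths (example)
num_tokens = sum(seq_lens)
max_query_len = max(seq_lens)
flat = torch.randn(num_tokens, num_heads, head_size)  # vLLM v1 "flat" layout

# Pad to [num_seqs, max_query_len, num_heads, head_size] for torch.matmul.
padded = torch.zeros(len(seq_lens), max_query_len, num_heads, head_size)
start = 0
for i, n in enumerate(seq_lens):
    padded[i, :n] = flat[start:start + n]
    start += n

# ... batched attention would run on `padded` here ...

# Un-pad back to the flat [num_actual_tokens, num_heads, head_size] layout.
flat_again = torch.cat([padded[i, :n] for i, n in enumerate(seq_lens)])
assert torch.equal(flat_again, flat)
```

The padded positions only waste compute; a mask keeps them from affecting the attention output before the un-pad step drops them.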

Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
Collaborator

@bringlein bringlein left a comment

looks great. I had just two questions for my understanding.

Comment on lines +545 to +547
query: torch.Tensor, # [num_seqs, max_query_len, num_heads, head_size]
key: torch.Tensor, # [num_seqs, aligned_max_seq_len, num_kv_heads, head_size]
value: torch.Tensor, # [num_seqs, aligned_max_seq_len, num_kv_heads, head_size]
Collaborator

so we expect key and value to be padded, but not the query? What is the rationale behind this interface? (if there is one; I'm fully aware this could also just be temporary)

Collaborator

and as @tdoublep pointed out, is there a way to support the flattened varlen format?

Collaborator Author

The query at the input is "flat" [num_tokens, num_heads, head_size]. The query gets padded inside the code (lines 573-577).


# Compiled attention on Spyre
output_spyre_t = self.attn_op(qt_spyre, k_spyre, vt_spyre, sm_scale_spyre, mask_spyre)
output_spyre = self.attn_op(q_spyre, k_spyre, v_spyre, self.scale, mask_spyre)
Collaborator

can we actually start profiling the performance of the different versions?

Collaborator Author

Yes
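A minimal wall-clock harness along these lines could compare the two `attn_op` variants quoted above. This is a generic sketch, not Spyre-specific API; in particular, the synchronization caveat in the comment is an assumption about accelerator timing in general:

```python
import time

def profile_op(fn, warmup=3, iters=10):
    # Rough average wall-clock seconds per call. Real accelerator profiling
    # would also need a device synchronize before each timestamp so queued
    # work is actually finished when the clock is read.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Placeholder workload; the real comparison would wrap each attn_op call.
avg_s = profile_op(lambda: sum(range(10_000)))
assert avg_s > 0.0
```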

@tdoublep
Collaborator

Could we re-open this PR against the new spyre-inference repo? Then we can merge it.
