[Spyre-Next] Integrated custom attention backend #798

Draft
bohnstingl wants to merge 16 commits into torch-spyre:main from bohnstingl:origin/pytorch_paged_attn

Conversation

@bohnstingl
Collaborator

Description

This PR provides the skeleton for adding new attention backends to the vllm_spyre_next plugin.
It uses the first draft implementation of the PyTorch-native paged attention as an example.

Related Issues

#648 and former PR #774

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

cc @tdoublep @dilipgb @joerunde

Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
@bohnstingl bohnstingl requested review from joerunde, tdoublep and yannicks1 and removed request for joerunde March 6, 2026 09:21
@bohnstingl bohnstingl self-assigned this Mar 6, 2026
@github-actions

github-actions Bot commented Mar 6, 2026

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@bohnstingl
Collaborator Author

I updated this PR with some new features. In particular, I

  • Vectorized some parts that were previously Python loops (referred to as SpyreAttentionPaged (native) in the table below)
  • Introduced a flag use_sdpa, which enables the use of torch.nn.functional.scaled_dot_product_attention (referred to as SpyreAttentionPaged (torch.sdpa) in the table below)
  • Functionally verified the two implementations against the stripped correctness test from vLLM upstream.
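To illustrate the use_sdpa path, here is a minimal, self-contained sketch of one paged-attention decode step in pure PyTorch. The names and cache layout (block_table, k_cache shape, etc.) are illustrative assumptions, not the PR's actual API: a sequence's KV blocks are gathered through a block table and then handed to torch.nn.functional.scaled_dot_product_attention in one call.

```python
# Hypothetical sketch of an SDPA-based paged-attention decode step.
# The cache layout and names are illustrative, not the PR's actual API.
import torch
import torch.nn.functional as F

num_blocks, block_size, num_heads, head_dim = 8, 4, 2, 16
k_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)
v_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)

# A sequence of length 10 occupying (possibly non-contiguous) blocks 5, 2, 7.
block_table = torch.tensor([5, 2, 7])
seq_len = 10

# Gather the sequence's blocks, flatten to tokens, trim padding in the last block.
k = k_cache[block_table].reshape(-1, num_heads, head_dim)[:seq_len]
v = v_cache[block_table].reshape(-1, num_heads, head_dim)[:seq_len]

# One new query token; SDPA expects (batch, heads, tokens, head_dim).
q = torch.randn(1, num_heads, 1, head_dim)
k = k.permute(1, 0, 2).unsqueeze(0)  # (1, heads, seq_len, head_dim)
v = v.permute(1, 0, 2).unsqueeze(0)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 1, 16])
```

The point of the block-table indirection is that the KV cache for one sequence need not be contiguous in memory; the gather makes it contiguous just for the attention call.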

Also, I evaluated the implementation with three runs of the GSM8K test from upstream vLLM over the full 1319 questions. The results are in the table below:

| Backend | Accuracy (μ ± σ) | Invalid rate | Tok/s (μ ± σ) | Latency (μ ± σ) |
|---|---|---|---|---|
| CPU_ATTN | 0.7056 ± 0.0023 | 0.03% ± 0.06% | 319 ± 33 | 538s ± 58s |
| SpyreAttentionPaged (native) | 0.6972 ± 0.0057 | 0.03% ± 0.06% | 16.3 ± 1.3 | 10,496s ± 891s |
| SpyreAttentionPaged (torch.sdpa) | 0.7058 ± 0.0057 | 0.08% ± 0.06% | 25.3 ± 0.4 | 6,711s ± 110s |

As one can see, all implementations appear to be functionally equivalent and provide similar accuracy. However, there are significant differences in throughput. Most notably, there is an order-of-magnitude drop from the upstream CPU attention to the variant using torch.nn.functional.scaled_dot_product_attention, which should be investigated.
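One likely contributor to the gap between the native and torch.sdpa variants is per-element Python dispatch overhead. The toy comparison below (illustrative only; it does not reproduce the PR's code) shows that a manual per-head Python loop computes the same result as a single batched SDPA call, which is what makes the vectorized/fused path attractive:

```python
# Illustrative: a per-head Python loop vs one batched SDPA call.
# Both compute the same attention; the loop pays Python/dispatch
# overhead per head, which the fused call avoids.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 4, 1, 32)    # (batch, heads, q_tokens, head_dim)
k = torch.randn(1, 4, 12, 32)
v = torch.randn(1, 4, 12, 32)

# Naive loop: one matmul + softmax + matmul per head.
outs = []
for h in range(q.shape[1]):
    scores = q[:, h] @ k[:, h].transpose(-1, -2) / (32 ** 0.5)
    outs.append(torch.softmax(scores, dim=-1) @ v[:, h])
loop_out = torch.stack(outs, dim=1)

# Single fused call over all heads at once.
sdpa_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(loop_out, sdpa_out, atol=1e-5))  # True
```

In a real decode step this loop would run per sequence and per layer, so the overhead multiplies quickly; that said, the actual cause of the slowdown reported above still needs profiling, as the comment says.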

@joerunde
Collaborator

joerunde commented Mar 9, 2026

Most notably, there is an order of magnitude drop from the upstream CPU attention to the variant using torch.nn.functional.scaled_dot_product_attention, which should be investigated.

@bohnstingl wouldn't this be expected since we're passing data back and forth from the cpu and spyre cards a bunch of times during every forward pass to run the attention on spyre but everything else on cpu?

edit: oh, never mind, this is all running on CPU anyway. I think I'd still expect, though, that the CPU kernels shipped with vLLM are much more efficient.

"What are IBMs main businesses?",
]

engine = LLM(
Collaborator


We should probably give this file a descriptive name. Also, this seems to fail for me with NotImplementedError: Sliding window not supported yet as-is; is it missing a configuration to disable the sliding window?

@joerunde
Collaborator

joerunde commented Mar 9, 2026

bot:next-test

@bohnstingl bohnstingl marked this pull request as draft March 9, 2026 22:37
@bohnstingl bohnstingl requested a review from jvlunteren March 9, 2026 22:37