[Spyre-Next] Integrated custom attention backend #798

Draft
bohnstingl wants to merge 16 commits into torch-spyre:main from bohnstingl:origin/pytorch_paged_attn

Conversation

@bohnstingl
Collaborator

Description

This PR provides the skeleton for adding new attention backends to the vllm_spyre_next plugin.
It uses the first draft implementation of the PyTorch-native paged attention as an example.

Related Issues

#648 and former PR #774

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

cc @tdoublep @dilipgb @joerunde

Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
@bohnstingl bohnstingl requested review from joerunde, tdoublep and yannicks1 and removed request for joerunde March 6, 2026 09:21
@bohnstingl bohnstingl self-assigned this Mar 6, 2026
@github-actions

github-actions Bot commented Mar 6, 2026

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@bohnstingl
Collaborator Author

I updated this PR with some new features. In particular, I

  • Vectorized some parts that were previously Python loops (referred to as SpyreAttentionPaged (native) in the table below)
  • Introduced a flag use_sdpa, which enables the use of torch.nn.functional.scaled_dot_product_attention (referred to as SpyreAttentionPaged (torch.sdpa) in the table below)
  • Functionally verified the two implementations against the stripped correctness test from vLLM upstream.
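To illustrate the use_sdpa path, here is a minimal, self-contained sketch of one paged-attention decode step in pure PyTorch. The names and cache layout (block_table, k_cache shape, etc.) are illustrative assumptions, not the PR's actual API: a sequence's KV blocks are gathered through a block table and then handed to torch.nn.functional.scaled_dot_product_attention in one call.

```python
# Hypothetical sketch of an SDPA-based paged-attention decode step.
# The cache layout and names are illustrative, not the PR's actual API.
import torch
import torch.nn.functional as F

num_blocks, block_size, num_heads, head_dim = 8, 4, 2, 16
k_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)
v_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)

# A sequence of length 10 occupying (possibly non-contiguous) blocks 5, 2, 7.
block_table = torch.tensor([5, 2, 7])
seq_len = 10

# Gather the sequence's blocks, flatten to tokens, trim padding in the last block.
k = k_cache[block_table].reshape(-1, num_heads, head_dim)[:seq_len]
v = v_cache[block_table].reshape(-1, num_heads, head_dim)[:seq_len]

# One new query token; SDPA expects (batch, heads, tokens, head_dim).
q = torch.randn(1, num_heads, 1, head_dim)
k = k.permute(1, 0, 2).unsqueeze(0)  # (1, heads, seq_len, head_dim)
v = v.permute(1, 0, 2).unsqueeze(0)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 1, 16])
```

The point of the block-table indirection is that the KV cache for one sequence need not be contiguous in memory; the gather makes it contiguous just for the attention call.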

Also, I evaluated the implementation with three runs of the GSM8K test from upstream vLLM over the full 1319 questions. The results are in the table below:

| Backend | Accuracy (μ ± σ) | Invalid rate | Tok/s (μ ± σ) | Latency (μ ± σ) |
|---|---|---|---|---|
| CPU_ATTN | 0.7056 ± 0.0023 | 0.03% ± 0.06% | 319 ± 33 | 538s ± 58s |
| SpyreAttentionPaged (native) | 0.6972 ± 0.0057 | 0.03% ± 0.06% | 16.3 ± 1.3 | 10,496s ± 891s |
| SpyreAttentionPaged (torch.sdpa) | 0.7058 ± 0.0057 | 0.08% ± 0.06% | 25.3 ± 0.4 | 6,711s ± 110s |

As one can see, all implementations appear to be functionally equivalent and provide similar accuracy. However, there are significant differences in throughput. Most notably, there is an order-of-magnitude drop from the upstream CPU attention to the variant using torch.nn.functional.scaled_dot_product_attention, which should be investigated.
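One likely contributor to the gap between the native and torch.sdpa variants is per-element Python dispatch overhead. The toy comparison below (illustrative only; it does not reproduce the PR's code) shows that a manual per-head Python loop computes the same result as a single batched SDPA call, which is what makes the vectorized/fused path attractive:

```python
# Illustrative: a per-head Python loop vs one batched SDPA call.
# Both compute the same attention; the loop pays Python/dispatch
# overhead per head, which the fused call avoids.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 4, 1, 32)    # (batch, heads, q_tokens, head_dim)
k = torch.randn(1, 4, 12, 32)
v = torch.randn(1, 4, 12, 32)

# Naive loop: one matmul + softmax + matmul per head.
outs = []
for h in range(q.shape[1]):
    scores = q[:, h] @ k[:, h].transpose(-1, -2) / (32 ** 0.5)
    outs.append(torch.softmax(scores, dim=-1) @ v[:, h])
loop_out = torch.stack(outs, dim=1)

# Single fused call over all heads at once.
sdpa_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(loop_out, sdpa_out, atol=1e-5))  # True
```

In a real decode step this loop would run per sequence and per layer, so the overhead multiplies quickly; that said, the actual cause of the slowdown reported above still needs profiling, as the comment says.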

@joerunde
Collaborator

joerunde commented Mar 9, 2026

Most notably, there is an order of magnitude drop from the upstream CPU attention to the variant using torch.nn.functional.scaled_dot_product_attention, which should be investigated.

@bohnstingl wouldn't this be expected since we're passing data back and forth from the cpu and spyre cards a bunch of times during every forward pass to run the attention on spyre but everything else on cpu?

edit: oh, never mind, this is all running on CPU anyway. I think I'd still expect, though, that the CPU kernels shipped with vLLM are much more efficient.

"What are IBMs main businesses?",
]

engine = LLM(
Collaborator


We should probably give this file a descriptive name. Also, this seems to fail for me with NotImplementedError: Sliding window not supported yet as-is; is it missing a configuration to disable the sliding window?

@joerunde
Collaborator

joerunde commented Mar 9, 2026

bot:next-test

@bohnstingl bohnstingl marked this pull request as draft March 9, 2026 22:37
@bohnstingl bohnstingl requested a review from jvlunteren March 9, 2026 22:37