[Spyre-Next] Pytorch Native Attention on Spyre #853
Merged: jvlunteren merged 51 commits into torch-spyre:main from jvlunteren:pytorch_native_attention on Apr 13, 2026.
Commits (51):
a0375d7  bohnstingl  Integrated custom attention backend
11255ac  bohnstingl  Formatting issues
89d5a75  bohnstingl  Changed the name of the attention operation
bfbc64a  bohnstingl  Changed filename
f8afb02  bohnstingl  Implemented gather to avoid using full KV cache
df3ab2c  bohnstingl  Removed .item() calls
8e0bd74  bohnstingl  Cleanup and adding of example
90ce563  bohnstingl  Lint
8d314b9  bohnstingl  Added testcase for attention backend
0f34475  bohnstingl  Added missing utils file
3bc3ee6  bohnstingl  Reformat
c2d264b  bohnstingl  Functional update
2e8e4aa  bohnstingl  Lint issues
14b6ef7  joerunde    :art: linting, vllm compatibility, test integration
c98a9a2  jvlunteren  refactored attention backend to support compilation and execution on …
825a95c  jvlunteren  formatting
a5c719f  jvlunteren  add unit test
6da9be4  jvlunteren  formatting
61e22f1  jvlunteren  removed redundant code
2d8bb12  jvlunteren  added empty line back
139ab4a  jvlunteren  formatting
9bf1283  jvlunteren  removed custom num_heads handling
0931224  jvlunteren  removed compat_utils.py
621df53  jvlunteren  renamed spyre_paged_attn.py to spyre_attn.py
2bf45c1  jvlunteren  add dynamic=False argument to torch.compile
9919ba2  jvlunteren  adapted test_spyre_attn.py to previous name change
a8c26f6  jvlunteren  limit supported data types to float16
6338c32  jvlunteren  limit supported kv cache data types to float16
e118284  jvlunteren  removed redundant code
82d2daf  jvlunteren  indicated if steps are executed on CPU and/or Spyre
781c095  jvlunteren  renaming
c2556b3  jvlunteren  further renaming
a2eadbd  jvlunteren  use utils for transfers between cpu and spyre
0c6100f  jvlunteren  various updates to test
d25ec3f  jvlunteren  formatting
49b6109  bohnstingl  WIP: reworked D2H movements
88e12ea  jvlunteren  fixed supports_head_size()
17d2194  bohnstingl  Merge branch 'pytorch_native_attention' of github.com:jvlunteren/vllm…
a13a657  bohnstingl  Enforce dtype="float16"
91a24d6  bohnstingl  Moved assert
dc5a07c  bohnstingl  Corrected stripped attention test
80c7cc9  bohnstingl  Updates to address review comments
8adae0a  bohnstingl  Merge branch 'main' of github.com:vllm-project/vllm-spyre into pytorc…
af9d8f9  bohnstingl  Integrated minor review findings
c6fd7f9  bohnstingl  Merge branch 'main' of github.com:vllm-project/vllm-spyre into pytorc…
7882018  bohnstingl  Integrated reviewer comments and suggestions
fef4e7f  bohnstingl  Fixing formatting errors
3d3a169  bohnstingl  Switched KV cache format to (num_blocks, 2, ...)
1749821  bohnstingl  Removed outdated max_num_seqs==1 restriction
a3eecc5  bohnstingl  Removed enforce_eager argument
3b02c2c  jvlunteren  Merge branch 'main' into pytorch_native_attention
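Several commits above concern how the attention backend reads the paged KV cache: f8afb02 introduces a gather so the full cache is never materialized, 3d3a169 switches the cache layout to (num_blocks, 2, ...), and a8c26f6/6338c32 restrict device data types to float16. A minimal sketch of the gather idea follows; it is illustrative only, not the actual vllm-spyre implementation, and all shapes and variable names are assumptions (float32 is used here so the example runs on plain CPU PyTorch, even though the PR limits the device path to float16).

```python
import torch

# Hypothetical sizes, chosen for illustration only.
num_blocks, block_size = 16, 4
num_kv_heads, head_dim = 2, 8

# Cache layout as in commit 3d3a169: (num_blocks, 2, block_size, heads, dim),
# where index 0 along dim 1 holds K and index 1 holds V (an assumption).
kv_cache = torch.randn(num_blocks, 2, block_size, num_kv_heads, head_dim)

# Block table for one sequence: the physical blocks that hold its tokens.
block_table = torch.tensor([3, 7, 1])

# Gather only this sequence's blocks instead of touching the full cache.
seq_kv = kv_cache.index_select(0, block_table)        # (3, 2, 4, heads, dim)
k = seq_kv[:, 0].reshape(-1, num_kv_heads, head_dim)  # (12, heads, dim)
v = seq_kv[:, 1].reshape(-1, num_kv_heads, head_dim)

# Single-token decode query, then standard PyTorch attention over the
# gathered keys/values (SDPA expects (..., seq_len, head_dim)).
q = torch.randn(num_kv_heads, 1, head_dim)
out = torch.nn.functional.scaled_dot_product_attention(
    q, k.transpose(0, 1), v.transpose(0, 1))          # (heads, 1, dim)
```

The gather keeps the per-step working set proportional to the sequence's own block count rather than the whole cache, which is the motivation stated in the commit message.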
Review comment: why would we want this to default to 2? This defines the number of tokens in the batch; 2 is very low, does it even work? We still use the base scheduler in `vllm_spyre_next`, so the value is not overridden for our granite3.3-8b model, as was the case in `vllm_spyre`...
Reply: +1. This will create a lot of chunked prefills for the example below, I guess? Is this something that even works right now?
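The reviewers' concern can be made concrete with a small back-of-the-envelope calculation. Assuming the questioned default caps the number of tokens scheduled per step at 2 (the variable names below are illustrative, not taken from the codebase), a single modest prompt would be split into hundreds of chunked-prefill steps:

```python
# Illustrative arithmetic for the review comment above: a per-step token
# budget of 2 forces a prompt to be prefilled in many tiny chunks.
prompt_len = 512
max_num_batched_tokens = 2  # the default being questioned

# Ceiling division: number of scheduler steps needed to prefill the prompt.
steps = -(-prompt_len // max_num_batched_tokens)
print(steps)  # one 512-token prompt alone needs 256 prefill steps
```

This is why the reviewers ask whether such a low default is usable at all, given that the base scheduler in `vllm_spyre_next` does not override it.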