[Spyre-Next] Integrated custom attention backend #798
bohnstingl wants to merge 16 commits into torch-spyre:main
Conversation
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
I updated this PR with some new features. In particular, I
Also, I evaluated the implementation with three runs on the GSM8K test from upstream vLLM, using the full set of 1319 questions. Below is a table with the results.
As one can see, all implementations appear to be functionally equivalent and provide similar accuracy. However, there are significant differences in throughput. Most notably, there is an order-of-magnitude drop from the upstream CPU attention to the variant using
@bohnstingl wouldn't this be expected, since we're passing data back and forth between the CPU and the Spyre cards many times during every forward pass to run the attention on Spyre but everything else on CPU? Edit: oh, never mind, this is all running on CPU anyway. I'd still expect, though, that the CPU kernels shipped with vLLM are much more efficient.
| "What are IBMs main businesses?", | ||
| ] | ||
|
|
||
| engine = LLM( |
We should probably give this file a descriptive name. Also, this fails for me as-is with `NotImplementedError: Sliding window not supported yet`. Is it missing a configuration to disable the sliding window?
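If the failure comes from the model's sliding-window setting, one possible workaround is to disable it at engine construction. This is a sketch, not the PR's actual configuration: recent vLLM versions expose a `disable_sliding_window` engine argument, but whether the Spyre plugin honors it is an assumption, and the model name below is purely illustrative.

```python
from vllm import LLM

# Hypothetical workaround: turn off sliding-window attention via the
# engine argument (availability depends on the vLLM version installed,
# and the model name here is just a placeholder).
engine = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",  # illustrative only
    disable_sliding_window=True,
)
```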
Signed-off-by: Joe Runde <joe@joerun.de>
bot:next-test
Description
This PR provides the skeleton for adding new attention backends to the vllm_spyre_next plugin.
It uses the first draft implementation of the PyTorch-native paged attention as an example.
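The backend skeleton itself is not reproduced here; for intuition, the core paged-attention computation (a per-sequence block table mapping logical KV blocks to physical blocks of fixed size) can be sketched as follows. This is a minimal NumPy illustration, not the plugin's actual API: all function names, argument names, and tensor layouts are assumptions.

```python
import numpy as np

def paged_attention(q, k_cache, v_cache, block_table, seq_len, block_size):
    """Single-query attention over a block-paged KV cache (toy sketch).

    q:           (num_heads, head_dim) query for the current decode token
    k_cache:     (num_physical_blocks, block_size, num_heads, head_dim)
    v_cache:     same shape as k_cache
    block_table: maps this sequence's logical block index -> physical block id
    seq_len:     number of valid cached tokens for this sequence
    """
    num_heads, head_dim = q.shape
    # Gather this sequence's keys/values out of the paged cache.
    n_blocks = (seq_len + block_size - 1) // block_size
    phys = block_table[:n_blocks]
    k = k_cache[phys].reshape(-1, num_heads, head_dim)[:seq_len]  # (S, H, D)
    v = v_cache[phys].reshape(-1, num_heads, head_dim)[:seq_len]  # (S, H, D)
    # Scaled dot-product attention per head.
    scores = np.einsum("hd,shd->hs", q, k) / np.sqrt(head_dim)    # (H, S)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("hs,shd->hd", probs, v)                      # (H, D)
```

The point of the block table is that a sequence's KV blocks need not be contiguous in physical memory; the gather step reassembles them in logical order before the usual softmax attention.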
Related Issues
#648 and former PR #774
Checklist
- [ ] Code is formatted (`bash format.sh`)
- [ ] Commits include a `Signed-off-by:` line (DCO compliance)

cc @tdoublep @dilipgb @joerunde