[ROCm] Support for Whisper v1 with Aiter Unified Attention and Aiter Flash Attention #28376
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small subset of CI tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds support for Whisper v1 on ROCm by enabling Aiter Unified Attention and Aiter Flash Attention for cross-attention workloads. The changes correctly modify attention type validation to allow for the ENCODER_DECODER attention type. However, I've identified a critical issue in the AiterFlashAttentionImpl where it incorrectly attempts to use the paged KV cache as a contiguous tensor for cross-attention when the key and value tensors are not provided. This will result in incorrect behavior and must be addressed.
```python
key = key[:num_actual_tokens] if key is not None else key_cache[:num_actual_tokens]
value = value[:num_actual_tokens] if value is not None else value_cache[:num_actual_tokens]
```
The logic to handle key and value being None for cross-attention is incorrect. The key_cache and value_cache are paged tensors, not contiguous tensors of keys and values.
The shape of key_cache is [num_blocks, block_size, num_kv_heads, head_size]. Slicing it with [:num_actual_tokens] incorrectly treats num_actual_tokens as a number of blocks and will result in a tensor with an incorrect shape and contents, leading to errors or wrong results in the attention computation.
For cross-attention where key and value are None, the keys and values must be gathered from the paged KV cache into a contiguous tensor before being passed to attention functions like aiter.flash_attn_varlen_func. The extend_forward method in this same class provides an example of how to do this using cp_mha_gather_cache. A similar approach should be adopted for the prefill and decode paths when key and value are None.
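For illustration, here is a minimal plain-PyTorch sketch of that gathering step. It is not the aiter cp_mha_gather_cache kernel itself; the helper name and the block-table/sequence-length arguments are assumptions for the example, and the cache layout follows the shape quoted above.

```python
import torch

def gather_kv_from_paged_cache(
    key_cache: torch.Tensor,    # [num_blocks, block_size, num_kv_heads, head_size]
    value_cache: torch.Tensor,  # same layout as key_cache
    block_table: torch.Tensor,  # [num_seqs, max_blocks_per_seq], block ids per sequence
    seq_lens: torch.Tensor,     # [num_seqs], number of cached tokens per sequence
) -> tuple[torch.Tensor, torch.Tensor]:
    """Gather paged KV entries into contiguous [total_tokens, num_kv_heads, head_size] tensors."""
    block_size = key_cache.shape[1]
    keys, values = [], []
    for seq_idx in range(block_table.shape[0]):
        seq_len = int(seq_lens[seq_idx])
        num_blocks = (seq_len + block_size - 1) // block_size
        blocks = block_table[seq_idx, :num_blocks].long()
        # [num_blocks, block_size, H, D] -> [num_blocks * block_size, H, D] -> trim padding
        k = key_cache[blocks].reshape(-1, *key_cache.shape[2:])[:seq_len]
        v = value_cache[blocks].reshape(-1, *value_cache.shape[2:])[:seq_len]
        keys.append(k)
        values.append(v)
    return torch.cat(keys, dim=0), torch.cat(values, dim=0)
```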
Fixed in latest commit to correctly handle None key/value in cross-attention.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
 RocmAttentionBackend.validate_head_size(head_size)

-if attn_type != AttentionType.DECODER:
+if attn_type not in [AttentionType.DECODER, AttentionType.ENCODER_DECODER]:
     raise NotImplementedError(
-        "Encoder self-attention and "
-        "encoder/decoder cross-attention "
-        "are not implemented for "
-        "RocmAttentionImpl"
+        "Encoder self-attention is not implemented for RocmAttentionImpl"
     )

 self.fp8_dtype = current_platform.fp8_dtype()
```
Handle encoder–decoder calls without key/value tensors
The constructor now accepts AttentionType.ENCODER_DECODER, but forward still assumes key and value are always present. During decoder-side cross attention, later decode steps reuse the encoder KV cache and invoke this path with key=None/value=None. The new guard no longer blocks these calls, so chunked_prefill_paged_decode immediately dereferences key[:num_actual_tokens] and key.shape, raising an exception before any attention is computed. Either revert the constructor restriction or update forward to fall back to the cached tensors when key/value are None.
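A hedged sketch of that fallback, reusing the illustrative gather helper from the earlier comment; the metadata field names below are assumptions, not the actual vLLM/aiter attributes:

```python
# Illustrative only: `gather_kv_from_paged_cache` and the attn_metadata field
# names are assumed for this sketch, not real vLLM/aiter APIs.
if key is None or value is None:
    # Decoder-side cross-attention on later decode steps: the encoder KV was
    # written to the paged cache earlier, so gather it into contiguous tensors
    # instead of slicing the paged cache directly.
    key, value = gather_kv_from_paged_cache(
        key_cache,
        value_cache,
        attn_metadata.cross_block_table,  # assumed field name
        attn_metadata.encoder_seq_lens,   # assumed field name
    )
else:
    key = key[:num_actual_tokens]
    value = value[:num_actual_tokens]
```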
@apinge I understand that you have stated that we need to use the latest AITER commit. Is there any chance that this works with the AITER version in vllm/docker/Dockerfile.rocm_base (line 10 in 6b2b9fd)?
Force-pushed from b80b469 to 3f33199.
This aiter version works fine, provided that #28383 is applied.
This pull request has merge conflicts that must be resolved before it can be merged.
The same review questions were raised on PR #28346. We will wait for the other PR to sort out these issues.
…ash attention Signed-off-by: apinge <[email protected]>
Signed-off-by: apinge <[email protected]>
Signed-off-by: apinge <[email protected]>
Signed-off-by: apinge <[email protected]>
Force-pushed from 3f33199 to 110481d.
I found that the changes addressing the review questions are causing an accuracy problem; I've left a comment in #28346. Also, testing shows that the Aiter Flash Attention backend can hit a NaN issue for some prompts, which has been fixed in PR #28670.
…t#28376 Signed-off-by: Andreas Karatzas <[email protected]>
Purpose
This PR enables broader attention backend support for Whisper v1 on the ROCm platform.
Building on the existing Triton backend PR #28346, it introduces:
Aiter Unified Attention
Aiter Flash Attention
This change depends on modifications from the Triton backend PR, since both PRs modify the same file (vllm/v1/worker/utils.py).

Test Plan
Whisper v1 on ROCm with the Aiter backends requires the latest Aiter version; tested with commit 7639e55.
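For reference, a rough offline test setup might look like the sketch below. It is adapted from vLLM's offline Whisper example; the environment-variable values used to select the Aiter backend (and the model id) are assumptions for illustration, not settings confirmed by this PR.

```python
import os

# Assumed settings: enable aiter kernels on ROCm and pick an attention backend.
# The exact backend name may differ; check vllm/envs.py and the ROCm platform code.
os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "ROCM_AITER_FA"  # assumed name for Aiter Flash Attention

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# Whisper is an encoder-decoder model; the audio clip is passed via multi_modal_data.
audio_and_rate = AudioAsset("mary_had_lamb").audio_and_sample_rate

llm = LLM(model="openai/whisper-large-v3", max_model_len=448)
outputs = llm.generate(
    [{
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {"audio": audio_and_rate},
    }],
    SamplingParams(temperature=0, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```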
Test Result
Result of Aiter Unified Attention
Result for Aiter Flash Attention
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.