
Conversation

@apinge

@apinge apinge commented Nov 10, 2025

Purpose

This PR enables broader attention backend support for Whisper v1 on the ROCm platform.
Building on the existing Triton backend PR #28346, it introduces:

Aiter Unified Attention
Aiter Flash Attention

This change depends on modifications from the Triton backend PR, since both PRs modify the same file (vllm/v1/worker/utils.py).

Test Plan

Whisper v1 on ROCm with the Aiter backends requires the latest Aiter version; tested with commit 7639e55.

export CUDA_VISIBLE_DEVICES=0,1
export VLLM_USE_ROCM_AITER=1
# test aiter unified attention
export VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN
pytest ./tests/models/multimodal/generation/test_whisper.py
# test aiter flash attention
export VLLM_ATTENTION_BACKEND=ROCM_AITER_FA
pytest ./tests/models/multimodal/generation/test_whisper.py

Test Result

Result of Aiter Unified Attention

============================================================================================ test session starts =============================================================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /root/workspace/vllm_master_251109
configfile: pyproject.toml
plugins: asyncio-1.2.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 3 items

models/multimodal/generation/test_whisper.py ...                                                                                                                                                       [100%]

============================================================================================== warnings summary ==============================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../usr/local/lib/python3.12/dist-packages/audioread/rawread.py:16
  /usr/local/lib/python3.12/dist-packages/audioread/rawread.py:16: DeprecationWarning: 'aifc' is deprecated and slated for removal in Python 3.13
    import aifc

../../../../usr/local/lib/python3.12/dist-packages/audioread/rawread.py:17
  /usr/local/lib/python3.12/dist-packages/audioread/rawread.py:17: DeprecationWarning: 'audioop' is deprecated and slated for removal in Python 3.13
    import audioop

../../../../usr/local/lib/python3.12/dist-packages/audioread/rawread.py:19
  /usr/local/lib/python3.12/dist-packages/audioread/rawread.py:19: DeprecationWarning: 'sunau' is deprecated and slated for removal in Python 3.13
    import sunau

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================= 3 passed, 5 warnings in 31.29s =======================================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Result for Aiter Flash Attention

pytest ./models/multimodal/generation/test_whisper.py
============================================================================================ test session starts =============================================================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /root/workspace/vllm_master_251109
configfile: pyproject.toml
plugins: asyncio-1.2.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 3 items

models/multimodal/generation/test_whisper.py ...                                                                                                                                                       [100%]

============================================================================================== warnings summary ==============================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../usr/local/lib/python3.12/dist-packages/audioread/rawread.py:16
  /usr/local/lib/python3.12/dist-packages/audioread/rawread.py:16: DeprecationWarning: 'aifc' is deprecated and slated for removal in Python 3.13
    import aifc

../../../../usr/local/lib/python3.12/dist-packages/audioread/rawread.py:17
  /usr/local/lib/python3.12/dist-packages/audioread/rawread.py:17: DeprecationWarning: 'audioop' is deprecated and slated for removal in Python 3.13
    import audioop

../../../../usr/local/lib/python3.12/dist-packages/audioread/rawread.py:19
  /usr/local/lib/python3.12/dist-packages/audioread/rawread.py:19: DeprecationWarning: 'sunau' is deprecated and slated for removal in Python 3.13
    import sunau

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================= 3 passed, 5 warnings in 29.38s =======================================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@apinge apinge requested a review from gshtras as a code owner November 10, 2025 03:20
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the rocm (Related to AMD ROCm) and v1 labels Nov 10, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Whisper v1 on ROCm by enabling Aiter Unified Attention and Aiter Flash Attention for cross-attention workloads. The changes correctly modify attention type validation to allow for the ENCODER_DECODER attention type. However, I've identified a critical issue in the AiterFlashAttentionImpl where it incorrectly attempts to use the paged KV cache as a contiguous tensor for cross-attention when the key and value tensors are not provided. This will result in incorrect behavior and must be addressed.

Comment on lines 732 to 733
key = key[:num_actual_tokens] if key is not None else key_cache[:num_actual_tokens]
value = value[:num_actual_tokens] if value is not None else value_cache[:num_actual_tokens]
Contributor


critical

The logic to handle key and value being None for cross-attention is incorrect. The key_cache and value_cache are paged tensors, not contiguous tensors of keys and values.

The shape of key_cache is [num_blocks, block_size, num_kv_heads, head_size]. Slicing it with [:num_actual_tokens] incorrectly treats num_actual_tokens as a number of blocks and will result in a tensor with an incorrect shape and contents, leading to errors or wrong results in the attention computation.

For cross-attention where key and value are None, the keys and values must be gathered from the paged KV cache into a contiguous tensor before being passed to attention functions like aiter.flash_attn_varlen_func. The extend_forward method in this same class provides an example of how to do this using cp_mha_gather_cache. A similar approach should be adopted for the prefill and decode paths when key and value are None.
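For illustration, a minimal sketch of that gathering step in Python (the function name and the way the slot indices are obtained are assumptions made for this sketch, not the actual cp_mha_gather_cache API):

import torch

def gather_kv_from_paged_cache(key_cache, value_cache, slots):
    # key_cache / value_cache: [num_blocks, block_size, num_kv_heads, head_size]
    # slots: 1-D tensor of physical slot indices, one per token to attend over
    # (for cross-attention these would be derived from the block table).
    num_blocks, block_size, num_kv_heads, head_size = key_cache.shape
    flat_k = key_cache.reshape(num_blocks * block_size, num_kv_heads, head_size)
    flat_v = value_cache.reshape(num_blocks * block_size, num_kv_heads, head_size)
    key = flat_k.index_select(0, slots.long())    # [num_tokens, num_kv_heads, head_size]
    value = flat_v.index_select(0, slots.long())
    return key, value

The resulting contiguous key/value tensors could then be handed to a varlen attention call, which is the shape those functions expect.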

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest commit to correctly handle None key/value in cross-attention.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 248 to 246
         RocmAttentionBackend.validate_head_size(head_size)

-        if attn_type != AttentionType.DECODER:
+        if attn_type not in [AttentionType.DECODER, AttentionType.ENCODER_DECODER]:
             raise NotImplementedError(
-                "Encoder self-attention and "
-                "encoder/decoder cross-attention "
-                "are not implemented for "
-                "RocmAttentionImpl"
+                "Encoder self-attention is not implemented for RocmAttentionImpl"
             )

         self.fp8_dtype = current_platform.fp8_dtype()


P1: Handle encoder–decoder calls without key/value tensors

The constructor now accepts AttentionType.ENCODER_DECODER, but forward still assumes key and value are always present. During decoder-side cross attention, later decode steps reuse the encoder KV cache and invoke this path with key=None/value=None. The new guard no longer blocks these calls, so chunked_prefill_paged_decode immediately dereferences key[:num_actual_tokens] and key.shape, raising an exception before any attention is computed. Either revert the constructor restriction or update forward to fall back to the cached tensors when key/value are None.
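A minimal sketch of the suggested fallback, reusing a gather helper like the one sketched in the earlier review comment (the structure and the attn_metadata.cross_slots field are hypothetical, not the actual implementation):

# Hypothetical shape of the fallback inside forward(); a sketch only.
if key is None or value is None:
    # Decoder-side cross-attention: the encoder K/V already live in the paged
    # cache, so gather them into contiguous tensors instead of slicing inputs
    # that were never passed in.
    key, value = gather_kv_from_paged_cache(
        key_cache, value_cache, attn_metadata.cross_slots  # hypothetical field
    )
else:
    key = key[:num_actual_tokens]
    value = value[:num_actual_tokens]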

Useful? React with 👍 / 👎.

@tjtanaa
Collaborator

tjtanaa commented Nov 10, 2025

@apinge I understand that you have stated that we need to use the latest AITER commit. Is there any chance that this works with the AITER version in

ARG AITER_BRANCH="9716b1b8"
?

@apinge
Author

apinge commented Nov 10, 2025

@apinge I understand that you have stated that we need to use the latest AITER commit. Is there any chance that this works with the AITER version in

ARG AITER_BRANCH="9716b1b8"

?

This aiter version works fine, provided that #28383 is applied.

@mergify

mergify bot commented Nov 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @apinge.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 13, 2025
@tjtanaa
Collaborator

tjtanaa commented Nov 13, 2025

The same review questions apply to PR #28346. We will wait for the other PR to sort out those issues.

@apinge
Author

apinge commented Nov 14, 2025

The same review questions apply to PR #28346. We will wait for the other PR to sort out those issues.

I found that the changes addressing the review questions cause an accuracy problem; I've left a comment in #28346.

Also, testing shows that the Aiter Flash Attention backend can hit a NaN issue for some prompts, and this issue has been fixed in PR #28670 .

@tjtanaa tjtanaa added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 21, 2025
AndreasKaratzas added a commit to ROCm/vllm that referenced this pull request Nov 21, 2025
@micah-wil
Contributor

micah-wil commented Nov 22, 2025

Hi @apinge, could you check the CI failure?

cc @tjtanaa

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 22, 2025 13:01