[ROCm] Make Whisper causal attention backend-agnostic#34631

Open
laudney wants to merge 1 commit into vllm-project:main from mmonad:fix/whisper-backend-agnostic

Conversation

@laudney
Contributor

@laudney laudney commented Feb 16, 2026

Summary

  • Remove hardcoded backend allowlist (FlashAttentionBackend, AiterFlashAttentionBackend, RocmAttentionBackend, TritonAttentionBackend) from whisper_causal.py
  • Remove the corresponding explicit imports and _SUPPORTED_BACKENDS check
  • The model already uses get_attn_backend from the attention selector and subclass_attention_backend_with_overrides to wrap the selected backend — the allowlist was redundant and blocked backends that work fine (e.g. on RDNA4/gfx12)

This is a pure deletion (~39 lines removed, 0 added). The subclass_attention_backend_with_overrides mechanism already validates backend compatibility at a lower level.
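To illustrate the difference, here is a minimal sketch of the two patterns. The class and helper names below are stand-ins invented for this example; only get_attn_backend and subclass_attention_backend_with_overrides are identifiers actually mentioned in the PR, and this is not vLLM's real code.

```python
class AttentionBackend:
    """Minimal stand-in for a vLLM attention backend class."""

class FlashAttentionBackend(AttentionBackend):
    pass

class TritonAttentionBackend(AttentionBackend):
    pass

# Old pattern (deleted by this PR): a hardcoded allowlist that rejects any
# backend not explicitly imported here, even ones that would work fine.
_SUPPORTED_BACKENDS = (FlashAttentionBackend,)

def wrap_backend_old(backend_cls):
    if not issubclass(backend_cls, _SUPPORTED_BACKENDS):
        raise NotImplementedError(
            f"{backend_cls.__name__} is not in the Whisper causal allowlist")
    return backend_cls

# New pattern: trust the attention selector's choice and simply subclass it
# with causal overrides (toy stand-in for
# subclass_attention_backend_with_overrides).
def wrap_backend_new(backend_cls):
    return type(f"WhisperCausal{backend_cls.__name__}",
                (backend_cls,), {"attn_causal": True})

wrapped = wrap_backend_new(TritonAttentionBackend)
print(wrapped.__name__)     # WhisperCausalTritonAttentionBackend
print(wrapped.attn_causal)  # True
```

With the old pattern, TritonAttentionBackend would raise even though it works; with the new one, whatever backend the selector picks is wrapped uniformly.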

Test plan

  • Whisper causal inference on FlashAttention (CUDA) — no behavior change
  • Whisper causal inference on ROCm with non-Flash backends
  • Existing CI should pass (no new code paths)

@mergify mergify bot added the rocm Related to AMD ROCm label Feb 16, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request makes the Whisper causal attention backend-agnostic by removing a hardcoded allowlist of backends. This change correctly identifies that the explicit list was redundant, as backend validation is already handled by get_attn_backend. By deleting the allowlist and associated imports, the code is simplified and more maintainable, and it enables support for newer backends on platforms like ROCm without requiring modifications to this file. The changes are sound and represent a good improvement.

@DarkLight1337
Member

DarkLight1337 commented Feb 17, 2026

cc @tjtanaa @AndreasKaratzas can you verify this model/attention backend combination?

@laudney
Contributor Author

laudney commented Feb 17, 2026

Related PRs (RDNA4/gfx12 series)

This PR is part of a series enabling RDNA4 (gfx12) support in vLLM:

Each PR is independent and can be reviewed/merged separately.

@laudney laudney force-pushed the fix/whisper-backend-agnostic branch from 77856e7 to a3d136a Compare February 17, 2026 20:22
@mergify

mergify bot commented Feb 17, 2026

Hi @laudney, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@laudney laudney force-pushed the fix/whisper-backend-agnostic branch from a3d136a to 23aef46 Compare February 17, 2026 20:44
@AndreasKaratzas
Collaborator

cc @tjtanaa @AndreasKaratzas can you verify this model/attention backend combination?

Definitely. @laudney, can you please help me save some time? Would it be easy to put all of those PRs into one branch and give me the fork and commit hash of that branch? If you do that, I will certainly be able to evaluate your changes and share the results here :)

@laudney
Contributor Author

laudney commented Feb 17, 2026

@AndreasKaratzas Sure! Here you go:

Fork: mmonad/vllm
Branch: feat/rocm-rdna4-combined
Commit: 0a3d6653a0d414a047904f301f9190305250c672

This branch contains all 8 commits from the 5 PRs on top of upstream/main:

  1. 21c8125 — [ROCm] Make Whisper causal attention backend-agnostic (#34631)
  2. 9ae352a — [ROCm] Use supports_fp8() for FP8 feature gates instead of arch checks (#34740)
  3. 0fbf2a8 — [ROCm] Enable FP8 KV-cache and relax constraints for RDNA4 custom paged attention (#34741)
  4. 79b1817 — [ROCm] Enable LLMM1 skinny GEMM kernel for RDNA4/gfx1x decode (#34709)
  5. f838bfd — [ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode (#34709)
  6. dd3c6f9 — [ROCm] Enable wvSplitKQ FP8 skinny GEMM kernel for RDNA4/gfx12 decode (#34709)
  7. dd153f6 — [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12 (#34632)
  8. 0a3d665 — Fix MXFP4 dequant kernel: use MoEActivation enum instead of strings (#34632)

Thank you for taking the time to evaluate!

@AndreasKaratzas
Collaborator

Will let you know the results as soon as I can :) We are trying to address some high-priority tasks first so that CI reports results accurately, and I'm probably going to get back to you by this time tomorrow. Sorry for the delay.

@laudney
Contributor Author

laudney commented Feb 18, 2026

@AndreasKaratzas No worries at all, take your time! Really appreciate you looking into this. We all want better AMD support in this amazing project, so happy to help if any questions come up during testing.

@AndreasKaratzas
Collaborator

@AndreasKaratzas No worries at all, take your time! Really appreciate you looking into this. We all want better AMD support in this amazing project, so happy to help if any questions come up during testing.

Full CI run is up: https://buildkite.com/vllm/amd-ci/builds/4991/steps/canvas

@AndreasKaratzas
Collaborator

@laudney I am going to take a look at this again tomorrow :) Sorry for the delay; I have been working on some critical CI-related tasks.

@AndreasKaratzas
Collaborator

Full CI build as of yesterday: https://buildkite.com/vllm/amd-ci/builds/4991/steps/canvas

There were no new regressions observed. At the same time, I realize that these changes mostly affect other architectures, not gfx9, so I should ask for testing on gfx11/12 as well. Also, there are changes inside the attention.cu file, which means we should do a perf analysis of that.

cc @tjtanaa @gshtras

@laudney laudney force-pushed the fix/whisper-backend-agnostic branch from 23aef46 to acee9b8 Compare March 22, 2026 12:48
@laudney
Contributor Author

laudney commented Mar 22, 2026

Hey, this is approved and rebased on latest main. What else do I need to do to get it merged?

@DarkLight1337
Member

From @AndreasKaratzas

Therefore, I should ask for testing on gfx11/12 as well. Also, there are changes inside the attention.cu file, which means we should do a perf analysis of that.

@AndreasKaratzas
Collaborator

From @AndreasKaratzas

Therefore, I should ask for testing on gfx11/12 as well. Also, there are changes inside the attention.cu file, which means we should do a perf analysis of that.

I see that this PR has probably been heavily refactored. Specifically, the file that I noted in my first comment (attention.cu) is no longer in the diff, so I think I am good with this PR as long as it passes the tests. @laudney please rebase on latest main so that we can compare failures.

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 23, 2026
Remove the FlashAttentionBackend-only guard in whisper_causal.py so
that Voxtral and other Whisper-based models can run on ROCm/RDNA4
with the Triton attention backend.

- Remove issubclass(backend, FlashAttentionBackend) check
- Delegate get_kv_cache_shape to the underlying backend instead of
  hardcoding Flash's (2, num_blocks, ...) layout

Signed-off-by: L.B.R. <lbr@mmonad.com>
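The get_kv_cache_shape delegation in the commit message can be sketched as follows. The function signature and the alternative Triton layout below are illustrative assumptions; only the (2, num_blocks, ...) Flash layout comes from the commit message, and none of this is vLLM's actual code.

```python
class FlashLikeBackend:
    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        # Flash-style layout: K and V stacked along a leading dimension of 2.
        return (2, num_blocks, block_size, num_kv_heads, head_size)

class TritonLikeBackend:
    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        # A different backend may choose a different KV-cache layout
        # (hypothetical here), so the wrapper must not assume Flash's.
        return (num_blocks, 2, block_size, num_kv_heads, head_size)

def make_whisper_causal(base):
    # Subclass WITHOUT overriding get_kv_cache_shape, so each backend's own
    # layout is used instead of a hardcoded Flash layout.
    return type(f"WhisperCausal{base.__name__}", (base,), {})

print(make_whisper_causal(FlashLikeBackend).get_kv_cache_shape(4, 16, 8, 64))
print(make_whisper_causal(TritonLikeBackend).get_kv_cache_shape(4, 16, 8, 64))
```

Hardcoding the first shape in the wrapper would silently corrupt the cache for any backend using the second layout; delegating makes the wrapper layout-agnostic.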
@laudney laudney force-pushed the fix/whisper-backend-agnostic branch from acee9b8 to 4a4f4d8 Compare March 23, 2026 09:08
@laudney
Contributor Author

laudney commented Mar 23, 2026

@AndreasKaratzas Rebased on latest main. Ready for CI.
