[Hardware][AMD] Add fused QK RoPE and reshape & cache flash support for ROCm#28850
Conversation
…AITER FlashAttention
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a fused kernel for ROCm to enhance performance for Qwen3 and Qwen3-MoE models. The changes are well-contained and the approach of using a fused kernel for RoPE, zeros, and caching is sound. However, I've identified a critical bug that could cause a NameError on non-ROCm platforms, along with some code duplication and a magic number that should be refactored for better maintainability. Addressing these points will improve the robustness and clarity of the code.
```python
if current_platform.is_rocm():
    from vllm.platforms.rocm import on_gfx9

    if envs.VLLM_ROCM_USE_AITER:
        VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
            envs.VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
        )
    else:
        VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = False
else:
    on_gfx9 = lambda *args, **kwargs: False
```
The variable VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE is only defined if current_platform.is_rocm() is true. This will cause a NameError on other platforms (e.g., CUDA) where this variable is used later in unified_attention_with_output. The definition should be refactored to ensure it's always defined, regardless of the platform.
```diff
-if current_platform.is_rocm():
-    from vllm.platforms.rocm import on_gfx9
-    if envs.VLLM_ROCM_USE_AITER:
-        VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
-            envs.VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
-        )
-    else:
-        VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = False
-else:
-    on_gfx9 = lambda *args, **kwargs: False
+if current_platform.is_rocm():
+    from vllm.platforms.rocm import on_gfx9
+else:
+    on_gfx9 = lambda *args, **kwargs: False
+
+if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
+    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
+        envs.VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
+    )
+else:
+    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = False
```
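The failure mode described above can be reproduced in isolation. This is a minimal sketch (names are illustrative, not vLLM's) showing why a name bound only inside the ROCm branch raises `NameError` on every other platform:

```python
# Minimal repro of the bug called out above: a name bound only in the ROCm
# branch is undefined when read on any other platform, raising NameError.
def lookup_flag(is_rocm: bool):
    if is_rocm:
        fused_rope_flag = True  # only bound on the ROCm path
    try:
        return fused_rope_flag
    except NameError:           # UnboundLocalError is a NameError subclass
        return "NameError"

assert lookup_flag(False) == "NameError"  # non-ROCm path: flag never defined
```

The suggested refactor avoids this by splitting the `on_gfx9` fallback from the flag computation, so the flag is assigned on every platform.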
```python
if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
        envs.VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
    )
else:
    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = False
```
This logic for setting VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE is duplicated from vllm/attention/layer.py. To improve maintainability and avoid potential inconsistencies, this flag should be defined in a single location and imported where needed. Please remove this duplicated block and import the flag from vllm.attention.layer like so:
```python
from vllm.attention.layer import VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
```

The duplicated block in question:

```python
if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
        envs.VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
    )
else:
    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = False
```
This logic for setting VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE is duplicated from vllm/attention/layer.py. To improve maintainability and avoid potential inconsistencies, this flag should be defined in a single location and imported where needed. Please remove this duplicated block and import the flag from vllm.attention.layer like so:
```python
from vllm.attention.layer import VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
```

Context from the diff:

```python
self.kv_cache_dtype,
layer._k_scale,
layer._v_scale,
```

```python
if positions is not None and query.shape[0] <= 256:
```
The value 256 is a magic number that determines the token threshold for using the fused kernel. It should be defined as a named constant, for example _FUSED_QK_ROPE_RESHAPE_AND_CACHE_MAX_TOKENS = 256, at the top of the file to improve readability and maintainability.
```diff
-if positions is not None and query.shape[0] <= 256:
+if positions is not None and query.shape[0] <= 256:  # TODO: make this a constant
```
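The refactor the reviewer asks for could look like the sketch below; the constant name follows the reviewer's example, and the helper function is hypothetical, not vLLM's actual code:

```python
# Named-constant refactor suggested above; the threshold gets a descriptive
# name at module scope instead of a bare 256 in the condition.
_FUSED_QK_ROPE_RESHAPE_AND_CACHE_MAX_TOKENS = 256

def use_fused_path(positions, num_tokens: int) -> bool:
    """Mirror of the guard in the diff, with the magic number named."""
    return (
        positions is not None
        and num_tokens <= _FUSED_QK_ROPE_RESHAPE_AND_CACHE_MAX_TOKENS
    )
```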
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```diff
 if self.use_output:
     output_shape = output_shape if output_shape is not None else query.shape
-    output = torch.empty(output_shape, dtype=output_dtype, device=query.device)
+    if positions is not None:
+        output = torch.empty(
+            output_shape, dtype=query.dtype, device=query.device
+        )
+    else:
+        output = torch.zeros(
+            output_shape, dtype=query.dtype, device=query.device
```
Output buffer now allocated in quantized FP8 dtype
In the attention forward path, the output tensor is now created with dtype=query.dtype (lines 402‑410). When FP8 query quantization is active, self.query_quant converts query to an FP8 tensor before this allocation. The previous code cached the pre‑quantization dtype (output_dtype) so the output buffer remained fp16/bf16. After this change the output is allocated in FP8, but downstream attention kernels expect the regular activation dtype, so the call either fails or produces incorrect results whenever query quantization is enabled. Capture the original dtype before quantizing and use it for output to avoid creating an FP8 output buffer.
Useful? React with 👍 / 👎.
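A torch-free sketch of the fix Codex describes: record the activation dtype before query quantization so the output buffer is never allocated in FP8. The names below are illustrative, not vLLM's actual attributes:

```python
# Capture the dtype BEFORE quantization; the output buffer keeps it even
# when the query buffer is rewritten to an FP8 dtype.
def pick_output_dtype(query_dtype: str, quantize_query: bool):
    output_dtype = query_dtype        # capture pre-quantization dtype
    if quantize_query:
        query_dtype = "fp8_e4m3"      # query buffer becomes FP8
    return query_dtype, output_dtype  # output stays bf16/fp16
```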
CC @tjtanaa
```python
if current_platform.is_rocm():
    from vllm.platforms.rocm import on_gfx9

    if envs.VLLM_ROCM_USE_AITER:
```
AITER flag management is done in `_aiter_ops.py`. Please move all the flags there and use `rocm_aiter_ops.is_enabled()` along with new flags defined there.
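The centralization the reviewer requests might be sketched as below: keep all AITER feature flags behind one helper instead of re-deriving them in every file. The class and method names here are assumptions, not vLLM's real API:

```python
# Hypothetical single source of truth for AITER feature flags; each call
# site queries the helper rather than recomputing platform + env checks.
class RocmAiterOps:
    def __init__(self, is_rocm: bool, use_aiter: bool, fused_rope_flag: bool):
        self._is_rocm = is_rocm
        self._use_aiter = use_aiter
        self._fused_rope_flag = fused_rope_flag

    def is_enabled(self) -> bool:
        return self._is_rocm and self._use_aiter

    def is_fused_rope_zeros_kv_cache_enabled(self) -> bool:
        # One place encodes "ROCm + AITER + env flag", so callers cannot drift.
        return self.is_enabled() and self._fused_rope_flag
```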
```python
)
from vllm.v1.attention.backends.rocm_aiter_fa import AiterFlashAttentionImpl

if VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE and isinstance(
```
AITER flag management is done in `_aiter_ops.py`. Please move all the flags there and use `rocm_aiter_ops.is_xxx_enabled()` along with new flags defined there.
This pull request has merge conflicts that must be resolved before it can be merged.
```python
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: bool = False
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16: bool = False
VLLM_ROCM_FP8_MFMA_PAGE_ATTN: bool = False
VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE: bool = True
```
I see that this is enabled by default. Does this apply to all models? If it can be applied to all models, do we see a general improvement? If so, maybe we don't need a flag to manage it; we could simply use the fusion op whenever AITER is enabled.
```python
logger = init_logger(__name__)

if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
```
AITER flag management is done in `_aiter_ops.py`. Please move all the flags there and use `rocm_aiter_ops.is_enabled()` along with new flags defined there.
```python
else {},
rotary_emb=(
    self.rotary_emb
    if VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
```
```python
k = k_by_head.view(k.shape)
q, k = self.rotary_emb(positions, q, k)
attn_output = self.attn(q, k, v)
if VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE:
```
```python
logger = init_logger(__name__)

if current_platform.is_rocm() and envs.VLLM_ROCM_USE_AITER:
    VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE = (
```
```python
else {},
rotary_emb=(
    self.rotary_emb
    if VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE
```
```python
from vllm.triton_utils import tl, triton

if envs.VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE:
```
```python
k = k_by_head.view(k.shape)
q, k = self.rotary_emb(positions, q, k)
attn_output = self.attn(q, k, v)
if VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE:
```
@mjkvaak-amd I think it is possible to turn this into a fusion pass that merges rotary embedding with attention ops, similar to what happens in `fusion_attn.py`. Using a fusion pass also avoids adding a new AITER flag. Regarding fusion passes, let's get @ProExpertProg's feedback. @ProExpertProg Do you think a fusion pass is a better way to enable this feature?
Yes, we should do this fusion using a fusion pass. @ElizaWszola has a PR that will be merged soon that separates the cache op from `unified_attention`. To start, this PR can just integrate the fused kernels, if you want. Then you (or someone else) can work on a pass in a follow-up. Or you can work on the whole thing now, whatever you prefer. The first option is probably easier (kernels in this PR, pass in the next).
Apologies, I didn't realize there were ongoing efforts to refactor the cache op out of unified_attention. I agree that the fusion pass is a more suitable approach moving forward for several reasons, such as being globally available to models without requiring changes to individual model blueprints (this PR only addressed Qwen3) and its ability to work cross-platform (not just on AMD). At the moment, I have limited bandwidth to rework this PR to include only the fusion kernels. IMHO, it might be best to simply close this one and start fresh. Feel free to close it, and I'll ping you with a new PR when I find more time—unless someone else beats me to it. |
Hi @mjkvaak-amd, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hook will run pre-commit automatically.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Done in #33443
Purpose
This PR adds support for fusing QK RoPE, zeros, and reshape_and_cache on ROCm, and wires the fusion into the Qwen3 and Qwen3-MoE models. The fused kernel, _fused_qk_rope_reshape_and_cache_kernel, slightly improves model speed without affecting output quality.
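For readers unfamiliar with the fusion, this is a plain-Python sketch (not the actual Triton kernel) of the two steps the fused kernel combines: apply RoPE to a Q/K head vector, then write the rotated K into its KV-cache slot, in one pass. All names and shapes here are illustrative:

```python
import math

# Rotate interleaved (even, odd) pairs of a head vector by position-dependent
# angles; this is the standard RoPE formulation the fused kernel applies.
def rope(vec, pos, theta=10000.0):
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        ang = pos * theta ** (-i / d)
        c, s = math.cos(ang), math.sin(ang)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

# One logical pass: rotate Q and K, then cache the rotated K (the
# reshape_and_cache step), instead of launching separate kernels.
def fused_rope_and_cache(q, k, pos, kv_cache, slot):
    q_rot, k_rot = rope(q, pos), rope(k, pos)
    kv_cache[slot] = k_rot
    return q_rot, k_rot
```

At position 0 the rotation is the identity, and rotations preserve the per-pair norm, which makes the sketch easy to sanity-check.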
Test Plan
Please advise on the test plan; it's not entirely clear from the existing tests whether/how this feature should be tested. Maybe something similar to test_silu_mul_quant_fusion.py?

Test Result
See "Test Plan"
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.