
[ROCm] Change default settings for ROCm #33271

Closed
gshtras wants to merge 4 commits into vllm-project:main from ROCm:rocm_defaults

Conversation

@gshtras
Collaborator

@gshtras gshtras commented Jan 28, 2026

Enable AITER by default on platforms where it is supported (gfx9x)
Disable AITER MHA
Switch the default attention backend for ROCm to ROCM_ATTN as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted
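For reference, the defaults this PR changes can already be reproduced (or overridden) by hand via environment variables. A hedged sketch, assuming the env var names in vllm/envs.py (VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MHA, VLLM_ATTENTION_BACKEND) behave as described; verify against your vLLM version:

```shell
# Sketch: reproduce the proposed ROCm defaults manually.
# Env var names assume vLLM's vllm/envs.py; check your version.
export VLLM_ROCM_USE_AITER=1          # enable AITER kernels (gfx9x only)
export VLLM_ROCM_USE_AITER_MHA=0      # keep AITER MHA disabled
export VLLM_ATTENTION_BACKEND=ROCM_ATTN
# vllm serve <model>                  # then start the server as usual
```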

… Disable AITER MHA. Switch the default attention backend to ROCM_ATTN

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Jan 28, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 28, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the default settings for ROCm to enhance performance. Key changes include enabling AITER by default on supported platforms (gfx9x), disabling AITER MHA, and setting ROCM_ATTN as the default attention backend. The code changes align with the stated objectives. I've identified one minor issue regarding an outdated comment that should be corrected for code clarity.

vllm_config = get_current_vllm_config_or_none()
if (
    vllm_config is not None
    and vllm_config.attention_config.use_prefill_decode_attention
Contributor

can we nuke this flag/section?

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Collaborator

@ProExpertProg ProExpertProg left a comment

Should we do a deprecation for use_prefill_decode_attention? https://docs.vllm.ai/en/latest/contributing/deprecation_policy/

@tjtanaa
Collaborator

tjtanaa commented Jan 30, 2026

Switch the default attention backend for ROCm to ROCM_ATTN as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted

is #28497 consistently better than ROCM_ATTN?
Another thing: when ROCM_ATTN does not dispatch to the custom paged attention, it is much slower than TRITON_ATTN; that's what I observed on Qwen3 MoE models. So I think we can optimize in a future PR so that ROCM_ATTN matches TRITON_ATTN whenever the custom paged attention cannot be dispatched.

And there is something weird that I haven't had time to look at closely: in the vLLM omni diffusion model, enabling all AITER functions is slower than enabling only AITER Attention.

@gshtras
Collaborator Author

gshtras commented Jan 30, 2026

Should we do a deprecation for use_prefill_decode_attention? https://docs.vllm.ai/en/latest/contributing/deprecation_policy/

This var is a new addition that replaced the old, rapidly deprecated env var. I don't think many users, if any at all, have switched their workflows to use it instead of selecting the backend directly. But for the sake of completeness we could follow the deprecation policy.
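If we do follow the deprecation policy, a minimal sketch of what the shim could look like (function name and wording are hypothetical, not the actual vLLM code):

```python
import warnings


def warn_use_prefill_decode_attention(enabled: bool) -> None:
    """Hypothetical deprecation shim for use_prefill_decode_attention."""
    if enabled:
        warnings.warn(
            "use_prefill_decode_attention is deprecated; select the attention "
            "backend directly (e.g. VLLM_ATTENTION_BACKEND=ROCM_ATTN) instead.",
            DeprecationWarning,
            stacklevel=2,
        )
```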

@gshtras
Collaborator Author

gshtras commented Jan 30, 2026

Switch the default attention backend for ROCm to ROCM_ATTN as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted

is #28497 consistently better than ROCM_ATTN? Another thing: when ROCM_ATTN does not dispatch to the custom paged attention, it is much slower than TRITON_ATTN; that's what I observed on Qwen3 MoE models. So I think we can optimize in a future PR so that ROCM_ATTN matches TRITON_ATTN whenever the custom paged attention cannot be dispatched.

And there is something weird that I haven't had time to look at closely: in the vLLM omni diffusion model, enabling all AITER functions is slower than enabling only AITER Attention.

#28497 is significantly faster than TRITON_ATTN. It is often no slower than ROCM_ATTN, although in cases where ROCM_ATTN dispatches to custom_paged_attn, ROCM_ATTN is usually faster.
Switching the default from TRITON to ROCM would make the average use case much faster, even if Qwen3-MoE becomes slower; for that model the old backend can still be explicitly selected.

It's worth mentioning that in every official AMD Docker release prior to 0.14, published in rocm/vllm, ROCM_ATTN was the default backend.

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@tjtanaa
Collaborator

tjtanaa commented Jan 31, 2026

@gshtras I just remembered something: the AITER fused MoE and the upcoming preshuffled GEMM kernels (#29981 and #28837) do not support many shapes by default, so vllm serve will very likely crash and require tuning. The heuristics of those kernels cannot cover all unseen shapes.
For example, the AITER fused MoE does not support the shapes of the Qwen3-235B-BF16 model at TP4. This issue still persists: #22245. The approach proposed in the closed issue is to set AITER_ONLINE_TUNE=1 (a flag from AITER; it applies to the fused MoE kernel only, not to GEMM). #22245 (comment)
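For readers hitting that crash, the workaround from the linked issue can be sketched as follows (an assumption based on the discussion: AITER_ONLINE_TUNE is read by AITER's fused-MoE path only):

```shell
# Workaround sketch from #22245: let AITER tune unseen fused-MoE shapes online.
# AITER_ONLINE_TUNE is an AITER flag (fused MoE only, not GEMM).
export AITER_ONLINE_TUNE=1
# vllm serve <model> --tensor-parallel-size 4   # then launch as usual
```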

@gshtras
Collaborator Author

gshtras commented Feb 2, 2026

@gshtras I just remembered something: the AITER fused MoE and the upcoming preshuffled GEMM kernels (#29981 and #28837) do not support many shapes by default, so vllm serve will very likely crash and require tuning. The heuristics of those kernels cannot cover all unseen shapes. For example, the AITER fused MoE does not support the shapes of the Qwen3-235B-BF16 model at TP4. This issue still persists: #22245. The approach proposed in the closed issue is to set AITER_ONLINE_TUNE=1 (a flag from AITER; it applies to the fused MoE kernel only, not to GEMM). #22245 (comment)

What's the impact of ONLINE_TUNE on performance? It sounds like VLLM_ROCM_USE_AITER_MOE can't be turned on by default.

@tjtanaa
Collaborator

tjtanaa commented Feb 4, 2026

ONLINE_TUNE will pick MoE kernels that affect accuracy for some models.

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Feb 19, 2026

Full CI run with this PR: https://buildkite.com/vllm/amd-ci/builds/4997/steps
Nightly run: https://buildkite.com/vllm/amd-ci/builds/4972/steps/canvas

For NotImplementedError: Encoder self-attention is not implemented for RocmAttentionImpl there are two possible solutions:

  1. Configure rocm.py to dispatch encoder-only self-attention model architectures to TRITON_ATTN or ROCM_AITER_FA (if AITER is enabled).
  2. Implement an encoder-only self-attention kernel for ROCM_ATTN.
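Option 1 could look roughly like this; a sketch only, where pick_backend and the attn_type strings are illustrative rather than vLLM's actual selection API:

```python
def pick_backend(attn_type: str, aiter_enabled: bool) -> str:
    """Route encoder-only self-attention away from ROCM_ATTN, which lacks it."""
    if attn_type == "encoder_only":
        # ROCM_ATTN raises NotImplementedError for encoder self-attention,
        # so fall back to a backend that supports it.
        return "ROCM_AITER_FA" if aiter_enabled else "TRITON_ATTN"
    return "ROCM_ATTN"
```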

EDIT: Motivation for this was originally observed in the results of the Entrypoints Integration Test (API Server 1) group while running:

  • entrypoints/openai/test_run_batch.py::test_empty_file
  • entrypoints/openai/test_run_batch.py::test_embeddings
  • entrypoints/openai/test_run_batch.py::test_score

This happens in Entrypoints Integration Test (API Server 2) as well (entrypoints/instrumentator/test_optional_middleware.py::test_no_api_token).

At the same time there seem to be some accuracy issues here:
entrypoints/pooling/score/test_online_score_vision.py under Entrypoints Integration Test (Pooling)

Also, the Regression Test looks like it hit some seg fault while running test_regression.py::test_model_from_modelscope. This might be an infra issue, but it's worth investigating a bit further. This happens on both MI325 and MI355 (https://buildkite.com/vllm/amd-ci/builds/4997/steps/canvas?sid=019c7286-f853-4add-b0c3-9810f8eceb5c&tab=output#019c7286-f974-4e88-8b58-e6852f381267/L569).

In the Distributed Tests (both 4 GPU and 8 GPU) there are some weird errors, but they look related to AITER. I will rerun after I rebase and confirm whether anything there correlates with ROCM_ATTN.

For Engine Test, there is an assertion AssertionError: pass_config.fuse_norm_quant: expected True, got False. This, however, should be fixed at the test level, so it is not in the scope of this PR.

In the V1 Test e2e + engine, there is the issue of MLA auto-dispatching. Specifically, in v1/e2e/test_spec_decode.py::test_mtp_correctness[deepseek], accuracy is 0 (https://buildkite.com/vllm/amd-ci/builds/4997/steps/canvas?sid=019c7286-f854-424c-9635-5439dc40d7bf&tab=output#019c7286-f976-4cb7-8b37-2e090cb6fadc/L9507). I think an auto-dispatching mechanism that defaults MLA architectures to TRITON should be added in rocm.py.

V1 Test others can be patched at the test level. The failure is because TREE_ATTN is trying to compare the hierarchical steps with the default backend. I've already put up a PR to address it.

V1 Test attention (H100) should be fixed in this very PR, I think. What happens there is that test_rocm_attention_backends_selection.py::test_standard_attention_backend_selection verifies the backend selection process; since we are updating rocm.py, we should update this test file with the attention backend we expect to be selected.

Additionally, I think all failures in kernels/moe/test_routing.py::test_grouped_topk under Kernels MoE Test %N are correlated to this PR and should be investigated before merging. I recommend this because it may have to do with potential accuracy issues of the ROCM_ATTN backend (I might be wrong, but it's worth investigating).

Another interesting failure that should be resolved inside rocm.py comes from models/language/generation/test_common.py::test_models[True-False-5-32-stabilityai/stablelm-3b-4e1t] under Language Models Test (Extended Generation): ValueError: Head size 80 is not supported by RocmAttention. Supported head sizes are: [32, 64, 96, 128, 160, 192, 224, 256]. This means we must make sure that rocm.py defaults to TRITON_ATTN when the head size is incompatible with ROCM_ATTN. The same issue appears in models/quantization/test_gguf.py::test_models[1-5-32-bfloat16-model4] under Quantized Models Test.
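A minimal sketch of that head-size guard (constant and function names are hypothetical; the supported sizes are taken from the error message above):

```python
# Head sizes supported by ROCM_ATTN, per the ValueError reported in CI.
ROCM_ATTN_HEAD_SIZES = {32, 64, 96, 128, 160, 192, 224, 256}


def backend_for_head_size(head_size: int) -> str:
    """Fall back to TRITON_ATTN when ROCM_ATTN cannot handle the head size."""
    if head_size in ROCM_ATTN_HEAD_SIZES:
        return "ROCM_ATTN"
    return "TRITON_ATTN"
```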

In models/multimodal/generation/test_common.py::test_video_models[qwen3_vl-test_case20] under Multi-Modal Models Test (Standard), Qwen fails, and it is not the first time I've seen Qwen be slightly less accurate while using ROCM_ATTN. It would be interesting to run lm-eval on Qwen to compare TRITON_ATTN vs ROCM_ATTN. This claim is also backed up by the Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy test, which fails with near-zero accuracy.

LoRA TP Test (Distributed) is an interesting issue, though it's one of those that I want to see after a second CI run.

NixlConnector PD accuracy tests (Distributed) and DP EP NixlConnector PD accuracy tests (Distributed) are also regressions though I cannot see how they could be correlated to this PR.

The regression in LM Eval Large Models, I think, has to do with what actually happens in attention backend selection, since AssertionError: Aiter MLA only supports 16 or 128 number of heads is not an issue we currently have with rocm.py as it is. Indeed, I don't see why AITER_MLA was selected, since AITER is not enabled in vllm/.buildkite/lm-eval-harness/test_lm_eval_correctness.py.

@gshtras I can launch another run to see if some of these issues are resolved. Let me know :)

gshtras added a commit to ROCm/vllm that referenced this pull request Feb 20, 2026
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
gshtras added a commit to ROCm/vllm that referenced this pull request Feb 20, 2026
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
gshtras added a commit to ROCm/vllm that referenced this pull request Feb 23, 2026
@tjtanaa
Collaborator

tjtanaa commented Feb 26, 2026

@AndreasKaratzas @gshtras how about we split this into two PRs: 1) enable AITER by default, 2) change the default attention backend to ROCM_ATTN? Each of these changes has a huge impact on the AMD CI.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify

mergify bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @gshtras.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

needs-rebase, rocm (Related to AMD ROCm), v1

Projects

Status: Done


6 participants