
[ROCm] Change default settings for ROCm #33271

Closed
gshtras wants to merge 4 commits into vllm-project:main from ROCm:rocm_defaults

Conversation

@gshtras
Collaborator

@gshtras gshtras commented Jan 28, 2026

Enable AITER by default on platforms where it is supported (gfx9x)
Disable AITER MHA
Switch the default attention backend for ROCm to ROCM_ATTN as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted
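For reference, the defaults this PR changes can already be reproduced (or overridden) by hand via environment variables. A hedged sketch, assuming the env var names in vllm/envs.py (VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MHA, VLLM_ATTENTION_BACKEND) behave as described; verify against your vLLM version:

```shell
# Sketch: reproduce the proposed ROCm defaults manually.
# Env var names assume vLLM's vllm/envs.py; check your version.
export VLLM_ROCM_USE_AITER=1          # enable AITER kernels (gfx9x only)
export VLLM_ROCM_USE_AITER_MHA=0      # keep AITER MHA disabled
export VLLM_ATTENTION_BACKEND=ROCM_ATTN
# vllm serve <model>                  # then start the server as usual
```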

… Disable AITER MHA. Switch the default attention backend to ROCM_ATTN

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Jan 28, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 28, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the default settings for ROCm to enhance performance. Key changes include enabling AITER by default on supported platforms (gfx9x), disabling AITER MHA, and setting ROCM_ATTN as the default attention backend. The code changes align with the stated objectives. I've identified one minor issue regarding an outdated comment that should be corrected for code clarity.

vllm_config = get_current_vllm_config_or_none()
if (
    vllm_config is not None
    and vllm_config.attention_config.use_prefill_decode_attention
Contributor

can we nuke this flag/section?

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Collaborator

@ProExpertProg ProExpertProg left a comment

Should we do a deprecation for use_prefill_decode_attention? https://docs.vllm.ai/en/latest/contributing/deprecation_policy/

@tjtanaa
Collaborator

tjtanaa commented Jan 30, 2026

Switch the default attention backend for ROCm to ROCM_ATTN as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted

is #28497 consistently better than ROCM_ATTN?
Another thing: when ROCM_ATTN does not dispatch to the custom paged attention, it is much slower than TRITON_ATTN; that's what I observed on Qwen3 MoE models. So I think we can optimize in a future PR so that ROCM_ATTN matches TRITON_ATTN whenever the custom paged attention cannot be dispatched.

And there is something weird that I haven't had time to look at closely: in the vLLM omni diffusion model, enabling all AITER functions is slower than enabling only AITER Attention.

@gshtras
Collaborator Author

gshtras commented Jan 30, 2026

Should we do a deprecation for use_prefill_decode_attention? https://docs.vllm.ai/en/latest/contributing/deprecation_policy/

This var is a new addition that replaced the old, rapidly deprecated env var. I don't think many users, if any at all, have switched their workflows to use it instead of selecting the backend directly. But for the sake of completeness we could follow the deprecation policy.
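If we do follow the deprecation policy, a minimal sketch of what the shim could look like (function name and wording are hypothetical, not the actual vLLM code):

```python
import warnings


def warn_use_prefill_decode_attention(enabled: bool) -> None:
    """Hypothetical deprecation shim for use_prefill_decode_attention."""
    if enabled:
        warnings.warn(
            "use_prefill_decode_attention is deprecated; select the attention "
            "backend directly (e.g. VLLM_ATTENTION_BACKEND=ROCM_ATTN) instead.",
            DeprecationWarning,
            stacklevel=2,
        )
```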

@gshtras
Collaborator Author

gshtras commented Jan 30, 2026

Switch the default attention backend for ROCm to ROCM_ATTN as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted

is #28497 consistently better than ROCM_ATTN? Another thing: when ROCM_ATTN does not dispatch to the custom paged attention, it is much slower than TRITON_ATTN; that's what I observed on Qwen3 MoE models. So I think we can optimize in a future PR so that ROCM_ATTN matches TRITON_ATTN whenever the custom paged attention cannot be dispatched.

And there is something weird that I haven't had time to look at closely: in the vLLM omni diffusion model, enabling all AITER functions is slower than enabling only AITER Attention.

#28497 is significantly faster than TRITON_ATTN. It is often no slower than ROCM_ATTN, although in cases where ROCM_ATTN dispatches to custom_paged_attn, ROCM_ATTN is usually faster.
Switching the default from TRITON to ROCM would make the average use case much faster, even if Qwen3-MoE becomes slower; for that model the old backend can still be explicitly selected.

It's worth mentioning that in every official AMD Docker release prior to 0.14, published in rocm/vllm, ROCM_ATTN was the default backend.

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@tjtanaa
Collaborator

tjtanaa commented Jan 31, 2026

@gshtras I just remembered something: the AITER fused MoE and the upcoming preshuffled GEMM kernels (#29981 and #28837) do not support many shapes by default, so vllm serve will very likely crash and require tuning. The heuristics of those kernels cannot cover all unseen shapes.
For example, the AITER fused MoE does not support the shapes of the Qwen3-235B-BF16 model at TP4. This issue still persists: #22245. The approach proposed in the closed issue is to set AITER_ONLINE_TUNE=1 (a flag from AITER; it applies to the fused MoE kernel only, not to GEMM). #22245 (comment)
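For readers hitting that crash, the workaround from the linked issue can be sketched as follows (an assumption based on the discussion: AITER_ONLINE_TUNE is read by AITER's fused-MoE path only):

```shell
# Workaround sketch from #22245: let AITER tune unseen fused-MoE shapes online.
# AITER_ONLINE_TUNE is an AITER flag (fused MoE only, not GEMM).
export AITER_ONLINE_TUNE=1
# vllm serve <model> --tensor-parallel-size 4   # then launch as usual
```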

@gshtras
Collaborator Author

gshtras commented Feb 2, 2026

@gshtras I just remembered something: the AITER fused MoE and the upcoming preshuffled GEMM kernels (#29981 and #28837) do not support many shapes by default, so vllm serve will very likely crash and require tuning. The heuristics of those kernels cannot cover all unseen shapes. For example, the AITER fused MoE does not support the shapes of the Qwen3-235B-BF16 model at TP4. This issue still persists: #22245. The approach proposed in the closed issue is to set AITER_ONLINE_TUNE=1 (a flag from AITER; it applies to the fused MoE kernel only, not to GEMM). #22245 (comment)

What's the impact of ONLINE_TUNE on performance? It sounds like VLLM_ROCM_USE_AITER_MOE can't be turned on by default.

@tjtanaa
Collaborator

tjtanaa commented Feb 4, 2026

ONLINE_TUNE will pick MoE kernels that affect accuracy for some models.

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Feb 19, 2026

Full CI run with this PR: https://buildkite.com/vllm/amd-ci/builds/4997/steps
Nightly run: https://buildkite.com/vllm/amd-ci/builds/4972/steps/canvas

For NotImplementedError: Encoder self-attention is not implemented for RocmAttentionImpl there are two possible solutions:

  1. Configure rocm.py to dispatch encoder-only self-attention model architectures to TRITON_ATTN or ROCM_AITER_FA (if AITER is enabled).
  2. Implement an encoder-only self-attention kernel for ROCM_ATTN.
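Option 1 could look roughly like this; a sketch only, where pick_backend and the attn_type strings are illustrative rather than vLLM's actual selection API:

```python
def pick_backend(attn_type: str, aiter_enabled: bool) -> str:
    """Route encoder-only self-attention away from ROCM_ATTN, which lacks it."""
    if attn_type == "encoder_only":
        # ROCM_ATTN raises NotImplementedError for encoder self-attention,
        # so fall back to a backend that supports it.
        return "ROCM_AITER_FA" if aiter_enabled else "TRITON_ATTN"
    return "ROCM_ATTN"
```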

EDIT: Motivation for this was originally observed in the results of the Entrypoints Integration Test (API Server 1) group while running:

  • entrypoints/openai/test_run_batch.py::test_empty_file
  • entrypoints/openai/test_run_batch.py::test_embeddings
  • entrypoints/openai/test_run_batch.py::test_score

This happens in Entrypoints Integration Test (API Server 2) as well (entrypoints/instrumentator/test_optional_middleware.py::test_no_api_token).

At the same time there seem to be some accuracy issues here:
entrypoints/pooling/score/test_online_score_vision.py under Entrypoints Integration Test (Pooling)

Also, the Regression Test looks like it hit some seg fault while running test_regression.py::test_model_from_modelscope. This might be an infra issue, but it's worth investigating a bit further. This happens on both MI325 and MI355 (https://buildkite.com/vllm/amd-ci/builds/4997/steps/canvas?sid=019c7286-f853-4add-b0c3-9810f8eceb5c&tab=output#019c7286-f974-4e88-8b58-e6852f381267/L569).

In the Distributed Tests (both 4 GPU and 8 GPU) there are some weird errors, but they look related to AITER. I will rerun after I rebase and confirm whether anything there correlates with ROCM_ATTN.

For Engine Test, there is an assertion AssertionError: pass_config.fuse_norm_quant: expected True, got False. This, however, should be fixed at the test level, so it is not in the scope of this PR.

In the V1 Test e2e + engine, there is the issue of MLA auto-dispatching. Specifically, in v1/e2e/test_spec_decode.py::test_mtp_correctness[deepseek], accuracy is 0 (https://buildkite.com/vllm/amd-ci/builds/4997/steps/canvas?sid=019c7286-f854-424c-9635-5439dc40d7bf&tab=output#019c7286-f976-4cb7-8b37-2e090cb6fadc/L9507). I think an auto-dispatching mechanism that defaults MLA architectures to TRITON should be added in rocm.py.

V1 Test others can be patched at the test level. The failure is because TREE_ATTN is trying to compare the hierarchical steps with the default backend. I've already put up a PR to address it.

V1 Test attention (H100) should be fixed in this very PR, I think. What happens there is that test_rocm_attention_backends_selection.py::test_standard_attention_backend_selection verifies the backend selection process; since we are updating rocm.py, we should update this test file with the attention backend we expect to be selected.

Additionally, I think all failures in kernels/moe/test_routing.py::test_grouped_topk under Kernels MoE Test %N are correlated to this PR and should be investigated before merging. I recommend this because it may have to do with potential accuracy issues of the ROCM_ATTN backend (I might be wrong, but it's worth investigating).

Another interesting failure that should be resolved inside rocm.py comes from models/language/generation/test_common.py::test_models[True-False-5-32-stabilityai/stablelm-3b-4e1t] under Language Models Test (Extended Generation): ValueError: Head size 80 is not supported by RocmAttention. Supported head sizes are: [32, 64, 96, 128, 160, 192, 224, 256]. This means we must make sure that rocm.py defaults to TRITON_ATTN when the head size is incompatible with ROCM_ATTN. The same issue appears in models/quantization/test_gguf.py::test_models[1-5-32-bfloat16-model4] under Quantized Models Test.
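A minimal sketch of that head-size guard (constant and function names are hypothetical; the supported sizes are taken from the error message above):

```python
# Head sizes supported by ROCM_ATTN, per the ValueError reported in CI.
ROCM_ATTN_HEAD_SIZES = {32, 64, 96, 128, 160, 192, 224, 256}


def backend_for_head_size(head_size: int) -> str:
    """Fall back to TRITON_ATTN when ROCM_ATTN cannot handle the head size."""
    if head_size in ROCM_ATTN_HEAD_SIZES:
        return "ROCM_ATTN"
    return "TRITON_ATTN"
```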

In models/multimodal/generation/test_common.py::test_video_models[qwen3_vl-test_case20] under Multi-Modal Models Test (Standard), Qwen fails, and it is not the first time I've seen Qwen be slightly less accurate while using ROCM_ATTN. It would be interesting to run lm-eval on Qwen to compare TRITON_ATTN vs ROCM_ATTN. This claim is also backed up by the Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy test, which fails with near-zero accuracy.

LoRA TP Test (Distributed) is an interesting issue, though it's one of those that I want to see after a second CI run.

NixlConnector PD accuracy tests (Distributed) and DP EP NixlConnector PD accuracy tests (Distributed) are also regressions though I cannot see how they could be correlated to this PR.

The regression in LM Eval Large Models, I think, has to do with what actually happens in attention backend selection, since AssertionError: Aiter MLA only supports 16 or 128 number of heads is not an issue we currently have with rocm.py as it is. Indeed, I don't see why AITER_MLA was selected, since AITER is not enabled in vllm/.buildkite/lm-eval-harness/test_lm_eval_correctness.py.

@gshtras I can launch another run to see if some of these issues are resolved. Let me know :)

gshtras added a commit to ROCm/vllm that referenced this pull request Feb 20, 2026
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
gshtras added a commit to ROCm/vllm that referenced this pull request Feb 20, 2026
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
gshtras added a commit to ROCm/vllm that referenced this pull request Feb 23, 2026
@tjtanaa
Collaborator

tjtanaa commented Feb 26, 2026

@AndreasKaratzas @gshtras how about we split this into two PRs: 1) enable AITER by default, 2) change the default attention backend to ROCM_ATTN? Each of these changes has a huge impact on the AMD CI.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify

mergify bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @gshtras.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

needs-rebase, rocm (Related to AMD ROCm), v1

Projects

Status: Done


6 participants