[ROCm] Change default settings for ROCm #33271
gshtras wants to merge 4 commits into vllm-project:main
Conversation
… Disable AITER MHA. Switch the default attention backend to ROCM_ATTN. Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Code Review
This pull request updates the default settings for ROCm to improve performance. Key changes include enabling AITER by default on supported platforms (gfx9x), disabling AITER MHA, and setting ROCM_ATTN as the default attention backend. The code changes align with the stated objectives. I've identified one minor issue regarding an outdated comment that should be corrected for code clarity.
vllm_config = get_current_vllm_config_or_none()
if (
    vllm_config is not None
    and vllm_config.attention_config.use_prefill_decode_attention
can we nuke this flag/section?
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
ProExpertProg left a comment
Should we do a deprecation for use_prefill_decode_attention? https://docs.vllm.ai/en/latest/contributing/deprecation_policy/
Is #28497 consistently better than …

And there is something weird; I haven't had time to look closely. In the vLLM omni diffusion model, when we enable all AITER functions, its speed is slower than enabling only AITER Attention.
This var is a new addition that replaced the old, rapidly deprecated env var. I don't think many users, if any at all, switched their workflows to use it instead of selecting the backend directly. But for the sake of completeness we could follow the deprecation policy.
#28497 is significantly faster than TRITON_ATTN. It is often no slower than ROCM_ATTN, although in the cases where ROCM_ATTN dispatches to custom_paged_attn, ROCM_ATTN is usually the faster of the two. It's worth mentioning that in every official AMD docker release prior to 0.14 published in rocm/vllm, ROCM_ATTN was the default backend.
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras I just remembered something: the AITER Fused MoE and the upcoming preshuffled GEMM kernels (#29981 and #28837) by default do not support that many shapes, so vllm serve will very likely crash and require tuning. The heuristics of those kernels cannot cover all unseen shapes.
What's the impact of ONLINE_TUNE on performance? It sounds like VLLM_ROCM_USE_AITER_MOE can't be turned on by default.
Full CI run with this PR: https://buildkite.com/vllm/amd-ci/builds/4997/steps. For …

EDIT: Motivation for this was originally observed in the results of the …

This happens in …. At the same time there seem to be some accuracy issues here: … Also, the … In the Distributed Tests (both the 4 GPU and 8 GPU) there are some weird errors, but it looks like they have to do with AITER. I will rerun things after I rebase and confirm whether anything there is correlated with ROCM_ATTN. For … In the …

Additionally, I think that all failures in … Another interesting failure that should be resolved inside the … In …

The regression in … @gshtras I can launch another run to see if some of these issues are resolved. Let me know :)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@AndreasKaratzas @gshtras how about we split this into two PRs: 1) enable AITER by default, 2) change the default attention backend to ROCM_ATTN? Each of these changes has a huge impact on the AMD CI.
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Enable AITER by default on platforms where it is supported (gfx9x)
Disable AITER MHA
Switch the default attention backend for ROCm to ROCM_ATTN, as it consistently shows better performance than TRITON_ATTN, at least until #28497 is accepted
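For anyone affected by the new default, the backend can still be pinned explicitly per run via `VLLM_ATTENTION_BACKEND`, vLLM's existing backend selector. A sketch; the serve invocation is a placeholder and the model name is elided:

```shell
# Pin the attention backend explicitly instead of relying on the new default.
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
echo "$VLLM_ATTENTION_BACKEND"
# vllm serve <model>   # placeholder invocation, model name elided
```

The same variable accepts the other backend names discussed in this thread (e.g. ROCM_ATTN).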