[Bugfix] [ROCm] [UX] Reorganize ROCm Backend Selection Logic #26980
tjtanaa merged 11 commits into vllm-project:main from
Conversation
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Code Review
This pull request significantly improves the ROCm attention backend selection logic by reorganizing it for clarity and correctness. Prioritizing explicit user selection over environment variable-based auto-selection is a great change that enhances predictability. The addition of a comprehensive suite of unit tests is also excellent, ensuring the new logic is robust and well-verified. I have one suggestion to improve the clarity and correctness of a mock in the new test file.
```python
# Mock is_aiter_mla_enabled based on env vars and block_size
aiter_enabled = env_vars.get("VLLM_ROCM_USE_AITER") == "1" and block_size == 1
with patch(
    "vllm.v1.attention.backends.mla.rocm_aiter_mla.is_aiter_mla_enabled",
    return_value=aiter_enabled,
):
```
The mock for is_aiter_mla_enabled is a bit confusing as its return value depends on block_size, which is an input to the function under test (get_attn_backend_cls), not to is_aiter_mla_enabled itself. This couples the test's mock behavior too tightly with the implementation details of the function being tested, making it harder to understand and maintain.
The purpose of is_aiter_mla_enabled is to check environment variables, while the block_size check happens within get_attn_backend_cls. The test should reflect this separation of concerns.
You can simplify the mock to only depend on the environment variables, which will make the test clearer and more robust. The current mock makes the condition is_aiter_mla_enabled() and block_size == 1 effectively become (env_vars.get("VLLM_ROCM_USE_AITER") == "1" and block_size == 1) and block_size == 1, which is redundant. The suggested change correctly tests the logic by mocking is_aiter_mla_enabled based only on its own dependencies (the environment variables).
```diff
-# Mock is_aiter_mla_enabled based on env vars and block_size
-aiter_enabled = env_vars.get("VLLM_ROCM_USE_AITER") == "1" and block_size == 1
+# Mock is_aiter_mla_enabled based on env vars
+aiter_enabled = env_vars.get("VLLM_ROCM_USE_AITER") == "1"
 with patch(
     "vllm.v1.attention.backends.mla.rocm_aiter_mla.is_aiter_mla_enabled",
     return_value=aiter_enabled,
 ):
```
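To make the reviewer's point concrete, here is a hedged, self-contained sketch of the pattern: the mock for the env-var helper depends only on the env vars, while the `block_size` check stays in the selector under test. `select_backend` and `run_case` are stand-ins for illustration, not vLLM's actual implementation.

```python
# Sketch of mocking an env-var helper based only on its own inputs.
# select_backend mirrors the structure of the code under test.
from unittest.mock import patch


def is_aiter_mla_enabled() -> bool:
    # The real helper reads VLLM_ROCM_USE_AITER; stubbed for the sketch.
    return False


def select_backend(block_size: int) -> str:
    # block_size is checked here, not inside is_aiter_mla_enabled().
    if is_aiter_mla_enabled() and block_size == 1:
        return "ROCM_AITER_MLA"
    return "TRITON_MLA"


def run_case(env_vars: dict, block_size: int) -> str:
    # Mock based only on the helper's own dependency: the env vars.
    aiter_enabled = env_vars.get("VLLM_ROCM_USE_AITER") == "1"
    with patch(f"{__name__}.is_aiter_mla_enabled", return_value=aiter_enabled):
        return select_backend(block_size)
```

This keeps the mock decoupled from `block_size`, so the redundant `... and block_size == 1` condition disappears from the test setup.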
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
vllm/platforms/rocm.py
Outdated
```python
)
# When AITER is enabled and block_size is 1, use AITER MLA
# Otherwise, use TRITON MLA
if is_aiter_mla_enabled() and block_size == 1:
```
`is_aiter_mla_enabled()` has been changed to `rocm_aiter_ops.is_mla_enabled()` from `_aiter_ops.py`.
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Hi @tjtanaa, I have updated the code and pushed it, and re-ran the commands from the Appendix successfully. When you have time, please take a look. Thanks!
vllm/platforms/rocm.py
Outdated
```python
# When AITER is enabled and block_size is 1, use AITER MLA
# Otherwise, use TRITON MLA
if rocm_aiter_ops.is_mla_enabled() and block_size == 1:
    selected_backend = AttentionBackendEnum.ROCM_AITER_MLA
```
`ROCM_AITER_MLA` now supports block sizes larger than 1.
Okay, thanks for the suggestions. I will fix them one by one.
vllm/platforms/rocm.py
Outdated
```python
if rocm_aiter_ops.is_mla_enabled() and block_size == 1:
    selected_backend = AttentionBackendEnum.ROCM_AITER_MLA
else:
    selected_backend = AttentionBackendEnum.TRITON_MLA
```
`TRITON_MLA` must have a block size of at least 16; it does not support block size 1. A guard statement needs to be added.
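A hedged sketch of the guard being requested: since TRITON_MLA needs a block size of at least 16, fail fast instead of silently selecting it. This is illustrative only; the real check belongs in `vllm/platforms/rocm.py`.

```python
# Illustrative guard for the TRITON_MLA block-size constraint.
def pick_mla_backend(aiter_mla_enabled: bool, block_size: int) -> str:
    if aiter_mla_enabled:
        return "ROCM_AITER_MLA"
    if block_size < 16:
        # TRITON_MLA cannot handle small block sizes, so raise early
        # instead of selecting a backend that will fail later.
        raise ValueError(
            f"TRITON_MLA requires a block size of at least 16, "
            f"got {block_size}"
        )
    return "TRITON_MLA"
```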
vllm/platforms/rocm.py
Outdated
```python
    def opaque_attention_op(cls) -> bool:
        return True

    @classmethod
```
Why is there this line diff? I remember this has already been removed.
```python
    dtype=torch.float16,
    kv_cache_dtype="auto",
    block_size=16,
    use_v1=True,
```

Fix this: there is no `use_v1` argument anymore.

```python
    dtype=torch.float16,
    kv_cache_dtype="auto",
    block_size=16,
    use_v1=False,
```

Fix this: there is no `use_v1` argument anymore.

```python
    dtype=torch.float16,
    kv_cache_dtype="auto",
    block_size=16,
    use_v1=False,
```

Fix this: there is no `use_v1` argument anymore.

```python
    dtype=torch.float16,
    kv_cache_dtype="auto",
    block_size=16,
    use_v1=True,
```

Fix this: there is no `use_v1` argument anymore.

```python
    dtype=torch.float16,
    kv_cache_dtype="auto",
    block_size=block_size,
    use_v1=True,
```

Fix this: there is no `use_v1` argument anymore.

```python
    dtype=torch.float16,
    kv_cache_dtype="auto",
    block_size=16,
    use_v1=True,
```

Fix this: there is no `use_v1` argument anymore.
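For illustration, a hedged sketch of the call-site fix: since the `use_v1` keyword no longer exists, passing it raises a `TypeError`, and the fix is simply to drop it. The signature below is an assumption made for the sketch, not vLLM's actual API.

```python
# Toy stand-in for the backend selector; the real signature differs.
from typing import Optional


def get_attn_backend_cls(
    selected_backend: Optional[str],
    head_size: int,
    dtype: str,
    kv_cache_dtype: str,
    block_size: int,
) -> str:
    # Toy resolution: an explicit selection wins, else the default.
    return selected_backend or "TRITON_ATTN"


# Fixed call site: omit the removed `use_v1=...` keyword entirely.
backend = get_attn_backend_cls(None, 128, "float16", "auto", 16)
```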
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/platforms/rocm.py

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Hi @tjtanaa, I have updated the code according to your suggestions, and the pytest run and the different commands all completed successfully. When you have time, please take a look. Thanks!
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Hi @tjtanaa, I've revised the code based on the latest main branch, and all commands have been executed successfully. Please review the code when you have time. Thanks!
```python
(
    {},
    None,
    "vllm.v1.attention.backends.triton_attn.TritonAttentionBackend",
```
Can you replace this with notation like `AttentionBackendEnum.ROCM_AITER_FA.get_path()`?
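A minimal illustration of the suggested notation: comparing against `AttentionBackendEnum.<NAME>.get_path()` instead of a hard-coded class-path string. The `TRITON_ATTN` path below is the string from the hunk above; the enum itself is a toy sketch, since vLLM's real `AttentionBackendEnum` is richer.

```python
# Toy enum whose get_path() returns the backend class path, so tests
# can reference AttentionBackendEnum.TRITON_ATTN.get_path() instead of
# repeating the full dotted string in every parametrized case.
from enum import Enum


class AttentionBackendEnum(Enum):
    TRITON_ATTN = "vllm.v1.attention.backends.triton_attn.TritonAttentionBackend"

    def get_path(self) -> str:
        # Here the enum value is the path; vLLM resolves it dynamically.
        return self.value
```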
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@vllmellm Why was …
…oject#26980) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Purpose

This is to simplify the user experience so that users do not need to keep track of environment variables to select the backend. Users can directly specify the backend through `VLLM_ATTENTION_BACKEND`.

Fix Attention Backend Configuration on ROCm

This PR fixes several issues with attention backend configuration on AMD ROCm platforms to ensure proper backend selection based on environment variables and flags.
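The core precedence rule of this fix can be sketched as follows. This is a hedged, illustrative helper, not vLLM's actual code: an explicit `VLLM_ATTENTION_BACKEND` always beats `VLLM_ROCM_USE_AITER` auto-selection, and the auto-selected backend name here is an assumption for the sketch.

```python
# Illustrative precedence: explicit user selection wins over
# env-var-based AITER auto-selection.
def resolve_rocm_backend(env: dict) -> str:
    explicit = env.get("VLLM_ATTENTION_BACKEND")
    if explicit:
        return explicit  # user choice always takes priority
    if env.get("VLLM_ROCM_USE_AITER") == "1":
        # AITER MHA is on by default unless explicitly disabled.
        if env.get("VLLM_ROCM_USE_AITER_MHA", "1") == "1":
            return "ROCM_AITER_FA"
    return "TRITON_ATTN"  # ROCm default
```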
Issues Fixed

1. Missing `ROCM_AITER_FA` in the V1 oracle list
   - Added `ROCM_AITER_FA` to the V1 oracle list in `vllm/engine/arg_utils.py`, allowing the AITER Flash Attention backend to be used without enabling all AITER kernels.
   - Example: `VLLM_ATTENTION_BACKEND="ROCM_AITER_FA" vllm serve ...`
2. Incorrect backend selection with `VLLM_ROCM_USE_AITER=1`
   - A user-selected `VLLM_ATTENTION_BACKEND` now takes precedence when `VLLM_ROCM_USE_AITER=1` is set; previously `VLLM_ROCM_USE_AITER=1` would override explicit backend choices.
   - `VLLM_ROCM_USE_AITER=1 VLLM_ATTENTION_BACKEND="TRITON_ATTN"` → uses Triton attention
   - `VLLM_ROCM_USE_AITER=1 VLLM_ATTENTION_BACKEND="ROCM_ATTN"` → uses ROCm chunked prefill/paged decode
   - `VLLM_ROCM_USE_AITER=1 VLLM_ATTENTION_BACKEND="ROCM_AITER_UNIFIED_ATTN"` → uses AITER unified attention
3. Improved flag combinations
   - Fixed the `VLLM_ROCM_USE_AITER_MHA` flag to properly disable AITER MHA when set to 0.
   - Fixed the interaction between `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION` and backend selection.

Supported Attention Backends on ROCm

After this fix, the following backends are properly supported:

- `TRITON_ATTN`: vLLM's Triton unified attention (default)
- `ROCM_ATTN`: chunked prefill (Triton) + paged decode (HIP)
- `ROCM_AITER_FA`: AITER Flash Attention
- `ROCM_AITER_UNIFIED_ATTN`: AITER unified attention

MLA Backend Support

- `TRITON_MLA`: Triton MLA backend for DeepSeek models (requires block-size >= 16)
- `ROCM_AITER_MLA`: AITER MLA backend (now supports the default block-size, as well as block-size 1)

Test Plan
All documented configuration combinations (see the Appendix) now work correctly:

- Explicit `VLLM_ATTENTION_BACKEND` selection
- `VLLM_ROCM_USE_AITER` flag combinations

Added unit tests in `tests/v1/attention/test_rocm_attention_backends_selection.py`.

Test Result

- All commands validated to run the correct backend.
- `pytest tests/v1/attention/test_rocm_attention_backends_selection.py`: all tests passed.

Appendix (this bugfix PR allows all of the following commands to work correctly)
Attention Backend on ROCm

The attention backend can be set in various ways:

- Explicitly, through `VLLM_ATTENTION_BACKEND`
- Implicitly, through the `VLLM_ROCM_` flags

Attention

On AMD ROCm the options are `TRITON_ATTN`, `ROCM_ATTN`, `FLASH_ATTN`, or `ROCM_AITER_UNIFIED_ATTN`:

- `TRITON_ATTN`
- `ROCM_ATTN`
- `ROCM_AITER_FA`
- `ROCM_AITER_UNIFIED_ATTN`

MLA Backend

On AMD ROCm, there are `TRITON_MLA` and `ROCM_AITER_MLA`:

- `TRITON_MLA`
- `ROCM_AITER_MLA`:
  - `VLLM_ATTENTION_BACKEND="ROCM_AITER_MLA" vllm serve deepseek-ai/DeepSeek-R1 -tp 8`
  - `VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 -tp 8 --block-size 1`
  - `VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 -tp 8`

Essential Elements of an Effective PR Description Checklist

- (Optional) Update `supported_models.md` and `examples` for a new model.