[ROCm] Add AITER RoPE + KV cache fusion for MLA prefill and decode #2
khairulkabir1661 wants to merge 1287 commits into main
Conversation
The rope_applied parameter was being passed through forward_impl and custom ops but never actually used in any logic. With AITER fusion always enabled, RoPE application is now determined solely by use_aiter_fused flag. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
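The resulting control flow can be sketched as a single-flag dispatch (hypothetical names and simplified shapes, not the actual vLLM code): when the fused path is taken, RoPE is deferred to the AITER kernel; otherwise it is applied explicitly.

```python
import torch

def mla_rope_dispatch(q_pe: torch.Tensor, cos: torch.Tensor,
                      sin: torch.Tensor, use_aiter_fused: bool) -> torch.Tensor:
    """With the unused `rope_applied` flag gone, `use_aiter_fused` alone
    decides where RoPE is applied."""
    if use_aiter_fused:
        # The fused AITER kernel applies RoPE together with the KV cache
        # write, so the tensor passes through untouched here.
        return q_pe
    # Unfused fallback: apply a neox-style half rotation explicitly.
    x1, x2 = q_pe.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
```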
Restore three upstream changes in MLACommonImpl that were accidentally removed in the initial AITER commits:
1. Add back `logger.info_once` for backend selection (TRT-LLM, FlashInfer, CUDNN, FlashAttention), helpful for debugging
2. Restore FA4 support in the `_pad_v` logic: FA4 natively handles different head dimensions, like FA3 on Hopper
3. Restore the `params_dtype` fallback for AWQ/GPTQ quantized models (PR vllm-project#34695): quantized layers may lack a `.weight` attribute

These changes are in MLACommonImpl (the shared backend selector) and are unrelated to the AITER fused kernel functionality, which lives in the MLAAttention class. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
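The restored `params_dtype` fallback guards against quantized layers whose weights live in packed integer buffers rather than a plain `.weight` tensor. A minimal sketch of that pattern (hypothetical helper name, not vLLM's actual code):

```python
import torch

def infer_params_dtype(layer: torch.nn.Module,
                       default: torch.dtype = torch.bfloat16) -> torch.dtype:
    # AWQ/GPTQ-quantized linears may not expose a floating-point `.weight`
    # tensor, so fall back to a configured default instead of raising on
    # attribute access.
    weight = getattr(layer, "weight", None)
    if isinstance(weight, torch.Tensor) and weight.dtype.is_floating_point:
        return weight.dtype
    return default
```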
Restore four upstream changes in the backend and metadata builder sections that were accidentally removed in the initial AITER commits:
1. XPU support: restore the Intel XPU flash attention ops import
2. KV cache stride order: restore the identity permutation (0, 1, 2, 3) for contiguous per-layer views; MLA kernels require this layout
3. Blackwell comment: remove an outdated comment (not in current origin/main)
4. FP8 prefill logging: restore helpful logs for FP8 prefill status

These changes are in shared infrastructure (MLACommonBackend, MLACommonMetadataBuilder) and are unrelated to the AITER fused kernel functionality. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
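Why the identity stride order matters: the MLA kernels flatten the cache into a contiguous per-layer view, and that flat view only addresses the right elements when the stride order is the identity permutation. A small illustration (shapes are illustrative, not vLLM's actual cache shape):

```python
import torch

def kv_cache_stride_order(cache: torch.Tensor) -> list[int]:
    # Dimensions sorted by stride, largest first; this is the identity
    # permutation exactly when the tensor is laid out contiguously.
    return sorted(range(cache.dim()), key=cache.stride, reverse=True)

# A contiguous [num_blocks, block_size, num_heads, head_dim] cache has
# stride order (0, 1, 2, 3), so a flat contiguous view is valid; a
# permuted layout would silently address the wrong elements.
cache = torch.zeros(4, 16, 2, 8)
flat = cache.view(4 * 16, 2, 8)  # valid only because the layout is contiguous
```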
Organize comments for lines 1405-1430: - Remove duplicate comment about positions/slot_mapping - Remove commented-out logger.warning - Make remaining comments concise and clear Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for lines 1334-1349: - Remove redundant CUDA graph compatibility explanation - Make fused/unfused path comments more concise - Keep essential information about KV cache write locations Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Make docstring more concise while keeping essential information about fused path behavior. Remove detailed line number references. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability: - Remove redundant explanations about parameters - Simplify fallback comment - Remove logger.info_once (not needed for production) - Make use_fused_path comment consistent with other locations Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability: - Simplify position computation comment - Remove redundant CUDA graph compatibility explanation - Clean up fused/unfused path comments - Keep important logger.info_once statements for verification - Remove inline shape comments (obvious from context) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability: - Simplify position tensor trimming comment - Remove redundant explanations in slot_mapping retrieval - Make prefill/decode path comments more concise - Simplify AITER fused kernel comments - Remove obvious update comments Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize parameter comments for better readability: - Group AITER fused kernel parameters under single comment - Simplify use_fused_path type annotation - Remove verbose parameter passing explanation - Make default value comment more concise Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability: - Remove verbose fused/unfused path explanation - Remove obvious slot_mapping comment - Make KV cache update comment more concise Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability: - Remove redundant inline parameter comments - Simplify forward_context storage comment - Remove obvious rotary_emb explanation Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability: - Remove redundant inline comments in use_aiter_fused check - Simplify kernel import section comments - Keep all logger.warning_once and logger.info statements for debugging Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability: - Group AITER parameters under single comment - Remove redundant inline parameter comments - Remove obvious rotary_emb storage comment Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability: - Simplify reshape comment - Remove redundant inline parameter comments (already in docstring) - Keep FP4/FP8 variant distinction - Simplify output split comment Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Summary
This PR adds AITER fused kernel support for DeepSeek MLA attention on AMD ROCm, implementing RoPE + KV cache fusion for both prefill and decode paths. This builds on top of vllm-project#38299 (RMSNorm + FP8 quantization fusion).
Changes
Core Functionality
Implementation Details
- `vllm/model_executor/layers/attention/mla_attention.py`: `_run_aiter_fused_decode()` for fused decode path; `forward_impl()`; `unified_mla_kv_cache_update()` to skip KV writes when using fused paths
- `vllm/model_executor/layers/mla.py`:
- `vllm/envs.py`: `VLLM_ROCM_USE_AITER` flag to enable AITER kernels (default: enabled on ROCm)

Code Quality
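A minimal sketch of the fused-path gating in the unified KV cache update (hypothetical signature, not the actual vLLM function): when the fused kernel has already written the cache, the shared update hook becomes a no-op, avoiding a duplicate write.

```python
import torch

def unified_mla_kv_cache_update_sketch(
    kv_cache: torch.Tensor,      # [num_slots, D] flat per-layer cache view
    k: torch.Tensor,             # [T, D] keys for this batch
    slot_mapping: torch.Tensor,  # [T] destination slot per token
    use_fused_path: bool,
) -> None:
    if use_fused_path:
        # The fused AITER kernel already wrote RoPE'd keys into the
        # cache; writing again would be redundant work.
        return
    kv_cache[slot_mapping] = k
```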
Testing
Tested on AMD MI300X with DeepSeek-V3:
Performance
Expected improvements on AMD MI300X:
Dependencies
🤖 Generated with Claude Code