[ROCm] Add AITER RoPE + KV cache fusion for MLA prefill and decode#2

Closed
khairulkabir1661 wants to merge 1287 commits into main from rocm-mla-aiter-rope-kv-fusion

Conversation

@khairulkabir1661
Owner

Summary

This PR adds AITER fused kernel support for DeepSeek MLA attention on AMD ROCm, implementing RoPE + KV cache fusion for both prefill and decode paths. This builds on top of vllm-project#38299 (RMSNorm + FP8 quantization fusion).

Changes

Core Functionality

  • AITER fused decode kernel: Fuses RoPE application and KV cache writes for decode tokens using AMD's AITER library
  • Prefill RoPE + KV cache fusion: Separate fusion path for prefill tokens
  • Mixed batch handling: Correctly handles batches containing both prefill and decode tokens
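Conceptually, the fusion above collapses two memory passes (apply RoPE, write it back, then scatter into the cache) into one. A minimal NumPy stand-in with hypothetical helper names; the real fused kernel lives in AMD's AITER library and operates on the paged KV cache:

```python
import numpy as np

def apply_rope(x, cos, sin):
    """Rotate-half RoPE. x: (tokens, heads, dim); cos/sin: (tokens, 1, dim/2)."""
    x1, x2 = np.split(x, 2, axis=-1)
    rotated = np.concatenate([-x2, x1], axis=-1)
    return (x * np.concatenate([cos, cos], axis=-1)
            + rotated * np.concatenate([sin, sin], axis=-1))

def fused_rope_kv_write(k, cos, sin, kv_cache, slot_mapping):
    """One conceptual kernel: apply RoPE and scatter the result into the cache.

    Unfused, k_roped would be written to global memory and re-read before the
    cache scatter; fusing keeps it in registers for a single pass.
    """
    k_roped = apply_rope(k, cos, sin)
    kv_cache[slot_mapping] = k_roped
    return k_roped
```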

Implementation Details

  • vllm/model_executor/layers/attention/mla_attention.py:
    • Add _run_aiter_fused_decode() for fused decode path
    • Add prefill fusion in forward_impl()
    • Update unified_mla_kv_cache_update() to skip KV writes when using fused paths
    • Add custom ops for torch.compile/CUDA graph compatibility
  • vllm/model_executor/layers/mla.py:
    • Pass RoPE caches and modules to MLAAttention
    • Skip RoPE in mla.py when using fused path (applied in custom op instead)
  • vllm/envs.py:
    • Add VLLM_ROCM_USE_AITER flag to enable AITER kernels (default: enabled on ROCm)
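A sketch of the flag gating described above. This is a simplified stand-in: the real check lives in vllm/envs.py together with platform detection, and per this PR the flag defaults to enabled on ROCm:

```python
import os

def use_aiter_fused_kernels(on_rocm: bool) -> bool:
    # VLLM_ROCM_USE_AITER gates the AITER kernels; per this PR it defaults
    # to enabled ("1") on ROCm. Non-ROCm platforms never use them.
    flag = os.environ.get("VLLM_ROCM_USE_AITER", "1")
    return on_rocm and flag.lower() in ("1", "true")
```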

Code Quality

  • Clean up verbose comments throughout mla_attention.py
  • Remove unused parameters and debug logging
  • Simplify docstrings and inline comments

Testing

Tested on AMD MI300X with DeepSeek-V3:

  • ✅ Mixed batches (prefill + decode)
  • ✅ Pure prefill batches
  • ✅ Pure decode batches
  • ✅ CUDA graph mode
  • ✅ Eager mode
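The batch shapes exercised above can be illustrated with a small dispatch sketch (hypothetical callables, assuming vLLM's decode-tokens-first ordering within a batch):

```python
def dispatch_mixed_batch(tokens, num_decode_tokens, fused_decode_fn, prefill_fn):
    """Route a batch to the fused decode path and the prefill path.

    Assumes decode tokens are ordered first in the batch. Covers pure decode
    (num_decode_tokens == len(tokens)), pure prefill (num_decode_tokens == 0),
    and mixed batches.
    """
    out = []
    if num_decode_tokens:
        out.extend(fused_decode_fn(tokens[:num_decode_tokens]))
    if num_decode_tokens < len(tokens):
        out.extend(prefill_fn(tokens[num_decode_tokens:]))
    return out
```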

Performance

Expected improvements on AMD MI300X:

  • Reduced memory bandwidth usage (fused RoPE + KV cache write)
  • Reduced kernel launch overhead (fewer separate kernel launches)

Dependencies

  • Builds on vllm-project#38299 (RMSNorm + FP8 quantization fusion)

🤖 Generated with Claude Code

The rope_applied parameter was being passed through forward_impl and custom ops but never actually used in any logic. With AITER fusion always enabled, RoPE application is now determined solely by the use_aiter_fused flag.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Restore three upstream changes in MLACommonImpl that were accidentally
removed in initial AITER commits:

1. Add back logger.info_once for backend selection (TRT-LLM, FlashInfer,
   CUDNN, FlashAttention) - helpful for debugging
2. Restore FA4 support in _pad_v logic - FA4 natively handles different
   head dimensions like FA3 on Hopper
3. Restore params_dtype fallback for AWQ/GPTQ quantized models (PR vllm-project#34695)
   - Quantized layers may lack .weight attribute

These changes are in MLACommonImpl (shared backend selector), not related
to AITER fused kernel functionality which is in MLAAttention class.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Restore four upstream changes in backend and metadata builder sections that
were accidentally removed in initial AITER commits:

1. XPU support: Restore Intel XPU flash attention ops import
2. KV cache stride order: Restore identity permutation (0,1,2,3) for
   contiguous per-layer views - MLA kernels require this layout
3. Blackwell comment: Remove outdated comment (not in current origin/main)
4. FP8 prefill logging: Restore helpful logs for FP8 prefill status

These changes are in shared infrastructure (MLACommonBackend,
MLACommonMetadataBuilder) not related to AITER fused kernel functionality.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for lines 1405-1430:
- Remove duplicate comment about positions/slot_mapping
- Remove commented-out logger.warning
- Make remaining comments concise and clear

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for lines 1334-1349:
- Remove redundant CUDA graph compatibility explanation
- Make fused/unfused path comments more concise
- Keep essential information about KV cache write locations

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Make docstring more concise while keeping essential information about
fused path behavior. Remove detailed line number references.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Remove redundant explanations about parameters
- Simplify fallback comment
- Remove logger.info_once (not needed for production)
- Make use_fused_path comment consistent with other locations

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Simplify position computation comment
- Remove redundant CUDA graph compatibility explanation
- Clean up fused/unfused path comments
- Keep important logger.info_once statements for verification
- Remove inline shape comments (obvious from context)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Simplify position tensor trimming comment
- Remove redundant explanations in slot_mapping retrieval
- Make prefill/decode path comments more concise
- Simplify AITER fused kernel comments
- Remove obvious update comments

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize parameter comments for better readability:
- Group AITER fused kernel parameters under single comment
- Simplify use_fused_path type annotation
- Remove verbose parameter passing explanation
- Make default value comment more concise

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability:
- Remove verbose fused/unfused path explanation
- Remove obvious slot_mapping comment
- Make KV cache update comment more concise

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability:
- Remove redundant inline parameter comments
- Simplify forward_context storage comment
- Remove obvious rotary_emb explanation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Remove redundant inline comments in use_aiter_fused check
- Simplify kernel import section comments
- Keep all logger.warning_once and logger.info statements for debugging

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability:
- Group AITER parameters under single comment
- Remove redundant inline parameter comments
- Remove obvious rotary_emb storage comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Simplify reshape comment
- Remove redundant inline parameter comments (already in docstring)
- Keep FP4/FP8 variant distinction
- Simplify output split comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they run only fastcheck CI, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀
