[ROCm] Add AITER RoPE + KV cache fusion for MLA prefill and decode#2

Closed
khairulkabir1661 wants to merge 1287 commits into main from rocm-mla-aiter-rope-kv-fusion

Conversation

@khairulkabir1661
Owner

Summary

This PR adds AITER fused kernel support for DeepSeek MLA attention on AMD ROCm, implementing RoPE + KV cache fusion for both prefill and decode paths. This builds on top of vllm-project#38299 (RMSNorm + FP8 quantization fusion).

Changes

Core Functionality

  • AITER fused decode kernel: Fuses RoPE application and KV cache writes for decode tokens using AMD's AITER library
  • Prefill RoPE + KV cache fusion: Separate fusion path for prefill tokens
  • Mixed batch handling: Correctly handles batches containing both prefill and decode tokens
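Conceptually, the fusion above collapses two memory passes (apply RoPE, write it back, then scatter into the cache) into one. A minimal NumPy stand-in with hypothetical helper names; the real fused kernel lives in AMD's AITER library and operates on the paged KV cache:

```python
import numpy as np

def apply_rope(x, cos, sin):
    """Rotate-half RoPE. x: (tokens, heads, dim); cos/sin: (tokens, 1, dim/2)."""
    x1, x2 = np.split(x, 2, axis=-1)
    rotated = np.concatenate([-x2, x1], axis=-1)
    return (x * np.concatenate([cos, cos], axis=-1)
            + rotated * np.concatenate([sin, sin], axis=-1))

def fused_rope_kv_write(k, cos, sin, kv_cache, slot_mapping):
    """One conceptual kernel: apply RoPE and scatter the result into the cache.

    Unfused, k_roped would be written to global memory and re-read before the
    cache scatter; fusing keeps it in registers for a single pass.
    """
    k_roped = apply_rope(k, cos, sin)
    kv_cache[slot_mapping] = k_roped
    return k_roped
```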

Implementation Details

  • vllm/model_executor/layers/attention/mla_attention.py:
    • Add _run_aiter_fused_decode() for fused decode path
    • Add prefill fusion in forward_impl()
    • Update unified_mla_kv_cache_update() to skip KV writes when using fused paths
    • Add custom ops for torch.compile/CUDA graph compatibility
  • vllm/model_executor/layers/mla.py:
    • Pass RoPE caches and modules to MLAAttention
    • Skip RoPE in mla.py when using fused path (applied in custom op instead)
  • vllm/envs.py:
    • Add VLLM_ROCM_USE_AITER flag to enable AITER kernels (default: enabled on ROCm)
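A sketch of the flag gating described above. This is a simplified stand-in: the real check lives in vllm/envs.py together with platform detection, and per this PR the flag defaults to enabled on ROCm:

```python
import os

def use_aiter_fused_kernels(on_rocm: bool) -> bool:
    # VLLM_ROCM_USE_AITER gates the AITER kernels; per this PR it defaults
    # to enabled ("1") on ROCm. Non-ROCm platforms never use them.
    flag = os.environ.get("VLLM_ROCM_USE_AITER", "1")
    return on_rocm and flag.lower() in ("1", "true")
```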

Code Quality

  • Clean up verbose comments throughout mla_attention.py
  • Remove unused parameters and debug logging
  • Simplify docstrings and inline comments

Testing

Tested on AMD MI300X with DeepSeek-V3:

  • ✅ Mixed batches (prefill + decode)
  • ✅ Pure prefill batches
  • ✅ Pure decode batches
  • ✅ CUDA graph mode
  • ✅ Eager mode
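The batch shapes exercised above can be illustrated with a small dispatch sketch (hypothetical callables, assuming vLLM's decode-tokens-first ordering within a batch):

```python
def dispatch_mixed_batch(tokens, num_decode_tokens, fused_decode_fn, prefill_fn):
    """Route a batch to the fused decode path and the prefill path.

    Assumes decode tokens are ordered first in the batch. Covers pure decode
    (num_decode_tokens == len(tokens)), pure prefill (num_decode_tokens == 0),
    and mixed batches.
    """
    out = []
    if num_decode_tokens:
        out.extend(fused_decode_fn(tokens[:num_decode_tokens]))
    if num_decode_tokens < len(tokens):
        out.extend(prefill_fn(tokens[num_decode_tokens:]))
    return out
```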

Performance

Expected improvements on AMD MI300X:

  • Reduced memory bandwidth usage (fused RoPE + KV cache write)
  • Reduced kernel launch overhead (fewer separate kernel launches)

Dependencies

  • Builds on vllm-project#38299 (RMSNorm + FP8 quantization fusion)

🤖 Generated with Claude Code

The rope_applied parameter was being passed through forward_impl and custom ops but never actually used in any logic. With AITER fusion always enabled, RoPE application is now determined solely by the use_aiter_fused flag.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Restore three upstream changes in MLACommonImpl that were accidentally
removed in initial AITER commits:

1. Add back logger.info_once for backend selection (TRT-LLM, FlashInfer,
   CUDNN, FlashAttention) - helpful for debugging
2. Restore FA4 support in _pad_v logic - FA4 natively handles different
   head dimensions like FA3 on Hopper
3. Restore params_dtype fallback for AWQ/GPTQ quantized models (PR vllm-project#34695)
   - Quantized layers may lack .weight attribute

These changes are in MLACommonImpl (shared backend selector), not related
to AITER fused kernel functionality which is in MLAAttention class.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Restore four upstream changes in backend and metadata builder sections that
were accidentally removed in initial AITER commits:

1. XPU support: Restore Intel XPU flash attention ops import
2. KV cache stride order: Restore identity permutation (0,1,2,3) for
   contiguous per-layer views - MLA kernels require this layout
3. Blackwell comment: Remove outdated comment (not in current origin/main)
4. FP8 prefill logging: Restore helpful logs for FP8 prefill status

These changes are in shared infrastructure (MLACommonBackend,
MLACommonMetadataBuilder) not related to AITER fused kernel functionality.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for lines 1405-1430:
- Remove duplicate comment about positions/slot_mapping
- Remove commented-out logger.warning
- Make remaining comments concise and clear

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for lines 1334-1349:
- Remove redundant CUDA graph compatibility explanation
- Make fused/unfused path comments more concise
- Keep essential information about KV cache write locations

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Make docstring more concise while keeping essential information about
fused path behavior. Remove detailed line number references.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Remove redundant explanations about parameters
- Simplify fallback comment
- Remove logger.info_once (not needed for production)
- Make use_fused_path comment consistent with other locations

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Simplify position computation comment
- Remove redundant CUDA graph compatibility explanation
- Clean up fused/unfused path comments
- Keep important logger.info_once statements for verification
- Remove inline shape comments (obvious from context)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Simplify position tensor trimming comment
- Remove redundant explanations in slot_mapping retrieval
- Make prefill/decode path comments more concise
- Simplify AITER fused kernel comments
- Remove obvious update comments

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize parameter comments for better readability:
- Group AITER fused kernel parameters under single comment
- Simplify use_fused_path type annotation
- Remove verbose parameter passing explanation
- Make default value comment more concise

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability:
- Remove verbose fused/unfused path explanation
- Remove obvious slot_mapping comment
- Make KV cache update comment more concise

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability:
- Remove redundant inline parameter comments
- Simplify forward_context storage comment
- Remove obvious rotary_emb explanation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Remove redundant inline comments in use_aiter_fused check
- Simplify kernel import section comments
- Keep all logger.warning_once and logger.info statements for debugging

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Simplify comments for better readability:
- Group AITER parameters under single comment
- Remove redundant inline parameter comments
- Remove obvious rotary_emb storage comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Organize comments for better readability:
- Simplify reshape comment
- Remove redundant inline parameter comments (already in docstring)
- Keep FP4/FP8 variant distinction
- Simplify output split comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they run only fastcheck CI, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀
