[ROCm] Utilize persistent MLA kernel from AITER #36574
tjtanaa merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for a persistent mode MLA decode kernel on ROCm, aimed at improving performance by avoiding kernel launch overheads. The changes are well-contained and gated behind a new environment variable VLLM_ROCM_USE_AITER_MLA_PERSISTENT. The implementation correctly pre-allocates and fills scheduling buffers when the feature is enabled, and passes them through the attention call stack. My review includes one suggestion to improve the robustness of the implementation by making an implicit contract between components explicit with assertions, which will help prevent potential issues in the future.
Note: Security Review did not run due to the size of the PR.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add a ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @SKPsanjeevi, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed: e77b2ab to 4e17a49
Force-pushed: d8b7584 to f642b9c
Force-pushed: f642b9c to 1a47fe0
@tjtanaa can you please take a look at this PR?
Force-pushed: d056954 to 3c02d91
@@ -115,6 +123,59 @@ def __init__(
            max_num_pages, dtype=torch.int32, device=device
        )

        if envs.VLLM_ROCM_USE_AITER_MLA_PERSISTENT:
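For context, the gated block pre-allocates the persistent scheduler's buffers. A minimal sketch of that idea, assuming int32 buffers and a plain dict of sizes standing in for whatever aiter.get_mla_metadata_info_v1 actually returns (allocate_persistent_buffers is a hypothetical helper, not the PR's code):

```python
import torch

# The six buffer names come from the PR description; dtypes and the
# dict-of-sizes input are assumptions in this sketch.
_BUFFER_NAMES = (
    "work_meta_data", "work_indptr", "work_info_set",
    "reduce_indptr", "reduce_final_map", "reduce_partial_map",
)

def allocate_persistent_buffers(
    sizes: dict[str, int], device: torch.device
) -> dict[str, torch.Tensor]:
    # Allocate each scheduling buffer once, up front, so the persistent
    # kernel can reuse them every decode step without reallocation.
    return {
        name: torch.empty(sizes[name], dtype=torch.int32, device=device)
        for name in _BUFFER_NAMES
    }
```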
We are now strongly discouraged from using environment variables to trigger code paths. So in this case we need to decide whether to leave it all on or all off, or find runtime values we can use to determine whether the persistent kernel should be enabled.
Would prefer it the way it is, because the improvements are noted only for specific models. Once more data points emerge, we could flip the default to True and then eventually remove the flag (always enabled).
Update: Removed the persistent env variable.
> Would prefer the way it is because the improvements are noted for specific models.

Please share more information about this. We should try to bake the selection logic into the code (based on the shapes of the tensors) so that the persistent kernel is triggered only in the cases where you know it provides a speed boost; otherwise we use the non-persistent MLA kernel to avoid a perf regression or kernel failure.
And is there a tensor-shape constraint for persistent MLA? Is the constraint tighter and more limited than for the non-persistent MLA kernel?
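For illustration, a minimal sketch of the kind of runtime dispatch being asked for, with should_use_persistent_mla as a hypothetical predicate; the head-count whitelist echoes the 16/128 constraint mentioned later in this thread, and the token threshold is a made-up placeholder, not a measured crossover:

```python
def should_use_persistent_mla(num_heads: int, num_decode_tokens: int) -> bool:
    # Route to the persistent kernel only for configurations known to be
    # faster; everything else stays on the non-persistent path.
    KNOWN_GOOD_HEAD_COUNTS = (16, 128)  # assumed supported configurations
    MIN_DECODE_TOKENS = 64              # placeholder crossover, not measured
    return (num_heads in KNOWN_GOOD_HEAD_COUNTS
            and num_decode_tokens >= MIN_DECODE_TOKENS)
```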
            uni_seqlen_qo=max_qo_len,
            fast_mode=True,
        )
        decode_work_meta_data = self._mla_work_meta_data
Why are we setting an intermediate variable here? This shouldn't be needed. In addition, I would strongly recommend we pass this information as part of the AiterMLAMetadata and reference it from there in the code. That's easier than adding six new arguments to a handful of function calls.
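A minimal sketch of that suggestion, assuming the six buffer fields from the PR description; the real AiterMLAMetadata carries many more fields than shown here:

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class AiterMLAMetadata:
    # Persistent-kernel scheduling buffers ride on the metadata object, so
    # downstream calls take one argument instead of six. All fields stay
    # None when the persistent path is disabled.
    work_meta_data: Optional[torch.Tensor] = None
    work_indptr: Optional[torch.Tensor] = None
    work_info_set: Optional[torch.Tensor] = None
    reduce_indptr: Optional[torch.Tensor] = None
    reduce_final_map: Optional[torch.Tensor] = None
    reduce_partial_map: Optional[torch.Tensor] = None
```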
Hi @SKPsanjeevi, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
@tjtanaa yes, the current PR is tested with that AITER commit only (c3708fb7445899c14cdc6e8055953ee02ed78ddf).
@SKPsanjeevi thank you for driving this to completion! We appreciate your work on it!
@SKPsanjeevi I am trying to validate this across different models to understand whether the constraints on using the MLA backend become smaller. Kimi K2.5 TP8 does not run (before this PR, TP8 also could not run; the recipe that AMD put out only mentions TP4: https://github.com/vllm-project/recipes/pull/296). Kimi K2.5 TP4 with bf16 KV cache and fp8 KV cache works.
@tjtanaa this particular PR does not change the existing constraints/limitations; it focuses only on performance. The current AITER MLA backend, I believe, supports only 16 or 128 heads. Therefore Kimi's 64 heads work only with TP=4 (64 / 4 = 16 heads per GPU).
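A quick sketch of that arithmetic, assuming the 16/128 heads-per-rank constraint quoted above (tp_is_supported is a hypothetical helper, not vLLM code):

```python
SUPPORTED_HEADS_PER_RANK = (16, 128)  # assumed from the comment above

def tp_is_supported(total_heads: int, tp: int) -> bool:
    # Heads must divide evenly across ranks and land on a supported count.
    return total_heads % tp == 0 and total_heads // tp in SUPPORTED_HEADS_PER_RANK

assert tp_is_supported(64, 4)       # Kimi: 64 / 4 = 16 heads per GPU -> OK
assert not tp_is_supported(64, 8)   # 64 / 8 = 8 heads per GPU -> unsupported
assert tp_is_supported(128, 8)      # DeepSeek-R1: 128 / 8 = 16 -> OK
```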
Ah ok. I saw your comments here, so I decided to do some testing.
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com>
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>

Base image: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_0330_vllm018
Patches applied:
- AITER SplitK bug fix (ROCm/aiter#2508)
- vLLM persistent MLA kernel (vllm-project/vllm#36574)
- vLLM fused AllReduce+RMSNorm (vllm-project/vllm#37891)
Made-with: Cursor
Purpose
Add support for aiter's persistent mode MLA decode kernel on ROCm. The persistent
kernel stays resident on GPU CUs and processes work items from pre-computed
scheduling metadata, avoiding per-batch kernel launch overhead.
- Pre-allocates the scheduling buffers (work_meta_data, work_indptr, work_info_set, reduce_indptr, reduce_final_map, reduce_partial_map) during AiterMLAMetadataBuilder initialization via aiter.get_mla_metadata_info_v1.
- Fills them when building decode metadata via aiter.get_mla_metadata_v1.
- Stores the buffers on AiterMLAMetadata and passes them through rocm_aiter_ops.mla_decode_fwd into aiter.mla.mla_decode_fwd, which uses the persistent kernel path (a sketch of this flow is shown below).
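A hedged sketch of this three-step flow, with init_builder, build_metadata, and decode as hypothetical wrappers; the aiter entry points are named in the PR, but their signatures are assumptions here, so placeholders stand in for the real calls:

```python
import torch

def init_builder(sizes: dict[str, int], device: torch.device) -> dict[str, torch.Tensor]:
    # Step 1 (once, in AiterMLAMetadataBuilder.__init__): query buffer sizes
    # via aiter.get_mla_metadata_info_v1, then pre-allocate them (see the
    # allocation sketch earlier in the thread).
    return {name: torch.empty(n, dtype=torch.int32, device=device)
            for name, n in sizes.items()}

def build_metadata(buffers: dict[str, torch.Tensor], seq_lens: torch.Tensor) -> None:
    # Step 2 (each step): aiter.get_mla_metadata_v1 would fill the buffers
    # in place with the work schedule for the current batch.
    ...

def decode(q: torch.Tensor, kv_cache: torch.Tensor, buffers: dict) -> None:
    # Step 3 (forward): the buffers ride on AiterMLAMetadata through
    # rocm_aiter_ops.mla_decode_fwd into aiter.mla.mla_decode_fwd, which
    # executes the persistent kernel against the precomputed schedule.
    ...
```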
Test Plan
Build the vLLM image without and with the persistent MLA.

Base image: vllm/vllm-openai-rocm:v0.17.1
AITER commit: c3708fb74 (v0.1.10.post2)
vLLM commit (with persistent MLA): ab8159d
vLLM commit (without persistent MLA): 8798507
Deepseek-R1-0528-MXFP4, TP=8:
Test Results
Deepseek-R1-0528-MXFP4 TP=8, TTPS per GPU listed below:
Accuracy
lm_eval scores for gsm8k, 250-sample limit:

openai-completions ({'model': '/data/workloads-inference/models/DeepSeek-R1-0528-MXFP4', 'base_url': 'http://0.0.0.0:8888/v1/completions', 'add_bos_token': True, 'enforce_eager': True, 'num_concurrent': 64, 'max_retries': 10, 'max_gen_toks': 1024, 'tokenizer_backend': 'huggingface'}), gen_kwargs: ({}), limit: 250.0, num_fewshot: 5, batch_size: 64
Without persistent MLA:
With persistent MLA: