[ROCm] Utilize persistent MLA kernel from AITER #36574
tjtanaa merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for a persistent mode MLA decode kernel on ROCm, aimed at improving performance by avoiding kernel launch overheads. The changes are well-contained and gated behind a new environment variable VLLM_ROCM_USE_AITER_MLA_PERSISTENT. The implementation correctly pre-allocates and fills scheduling buffers when the feature is enabled, and passes them through the attention call stack. My review includes one suggestion to improve the robustness of the implementation by making an implicit contract between components explicit with assertions, which will help prevent potential issues in the future.
Note: Security Review did not run due to the size of the PR.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add a ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @SKPsanjeevi, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed: e77b2ab to 4e17a49
Force-pushed: d8b7584 to f642b9c
Force-pushed: f642b9c to 1a47fe0
@tjtanaa can you please take a look at this PR?
Force-pushed: d056954 to 3c02d91
@@ -115,6 +123,59 @@ def __init__(
            max_num_pages, dtype=torch.int32, device=device
        )

        if envs.VLLM_ROCM_USE_AITER_MLA_PERSISTENT:
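For context, the gated block pre-allocates the persistent scheduler's buffers. A minimal sketch of that idea, assuming int32 buffers and a plain dict of sizes standing in for whatever aiter.get_mla_metadata_info_v1 actually returns (allocate_persistent_buffers is a hypothetical helper, not the PR's code):

```python
import torch

# The six buffer names come from the PR description; dtypes and the
# dict-of-sizes input are assumptions in this sketch.
_BUFFER_NAMES = (
    "work_meta_data", "work_indptr", "work_info_set",
    "reduce_indptr", "reduce_final_map", "reduce_partial_map",
)

def allocate_persistent_buffers(
    sizes: dict[str, int], device: torch.device
) -> dict[str, torch.Tensor]:
    # Allocate each scheduling buffer once, up front, so the persistent
    # kernel can reuse them every decode step without reallocation.
    return {
        name: torch.empty(sizes[name], dtype=torch.int32, device=device)
        for name in _BUFFER_NAMES
    }
```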
We are now strongly discouraged from using environment variables to trigger code paths. So in this case we need to decide whether to leave it all on or all off, or find runtime values we can use to determine whether the persistent kernel should be enabled.
Would prefer it the way it is, because the improvements are noted only for specific models. Once more data points emerge, we could flip the default to True and then eventually remove the flag (always enabled).
Update: Removed the persistent env variable.
> Would prefer the way it is because the improvements are noted for specific models.

Please share more information about this. We should try to bake the selection logic into the code (based on the shapes of the tensors) so that the persistent kernel is triggered only in the cases where you know it provides a speed boost; otherwise we use the non-persistent MLA kernel to avoid a perf regression or kernel failure.
And is there a tensor-shape constraint for persistent MLA? Is the constraint tighter and more limited than for the non-persistent MLA kernel?
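For illustration, a minimal sketch of the kind of runtime dispatch being asked for, with should_use_persistent_mla as a hypothetical predicate; the head-count whitelist echoes the 16/128 constraint mentioned later in this thread, and the token threshold is a made-up placeholder, not a measured crossover:

```python
def should_use_persistent_mla(num_heads: int, num_decode_tokens: int) -> bool:
    # Route to the persistent kernel only for configurations known to be
    # faster; everything else stays on the non-persistent path.
    KNOWN_GOOD_HEAD_COUNTS = (16, 128)  # assumed supported configurations
    MIN_DECODE_TOKENS = 64              # placeholder crossover, not measured
    return (num_heads in KNOWN_GOOD_HEAD_COUNTS
            and num_decode_tokens >= MIN_DECODE_TOKENS)
```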
            uni_seqlen_qo=max_qo_len,
            fast_mode=True,
        )
        decode_work_meta_data = self._mla_work_meta_data
Why are we setting an intermediate variable here? This shouldn't be needed. In addition, I would strongly recommend we pass this information as part of the AiterMLAMetadata and reference it from there in the code. That's easier than adding six new arguments to a handful of function calls.
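A minimal sketch of that suggestion, assuming the six buffer fields from the PR description; the real AiterMLAMetadata carries many more fields than shown here:

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class AiterMLAMetadata:
    # Persistent-kernel scheduling buffers ride on the metadata object, so
    # downstream calls take one argument instead of six. All fields stay
    # None when the persistent path is disabled.
    work_meta_data: Optional[torch.Tensor] = None
    work_indptr: Optional[torch.Tensor] = None
    work_info_set: Optional[torch.Tensor] = None
    reduce_indptr: Optional[torch.Tensor] = None
    reduce_final_map: Optional[torch.Tensor] = None
    reduce_partial_map: Optional[torch.Tensor] = None
```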
Hi @SKPsanjeevi, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
@tjtanaa yes, the current PR is tested with that AITER commit only (c3708fb7445899c14cdc6e8055953ee02ed78ddf).
@SKPsanjeevi thank you for driving this to completion! We appreciate your work on it!
@SKPsanjeevi I am trying to validate this across different models to understand whether the constraints on using the MLA backend become smaller. Kimi K2.5 TP8 does not run (before this PR, TP8 also could not run; the recipe that AMD put out only mentions TP4: https://github.com/vllm-project/recipes/pull/296). Kimi K2.5 TP4 with bf16 KV cache and fp8 KV cache works.
@tjtanaa this particular PR does not change the existing constraints/limitations; it focuses only on performance. The current AITER MLA backend, I believe, supports only 16 or 128 heads. Therefore Kimi's 64 heads work only with TP=4 (64 / 4 = 16 heads per GPU).
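A quick sketch of that arithmetic, assuming the 16/128 heads-per-rank constraint quoted above (tp_is_supported is a hypothetical helper, not vLLM code):

```python
SUPPORTED_HEADS_PER_RANK = (16, 128)  # assumed from the comment above

def tp_is_supported(total_heads: int, tp: int) -> bool:
    # Heads must divide evenly across ranks and land on a supported count.
    return total_heads % tp == 0 and total_heads // tp in SUPPORTED_HEADS_PER_RANK

assert tp_is_supported(64, 4)       # Kimi: 64 / 4 = 16 heads per GPU -> OK
assert not tp_is_supported(64, 8)   # 64 / 8 = 8 heads per GPU -> unsupported
assert tp_is_supported(128, 8)      # DeepSeek-R1: 128 / 8 = 16 -> OK
```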
Ah ok. I saw your comments here, so I decided to do some testing.
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com>
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>

Base image: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_0330_vllm018
Patches applied:
- AITER SplitK bug fix (ROCm/aiter#2508)
- vLLM persistent MLA kernel (vllm-project/vllm#36574)
- vLLM fused AllReduce+RMSNorm (vllm-project/vllm#37891)
Made-with: Cursor
Purpose
Add support for aiter's persistent mode MLA decode kernel on ROCm. The persistent
kernel stays resident on GPU CUs and processes work items from pre-computed
scheduling metadata, avoiding per-batch kernel launch overhead.
- Pre-allocates the scheduling buffers (work_meta_data, work_indptr, work_info_set, reduce_indptr, reduce_final_map, reduce_partial_map) during AiterMLAMetadataBuilder initialization via aiter.get_mla_metadata_info_v1.
- Fills them when building decode metadata via aiter.get_mla_metadata_v1.
- Stores the buffers on AiterMLAMetadata and passes them through rocm_aiter_ops.mla_decode_fwd into aiter.mla.mla_decode_fwd, which uses the persistent kernel path (a sketch of this flow is shown below).
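A hedged sketch of this three-step flow, with init_builder, build_metadata, and decode as hypothetical wrappers; the aiter entry points are named in the PR, but their signatures are assumptions here, so placeholders stand in for the real calls:

```python
import torch

def init_builder(sizes: dict[str, int], device: torch.device) -> dict[str, torch.Tensor]:
    # Step 1 (once, in AiterMLAMetadataBuilder.__init__): query buffer sizes
    # via aiter.get_mla_metadata_info_v1, then pre-allocate them (see the
    # allocation sketch earlier in the thread).
    return {name: torch.empty(n, dtype=torch.int32, device=device)
            for name, n in sizes.items()}

def build_metadata(buffers: dict[str, torch.Tensor], seq_lens: torch.Tensor) -> None:
    # Step 2 (each step): aiter.get_mla_metadata_v1 would fill the buffers
    # in place with the work schedule for the current batch.
    ...

def decode(q: torch.Tensor, kv_cache: torch.Tensor, buffers: dict) -> None:
    # Step 3 (forward): the buffers ride on AiterMLAMetadata through
    # rocm_aiter_ops.mla_decode_fwd into aiter.mla.mla_decode_fwd, which
    # executes the persistent kernel against the precomputed schedule.
    ...
```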
Test Plan
Build the vLLM image without and with the persistent MLA.

Base image: vllm/vllm-openai-rocm:v0.17.1
AITER commit: c3708fb74 (v0.1.10.post2)
vLLM commit (with persistent MLA): ab8159d
vLLM commit (without persistent MLA): 8798507
Deepseek-R1-0528-MXFP4, TP=8:
Test Results
Deepseek-R1-0528-MXFP4 TP=8, TTPS per GPU listed below:
Accuracy
lm_eval scores for gsm8k, 250-sample limit:

openai-completions ({'model': '/data/workloads-inference/models/DeepSeek-R1-0528-MXFP4', 'base_url': 'http://0.0.0.0:8888/v1/completions', 'add_bos_token': True, 'enforce_eager': True, 'num_concurrent': 64, 'max_retries': 10, 'max_gen_toks': 1024, 'tokenizer_backend': 'huggingface'}), gen_kwargs: ({}), limit: 250.0, num_fewshot: 5, batch_size: 64
Without persistent MLA:
With persistent MLA: