[ROCm] Enable gluon paged MQA logits on gfx950 (MI355X) #42062
Conversation
`rocm_fp8_paged_mqa_logits` had two branches: gfx942 (MI300X, MI325X) used `deepgemm_fp8_paged_mqa_logits` (gluon, single kernel), while gfx950 (MI355X) fell through to `deepgemm_fp8_paged_mqa_logits_stage1` (two kernels, stage1 + `sum(dim=0)`). The split was introduced in 628c436 (vllm-project#40871). With Triton >= 3.5.0, `deepgemm_fp8_paged_mqa_logits` dispatches internally to the gluon kernel on both gfx942 and gfx950. Extend the condition to `_ON_GFX942 or _ON_GFX950` so MI355X uses the same gluon path as MI300X/MI325X.

Signed-off-by: Frida Andersson <fanderss@amd.com>
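For orientation, here is a minimal runnable sketch of the routing this PR changes. The kernel bodies are stubs and the real signatures in `rocm_aiter_mla_sparse.py` differ; only the branch structure follows the description above.

```python
import torch

_ON_GFX942 = False  # MI300X / MI325X
_ON_GFX950 = True   # MI355X

def deepgemm_fp8_paged_mqa_logits(q, kv):
    # Stand-in for the single gluon kernel (real kernel lives in AITER).
    return q @ kv

def deepgemm_fp8_paged_mqa_logits_stage1(q, kv):
    # Stand-in for the split kernel: returns per-stage partial logits.
    partial = q @ kv
    return torch.stack([partial, torch.zeros_like(partial)])

def rocm_fp8_paged_mqa_logits(q, kv):
    if _ON_GFX942 or _ON_GFX950:  # post-PR condition
        return deepgemm_fp8_paged_mqa_logits(q, kv)
    # Pre-PR, gfx950 fell through here: two kernels, stage1 + reduction.
    return deepgemm_fp8_paged_mqa_logits_stage1(q, kv).sum(dim=0)

q, kv = torch.randn(4, 8), torch.randn(8, 16)
print(rocm_fp8_paged_mqa_logits(q, kv).shape)  # torch.Size([4, 16])
```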
Code Review
This pull request adds support for the GFX950 architecture in the ROCm AITER MLA sparse attention operations by updating imports and conditional logic. The reviewer suggests using a unified `_ON_MI3XX` flag instead of listing individual architectures to improve maintainability and simplify the code.
```diff
 if current_platform.is_rocm():
-    from vllm.platforms.rocm import _ON_GFX942
+    from vllm.platforms.rocm import _ON_GFX942, _ON_GFX950
 else:
     _ON_GFX942 = False
+    _ON_GFX950 = False
```
For better maintainability, and to avoid listing individual GPU architectures, consider using the `_ON_MI3XX` flag, which is a boolean that is true for both gfx942 and gfx950. This will make the code cleaner and easier to extend for future MI300-series GPUs.
Suggested change:

```diff
 if current_platform.is_rocm():
-    from vllm.platforms.rocm import _ON_GFX942, _ON_GFX950
+    from vllm.platforms.rocm import _ON_MI3XX
 else:
-    _ON_GFX942 = False
-    _ON_GFX950 = False
+    _ON_MI3XX = False
```
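For context, a hedged sketch of how such a family-level flag could be derived from the per-arch flags. The actual definitions live in `vllm/platforms/rocm.py`; the `gcn_arch` lookup below is illustrative, not the real vLLM code.

```python
# In practice the arch string would come from the ROCm PyTorch build, e.g.
# torch.cuda.get_device_properties(0).gcnArchName; hard-coded here for demo.
gcn_arch = "gfx950"

_ON_GFX942 = "gfx942" in gcn_arch
_ON_GFX950 = "gfx950" in gcn_arch
_ON_MI3XX = _ON_GFX942 or _ON_GFX950  # one flag covering the MI300-series family
```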
In this case we have to be more specific, as MI308 might have some exceptions.
```diff
 if aiter_paged_mqa_logits_module is not None:
-    if _ON_GFX942:
+    if _ON_GFX942 or _ON_GFX950:
```
We need both `_ON_GFX942` and `_ON_GFX950` elsewhere, so applying the recommendation above would not make sense either. Also see the comment above.
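A purely hypothetical illustration of this point: if the two architectures ever need different tuning (or MI308 needs an exception, per the comment above), separate flags keep that expressible, while a single family flag would erase the distinction. None of the values below come from vLLM or AITER.

```python
_ON_GFX942, _ON_GFX950 = False, True  # made-up example values

if _ON_GFX950:
    block_n = 128  # hypothetical arch-specific choice
elif _ON_GFX942:
    block_n = 64   # hypothetical arch-specific choice
else:
    block_n = 32   # generic fallback

print(block_n)
```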
Summary

`rocm_fp8_paged_mqa_logits` in `rocm_aiter_mla_sparse.py` had two separate branches:

- `_ON_GFX942` (MI300X, MI325X): called `deepgemm_fp8_paged_mqa_logits` → gluon single-kernel path
- everything else (including MI355X / gfx950): fell through to `deepgemm_fp8_paged_mqa_logits_stage1` → slow two-kernel path (stage1 + `sum(dim=0)`)

The split was introduced in 628c436 (#40871). With Triton >= 3.5.0, `deepgemm_fp8_paged_mqa_logits` inside AITER already dispatches to the gluon kernel for both gfx942 and gfx950 (it asserts on both and sets `cdna_version`/`TotalCuCount` per GPU internally). The `_ON_GFX942`-only guard in vLLM was therefore incorrect: it was accidentally routing MI355X to the slow stage1 path.

Fix: extend the condition from `if _ON_GFX942` to `if _ON_GFX942 or _ON_GFX950` so MI355X takes the gluon path.

Before: (screenshot in the original PR)

After: (screenshot in the original PR)
Test plan

- Run with `--block-size 64` on MI355X and confirm `_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle` is dispatched (previously `_stage1`).
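One illustrative way to confirm which kernel is dispatched is to trace Python-level calls by name. The kernel names come from this PR; the tracing itself is a local debugging aid, not part of the change, and only catches calls made at the Python layer.

```python
import sys

WATCH = {
    "deepgemm_fp8_paged_mqa_logits",
    "deepgemm_fp8_paged_mqa_logits_stage1",
    "_gluon_deepgemm_fp8_paged_mqa_logits_preshuffle",
}

def tracer(frame, event, arg):
    # Print whenever one of the watched functions is entered.
    if event == "call" and frame.f_code.co_name in WATCH:
        print(f"dispatched: {frame.f_code.co_name}")
    return None

sys.settrace(tracer)
# ... exercise the attention op here, then disable with sys.settrace(None)
```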