Skip to content

[ROCm] Enable gluon paged MQA logits on gfx950 (MI355X)#42062

Open
frida-andersson wants to merge 2 commits intovllm-project:mainfrom
frida-andersson:rocm/gluon-gfx950-dispatch
Open

[ROCm] Enable gluon paged MQA logits on gfx950 (MI355X)#42062
frida-andersson wants to merge 2 commits intovllm-project:mainfrom
frida-andersson:rocm/gluon-gfx950-dispatch

Conversation

@frida-andersson
Copy link
Copy Markdown

@frida-andersson frida-andersson commented May 8, 2026

Summary

rocm_fp8_paged_mqa_logits in rocm_aiter_mla_sparse.py had two separate branches:

  • _ON_GFX942 (MI300X, MI325X): called deepgemm_fp8_paged_mqa_logits → gluon single-kernel path - everything else (including MI355X / gfx950): fell through to deepgemm_fp8_paged_mqa_logits_stage1 → slow two-kernel path (stage1 + sum(dim=0))

The split was introduced in 628c436 (#40871). With Triton >= 3.5.0, deepgemm_fp8_paged_mqa_logits inside AITER already dispatches to the gluon kernel for both gfx942 and gfx950 (it asserts on both and sets cdna_version/TotalCuCount per GPU internally). The _ON_GFX942-only guard in vLLM was therefore incorrect — it was accidentally routing MI355X to the slow stage1 path.

Fix:
Extend the condition from if _ON_GFX942 to if _ON_GFX942 or _ON_GFX950 so MI355X takes the gluon path.

Before:
Screenshot 2026-05-08 at 13 38 24

After:
Screenshot 2026-05-08 at 13 39 10

Test plan

  • DeepSeek-V3.2 TP€ with HIP graphs and --block-size 64 on MI355X — confirm _gluon_deepgemm_fp8_paged_mqa_logits_preshuffle is dispatched (previously _stage1)
  • GSM8K 5-shot flexible-extract score

rocm_fp8_paged_mqa_logits had two branches: gfx942 (MI300X, MI325X)
used deepgemm_fp8_paged_mqa_logits (gluon, single kernel), while gfx950
(MI355X) fell through to deepgemm_fp8_paged_mqa_logits_stage1 (two
kernels, stage1 + sum(dim=0)). The split was introduced in 628c436
(vllm-project#40871).

With Triton >= 3.5.0, deepgemm_fp8_paged_mqa_logits dispatches
internally to the gluon kernel for both gfx942 and gfx950. Extend
the condition to _ON_GFX942 or _ON_GFX950 so MI355X uses the same
gluon path as MI300X/MI325X.

Signed-off-by: Frida Andersson <fanderss@amd.com>
@frida-andersson frida-andersson requested a review from tjtanaa as a code owner May 8, 2026 11:55
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added rocm Related to AMD ROCm v1 labels May 8, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 8, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for the GFX950 architecture in the ROCm AITER MLA sparse attention operations by updating imports and conditional logic. The reviewer suggests using a unified _ON_MI3XX flag instead of listing individual architectures to improve maintainability and simplify the code.

Comment on lines 18 to +22
if current_platform.is_rocm():
from vllm.platforms.rocm import _ON_GFX942
from vllm.platforms.rocm import _ON_GFX942, _ON_GFX950
else:
_ON_GFX942 = False
_ON_GFX950 = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For better maintainability and to avoid listing individual GPU architectures, consider using the _ON_MI3XX flag which is a boolean that is true for both gfx942 and gfx950. This will make the code cleaner and easier to extend for future MI300-series GPUs.

Suggested change
if current_platform.is_rocm():
from vllm.platforms.rocm import _ON_GFX942
from vllm.platforms.rocm import _ON_GFX942, _ON_GFX950
else:
_ON_GFX942 = False
_ON_GFX950 = False
if current_platform.is_rocm():
from vllm.platforms.rocm import _ON_MI3XX
else:
_ON_MI3XX = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we have to be more specific as Mi308 might have some exceptions.


if aiter_paged_mqa_logits_module is not None:
if _ON_GFX942:
if _ON_GFX942 or _ON_GFX950:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Following the suggestion to use _ON_MI3XX, this condition can be simplified.

Suggested change
if _ON_GFX942 or _ON_GFX950:
if _ON_MI3XX:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need both _on_gfx942 and _on_gfx950 elsewhere, so it does not also make sense to apply above recommendation. Also see the comment above.

Copy link
Copy Markdown
Contributor

@dllehr-amd dllehr-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Copy Markdown
Collaborator

@tjtanaa tjtanaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 8, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) May 8, 2026 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

4 participants