
Support MLA decode with nhead < 16 by transparent pad-to-16#2577

Open
ChuanLi1101 wants to merge 1 commit into main from chuan/mla-nhead-pad-to-16

Conversation

@ChuanLi1101
Contributor

Summary

  • For MLA models with small query head counts (e.g., Kimi-Linear-48B-A3B with TP=8 giving nhead=4), AITER's ASM kernel has no pre-compiled support for gqa_ratio < 16, causing decode failures.
  • This PR adds transparent head padding within AITER: when nhead < 16 and evenly divides 16, Q is padded to 16 heads via repeat_interleave, the nhead=16 ASM kernel runs, and the output is then un-padded (see the sketch after this list).
  • Adjusts C++ persistent metadata generation and Python metadata sizing to accept nhead < 16.
  • Adds nhead=4 test configurations to test_mla.py and test_mla_persistent.py.
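A minimal sketch of the pad-to-16 idea described above, assuming a hypothetical `run_asm_decode` callable and simplified tensor shapes; the actual implementation lives in aiter/mla.py and covers both the non-persistent and persistent paths.

```python
import torch

ASM_MIN_HEADS = 16  # smallest gqa_ratio the pre-compiled ASM kernel supports

def decode_with_head_pad(q: torch.Tensor, kv_cache, nhead: int, run_asm_decode):
    """Hypothetical wrapper: pad query heads to 16, run the kernel, un-pad."""
    if nhead >= ASM_MIN_HEADS or ASM_MIN_HEADS % nhead != 0:
        # Already supported (or not evenly divisible): no padding applied.
        return run_asm_decode(q, kv_cache)

    repeat = ASM_MIN_HEADS // nhead  # e.g. 16 // 4 == 4 for nhead=4 (TP=8)
    # q: [batch, nhead, head_dim] -> [batch, 16, head_dim],
    # duplicating each head `repeat` times contiguously.
    q_padded = q.repeat_interleave(repeat, dim=1)
    out_padded = run_asm_decode(q_padded, kv_cache)
    # Duplicated heads see identical queries, so keep one copy per original head.
    return out_padded[:, ::repeat, :].contiguous()
```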

Changes

  • aiter/mla.py: Pad Q heads to 16 when nhead < 16 (both non-persistent and persistent paths) and un-pad the output before return; add safe entries in get_block_n_fp8.
  • csrc/kernels/mla/metadata/v1_2_device.cuh: Add pad_to_qh16 logic for nhead < 16 in persistent metadata generation.
  • aiter/ops/attention.py: Relax the num_head_qo % 16 assertion in get_mla_metadata_info_v1 to allow nhead < 16 with effective_num_head = 16 (see the sketch after this list).
  • op_tests/test_mla.py: Add nhead=(4,1) to default test configurations.
  • op_tests/test_mla_persistent.py: Add nhead=(4,1) to default test configurations.
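A hedged sketch of the relaxed metadata sizing mentioned for aiter/ops/attention.py; the helper name below is illustrative only, and the real check lives inside get_mla_metadata_info_v1.

```python
def effective_num_head_qo(num_head_qo: int) -> int:
    # Illustrative only: map small head counts onto the padded 16-head launch.
    if num_head_qo >= 16:
        assert num_head_qo % 16 == 0, "num_head_qo must be a multiple of 16"
        return num_head_qo
    assert 16 % num_head_qo == 0, "nhead < 16 must divide 16 evenly"
    return 16  # persistent metadata is sized for the padded 16-head kernel
```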

Test plan

  • nhead=4 BF16 decode test on MI355X (gfx950): all checkAllclose passed
  • nhead=16 BF16 regression test: all passed, no regressions
  • nhead=4 FP8 decode test (future work - needs ASM kernel support)
  • CI pipeline validation

@ChuanLi1101 ChuanLi1101 requested a review from a team April 1, 2026 10:29

github-actions Bot commented Apr 1, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-355: Run Triton tests on MI355 in addition to MI325
  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 2577 --add-label <label>

@ChuanLi1101
Contributor Author

cc @valarLip @carlushuang — Requesting expedited review. This enables MLA decode with nhead < 16 (required for GLM-5 TP=8 on MI355X). This is blocking vLLM-side PRs (vllm-project/vllm#36855, vllm-project/vllm#38665) for customer-facing GLM-5 inference. See also #2563 for the upstream issue.
