Fix fused qk concat cache mla by yzhou103 · Pull Request #1783 · ROCm/aiter

yzhou103 · 2026-01-07T08:11:58Z

Motivation

An error is reported when running in the ROCm 6.4.1 Docker environment.

Technical Details

In the ROCm 6.4.1 Docker environment, ck_tile::bf16_t is actually unsigned short, which caused the data mismatch. The data is now converted to fp32 explicitly.
Performance was also found to be very poor on ROCm 6.4.1. This is caused by the buffer_o.template set_raw interface.
I had previously replaced the set interface with set_raw because set was found to generate 2 buffer_store_dwordx2 instructions. This problem is fixed in ROCm 7.x.
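
The data mismatch described above can be illustrated with a minimal host-side sketch. It assumes a bf16 type that is nothing more than a typedef of a 16-bit unsigned integer; `bf16_raw_t`, `bf16_to_float`, `float_to_bf16`, `mul_implicit`, and `mul_explicit` are hypothetical names for illustration and are not the ck_tile API or the code in this PR.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a bf16 type that is only a typedef of unsigned short.
using bf16_raw_t = std::uint16_t;

// Widen bf16 bits to float: bf16 is the upper 16 bits of an IEEE-754 binary32.
inline float bf16_to_float(bf16_raw_t bits) {
    std::uint32_t u = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Narrow float to bf16 with round-to-nearest-even (NaN handling omitted).
inline bf16_raw_t float_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<bf16_raw_t>(u >> 16);
}

// Pitfall: with a plain integer typedef, '*' multiplies the raw bit patterns.
inline bf16_raw_t mul_implicit(bf16_raw_t a, bf16_raw_t b) {
    return static_cast<bf16_raw_t>(static_cast<std::uint32_t>(a) * b);
}

// Explicit path: convert to fp32, compute, then convert back to bf16.
inline bf16_raw_t mul_explicit(bf16_raw_t a, bf16_raw_t b) {
    return float_to_bf16(bf16_to_float(a) * bf16_to_float(b));
}
```

In this sketch, multiplying the bf16 encodings of 2.0 and 0.5 through `mul_explicit` gives the encoding of 1.0, while `mul_implicit` gives an unrelated bit pattern, which is the kind of mismatch an explicit fp32 conversion avoids.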

Test Plan

python op_tests/test_concat_cache_mla.py -c fused_qk -hd 16 -k 512 -t 1 -kvd auto -qd auto

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR fixes data mismatch and performance issues in the fused QK concat cache MLA kernel on ROCm 6.4.1, where ck_tile::bf16_t is represented as unsigned short, which causes type conversion problems.

Key changes:

Introduces explicit float conversions for all RoPE (Rotary Position Embedding) rotation operations to ensure consistent numeric behavior across different type representations
Replaces buffer_o.template set_raw() calls with buffer_o.template set() to address performance degradation in ROCm 6.4.1
Updates copyright year to 2026
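
As a rough illustration of the explicit float conversions listed above, the following host-side sketch performs a RoPE rotation with every operand widened to fp32 before any arithmetic. It assumes the "rotate-half" layout (element i is paired with element i + half) and raw 16-bit bf16 storage; the actual kernel may use a different layout, and all names here (`bf16_raw_t`, `rope_rotate_fp32`, and the helpers) are hypothetical rather than taken from cache_kernels.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using bf16_raw_t = std::uint16_t;  // hypothetical raw bf16 storage type

inline float bf16_to_float(bf16_raw_t bits) {
    std::uint32_t u = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

inline bf16_raw_t float_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFFu + ((u >> 16) & 1u);  // round to nearest even, NaN not handled
    return static_cast<bf16_raw_t>(u >> 16);
}

// Rotate one head vector. x holds an even number of bf16 values;
// cos_v and sin_v each hold x.size() / 2 fp32 values.
inline std::vector<bf16_raw_t> rope_rotate_fp32(std::vector<bf16_raw_t> x,
                                                const std::vector<float>& cos_v,
                                                const std::vector<float>& sin_v) {
    const std::size_t half = x.size() / 2;
    assert(x.size() % 2 == 0 && cos_v.size() == half && sin_v.size() == half);
    for (std::size_t i = 0; i < half; ++i) {
        // Widen both inputs to fp32 before the multiply-adds.
        const float x1 = bf16_to_float(x[i]);
        const float x2 = bf16_to_float(x[i + half]);
        x[i]        = float_to_bf16(x1 * cos_v[i] - x2 * sin_v[i]);
        x[i + half] = float_to_bf16(x2 * cos_v[i] + x1 * sin_v[i]);
    }
    return x;
}
```

Because the arithmetic never touches the 16-bit storage type directly, the result does not depend on whether the bf16 type is a class with overloaded operators or a bare integer typedef.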


csrc/kernels/cache_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

yzhou103 · 2026-01-07T10:47:02Z

Tested more cases. For the kernel fuse_qk_rope_concat_and_cache_mla_kernel_opt, changing buffer_o.template set_raw to buffer_o.template set causes a performance drop of about 8%-10%. However, the set_raw interface performs badly on ROCm 6.4.1. This kernel is used for k=512, rope=64, token>=256, and head>=4.
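
The selection condition stated in this comment can be written as a small predicate. This is only a sketch of the stated condition: the function name and parameter names are hypothetical, and the real dispatch logic in the repository may differ.

```cpp
#include <cstdint>

// Hypothetical predicate mirroring the condition in the comment above:
// k == 512, rope == 64, token >= 256, head >= 4.
inline bool uses_opt_kernel(int k, int rope, std::int64_t token, int head) {
    return k == 512 && rope == 64 && token >= 256 && head >= 4;
}
```

Under this reading, only shapes that satisfy all four conditions would be affected by the 8%-10% difference between set and set_raw.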

yzhou103 · 2026-01-12T02:45:47Z

* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1
* update
* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format
* revert set interface change
* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

yzhou103 added 2 commits January 7, 2026 15:55

fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

e7a58b4

update

578b253

yzhou103 requested review from a team and Copilot January 7, 2026 08:11

Copilot started reviewing on behalf of yzhou103 January 7, 2026 08:12

Copilot AI reviewed Jan 7, 2026

yzhou103 and others added 3 commits January 7, 2026 16:18

Apply suggestions from code review

98cd28e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'main' into fix_fused_qk_concat_cache_mla

adba71f

update format

0d3b6da

yzhou103 added 4 commits January 8, 2026 18:15

Merge branch 'main' into fix_fused_qk_concat_cache_mla

b97196f

revert set interface change

924f3d4

use gmem in opus.h to replace ck_tile::buffer_view

c4c0676

Merge branch 'main' into fix_fused_qk_concat_cache_mla

7150c5a

valarLip approved these changes Jan 12, 2026

valarLip merged commit 73dbdee into main Jan 12, 2026
17 checks passed

valarLip deleted the fix_fused_qk_concat_cache_mla branch January 12, 2026 08:58

valarLip merged 9 commits into main from fix_fused_qk_concat_cache_mla

3 participants