Fix fused qk concat cache mla #1783

Merged
valarLip merged 9 commits into main from fix_fused_qk_concat_cache_mla
Jan 12, 2026
@yzhou103 yzhou103 commented Jan 7, 2026

Motivation

Fixes an error reported when running in the rocm6.4.1 Docker image.

Technical Details

  1. In the rocm6.4.1 Docker image, ck_tile::bf16_t is actually unsigned short, which caused the data mismatch. The data is now converted to fp32 explicitly.
  2. Performance is also very bad on ROCm 6.4.1. This is caused by the buffer_o.template set_raw interface.
    I had previously replaced the set interface with set_raw because set generated 2 buffer_store_dwordx2 instructions. That problem is fixed in ROCm 7.x.
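
As a rough illustration of item 1, here is a minimal host-side sketch of how a bf16 type that is only an alias for a 16-bit integer can silently produce wrong values, and what an explicit fp32 conversion looks like. It is not the kernel code from this PR: `bf16_raw_t`, `naive_to_float`, `bf16_to_float`, `float_to_bf16`, and `mul_via_float` are hypothetical names, and the rounding shown (round-to-nearest-even, no NaN handling) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a bf16 type that is just a 16-bit integer alias.
using bf16_raw_t = std::uint16_t;

// Pitfall: an ordinary numeric conversion reads the bit pattern as an integer.
inline float naive_to_float(bf16_raw_t bits) {
    return static_cast<float>(bits);
}

// Explicit widening: bf16 holds the upper 16 bits of an IEEE-754 binary32.
inline float bf16_to_float(bf16_raw_t bits) {
    std::uint32_t u = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Explicit narrowing with round-to-nearest-even (NaN inputs are not handled).
inline bf16_raw_t float_to_bf16(float f) {
    assert(f == f && "NaN handling is omitted in this sketch");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<bf16_raw_t>(u >> 16);
}

// Arithmetic done in fp32, with the result stored back as bf16 bits.
inline bf16_raw_t mul_via_float(bf16_raw_t a, bf16_raw_t b) {
    return float_to_bf16(bf16_to_float(a) * bf16_to_float(b));
}
```

With the bit pattern `0x3FC0` (bf16 for 1.5), `naive_to_float` yields 16320.0 while `bf16_to_float` yields 1.5, which is the kind of mismatch an implicit conversion would introduce.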

Test Plan

python op_tests/test_concat_cache_mla.py -c fused_qk -hd 16 -k 512 -t 1 -kvd auto -qd auto

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This PR fixes data mismatch and performance issues in the fused QK concat cache MLA kernel on ROCm 6.4.1, where ck_tile::bf16_t is represented as unsigned short causing type conversion problems.

Key changes:

  • Introduces explicit float conversions for all RoPE (Rotary Position Embedding) rotation operations to ensure consistent numeric behavior across different type representations
  • Replaces buffer_o.template set_raw() calls with buffer_o.template set() to address performance degradation in ROCm 6.4.1
  • Updates copyright year to 2026
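
The RoPE rotation mentioned in the first bullet can be sketched as follows, with all arithmetic in fp32. This is an illustrative host-side version rather than the kernel from this PR: the function names are hypothetical, and pairing element `i` with element `i + rot_dim / 2` is an assumption (the PR does not state which pairing the kernel uses). In the kernel, values stored as bf16 or fp16 would first be widened to float and narrowed again after the rotation.

```cpp
#include <cassert>
#include <utility>

// Rotate one (x1, x2) pair by the angle whose cosine and sine are given.
inline std::pair<float, float> rope_rotate_pair(float x1, float x2,
                                                float cos_v, float sin_v) {
    return {x1 * cos_v - x2 * sin_v, x2 * cos_v + x1 * sin_v};
}

// Rotate the first rot_dim elements of x in place.
// Assumed pairing: element i is rotated together with element i + rot_dim / 2.
inline void rope_rotate_inplace(float* x, const float* cos_v,
                                const float* sin_v, int rot_dim) {
    assert(rot_dim % 2 == 0);
    const int half = rot_dim / 2;
    for (int i = 0; i < half; ++i) {
        const std::pair<float, float> r =
            rope_rotate_pair(x[i], x[i + half], cos_v[i], sin_v[i]);
        x[i] = r.first;
        x[i + half] = r.second;
    }
}
```

Keeping every operand as `float` inside `rope_rotate_pair` means the result does not depend on how the storage type happens to be defined in a given ROCm release.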

yzhou103 and others added 3 commits January 7, 2026 16:18
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
yzhou103 commented Jan 7, 2026

Tested more cases. For the kernel fuse_qk_rope_concat_and_cache_mla_kernel_opt, changing buffer_o.template set_raw to buffer_o.template set causes a performance drop of about 8%-10%. However, the set_raw interface performs badly on ROCm 6.4.1. This kernel is used for k=512, rope=64, token>=256, and head>=4.
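
The selection condition described in this comment could be written as a small predicate. This is only a sketch of the stated condition: `MlaShape` and `uses_opt_kernel` are hypothetical names, and the actual dispatch code in the repository may be structured differently.

```cpp
#include <cassert>

// Hypothetical shape descriptor; field names follow the wording of the comment.
struct MlaShape {
    int k;       // "k" as used in the comment (512 for the optimized path)
    int rope;    // rope dimension (64 for the optimized path)
    int tokens;  // number of tokens
    int heads;   // number of heads
};

// True when the shape matches the condition stated for the _opt kernel.
inline bool uses_opt_kernel(const MlaShape& s) {
    assert(s.tokens >= 0 && s.heads >= 0);
    return s.k == 512 && s.rope == 64 && s.tokens >= 256 && s.heads >= 4;
}
```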

@valarLip valarLip merged commit 73dbdee into main Jan 12, 2026
17 checks passed
@valarLip valarLip deleted the fix_fused_qk_concat_cache_mla branch January 12, 2026 08:58
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format

* revert set interface change

* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
valarLip pushed a commit that referenced this pull request Mar 18, 2026
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format

* revert set interface change

* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>