[AMD] f16_gemm gluon kernel improve pipeline #10057

Merged
antiagainst merged 1 commit into triton-lang:main from ROCm:dtanner/f16_gemm_pipeline
Apr 18, 2026

@guacamoleo
Contributor

Improve f16 gemm gfx1250-gluon performance. This speeds up gemm_tdm_pipelined_single_warp_per_simd_schedule_kernel by issuing tdm.load earlier: the load moves from the top of the loop, where it hides three quarters of a loop iteration's worth of cycles, to right after the wait, where it hides a full loop iteration's worth of cycles.
This only fixes the kernel named above; other kernels need independent benchmarking and tuning.
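The latency-hiding arithmetic above can be illustrated with a toy model. This is only a sketch, not the kernel's code: it assumes the wait sits three quarters of the way through the loop body (inferred from the 3/4 figure in the description) and that each load is consumed by the next wait that follows it. The function name `hidden_iterations` and the constant `WAIT_POS` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction


def hidden_iterations(issue_pos, wait_pos):
    """Fraction of one loop iteration that elapses between issuing a load
    and the next wait that consumes it.

    Positions are fractions of the loop body in [0, 1). A load issued
    exactly at the wait position is treated as issued right after that
    wait, so it is consumed by the wait in the following iteration.
    """
    issue_pos = Fraction(issue_pos)
    wait_pos = Fraction(wait_pos)
    gap = (wait_pos - issue_pos) % 1
    return gap if gap != 0 else Fraction(1)


# Assumption: the wait is 3/4 of the way through the loop body.
WAIT_POS = Fraction(3, 4)

# Old schedule: load issued at the top of the loop (position 0).
before = hidden_iterations(0, WAIT_POS)

# New schedule: load issued right after the wait.
after = hidden_iterations(WAIT_POS, WAIT_POS)

print(f"top of loop:      {before} of an iteration hidden")
print(f"right after wait: {after} of an iteration hidden")
```

Under these assumptions, the model reproduces the two figures from the description: three quarters of an iteration for the old placement and a full iteration for the new one. It says nothing about actual cycle counts, which depend on the hardware and the rest of the schedule.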

@guacamoleo guacamoleo marked this pull request as draft April 16, 2026 19:38
@guacamoleo
Contributor Author

Converting to draft while I verify that gfx1250 tests pass.

@guacamoleo guacamoleo marked this pull request as ready for review April 17, 2026 19:25
@guacamoleo
Contributor Author

guacamoleo commented Apr 17, 2026

Ready for review; I've verified that the kernel tests pass, and that this change alone improves performance significantly.

@guacamoleo guacamoleo force-pushed the dtanner/f16_gemm_pipeline branch from 9ab90c1 to 849ac59 Compare April 17, 2026 19:45
@antiagainst antiagainst merged commit 0ee2ec2 into triton-lang:main Apr 18, 2026
15 of 18 checks passed
@antiagainst antiagainst deleted the dtanner/f16_gemm_pipeline branch April 18, 2026 06:03
bingyizh233 pushed a commit to bingyizh233/triton that referenced this pull request Apr 20, 2026