Skip to content

hipblaslt: add adaptive gemm exact tuning for tf32#2563

Closed
chengchingwen wants to merge 3 commits into
developfrom
users/chengchingwen/tf32_ag_tuning
Closed

hipblaslt: add adaptive gemm exact tuning for tf32#2563
chengchingwen wants to merge 3 commits into
developfrom
users/chengchingwen/tf32_ag_tuning

Conversation

@chengchingwen
Copy link
Copy Markdown
Contributor

Motivation

add adaptive gemm exact tuning for tf32

Technical Details

Test Plan

Test Result

Submission Checklist

AlexBrownAMD pushed a commit that referenced this pull request Nov 10, 2025
When building `hipDNN` from source it handles downloading and
configuring transitive dependencies, while an installed version requires
the consuming package to provide needed dependencies. In the plugin's
current form, the former is more convenient.
@b-shi b-shi force-pushed the users/chengchingwen/tf32_ag_tuning branch from 6a713b6 to bbb1048 Compare November 10, 2025 21:19
@math-ci
Copy link
Copy Markdown

math-ci Bot commented Nov 11, 2025

perfci run on commit bbb1048

math-ci run

@AndySu12
Copy link
Copy Markdown
Contributor

#2983

ammallya pushed a commit that referenced this pull request Feb 3, 2026
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 1e1ee75]
ammallya pushed a commit that referenced this pull request Feb 3, 2026
* Add stride validation to prevent segfault in blockscale GEMM

* run clang-format

* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* added stride length checking to more gemm examples in ckprofiler

* ran clang format

* added validation header and implement in core gemm operations

* remove ck_tile transpose and gemm stages from CI (#2646)

* update CK build instruction step 4 (#2563)

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

* Fixes to  "General 2D Reduction Kernel" (#2535) (#2656)

* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format

* enable aiter test_mha in daily CI (#2659)

* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)

* feat(copy_kernel): add basic copy kernel example with documentation

* docs(CHANGELOG): Updated changelog

* chore: performed clang format

* Update example/ck_tile/39_copy/copy_basic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fix(terminology): follow amd terms

* extract elementwise copy to a new kernel

* fix(copy_kernel): bug in verification

* add comments about vgpr usage

* lint and nits

* add notes and comments

* print hostTensor via stream

* print hostTensor via stream

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)

* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

* Fix num_byte calculations to use nhead_k for K & V size (#2653)

Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.

Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```

After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```

* [CK_TILE] FMHA BWD Decode Pipeline (#2643)

* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2

* fix (#2668)

* Optimize fmha fwd decode & prefill for gfx950 (#2641)

* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)

This reverts commit 327bf40.

* added batch stride checking to batched gemm ops in profiler

* removed batch stride validation

* removed batched stride validation again

* Update include/ck/library/utility/profiler_validation_common.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* refactor function names

* added gemm stride checking to more profiler gemm operations

* run clang format

* add stride checkign to 01 gemm example

* rename from profiler to validation common, used for examples and profiler

* build of ckProfiler success

* update file headers

---------

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com>
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 60320e9]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants