Streamk unit tests #3016

Merged
cgmillette merged 2 commits into ROCm:develop from cgmillette:streamk_unit_tests on Oct 15, 2025

Conversation

cgmillette (Collaborator) commented Oct 14, 2025

Proposed changes

Regression tests are failing on bf16, a known issue for streamK with atomic reduction. The error threshold calculations need to be adjusted for this case, so the extended tests are disabled for now.
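To illustrate why a threshold tuned for a single final rounding may not fit an atomic reduction, here is a minimal C++ sketch. It is not the CK test code: it assumes that each atomic add lands in a bf16 destination and is therefore rounded after every partial sum, and all function names (`round_to_bf16`, `reduce_rounding_each_add`, `reduce_rounding_once`, `abs_error`) are hypothetical helpers introduced for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round a float to the nearest bfloat16 value (round-to-nearest-even) and
// return it widened back to float. NaN/Inf handling is omitted for brevity.
inline float round_to_bf16(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const std::uint32_t lsb = (bits >> 16) & 1u; // lowest kept bit, for ties-to-even
    bits += 0x7FFFu + lsb;
    bits &= 0xFFFF0000u; // drop the 16 low mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Assumed model of an atomic reduction: every partial sum is added into a
// bf16 destination, so the running total is rounded after each add.
inline float reduce_rounding_each_add(float partial, int num_partials)
{
    assert(num_partials > 0);
    float acc = 0.0f;
    for(int i = 0; i < num_partials; ++i)
        acc = round_to_bf16(acc + round_to_bf16(partial));
    return acc;
}

// Reference model: accumulate in fp32 and round to bf16 once at the end.
inline float reduce_rounding_once(float partial, int num_partials)
{
    assert(num_partials > 0);
    float acc = 0.0f;
    for(int i = 0; i < num_partials; ++i)
        acc += partial;
    return round_to_bf16(acc);
}

// Absolute error of the per-add-rounding model against the exact product.
inline float abs_error(float partial, int num_partials)
{
    const float exact = partial * static_cast<float>(num_partials);
    return std::fabs(reduce_rounding_each_add(partial, num_partials) - exact);
}
```

As a deliberately exaggerated case, adding 1.0 three hundred times stalls at 256 under per-add rounding, because bf16 cannot represent 257 and the tie rounds back down, while rounding once at the end gives 300. Real stream-K tiles would see far fewer partial sums, but the sketch suggests that under this assumption the tolerance may need to account for the number of accumulations and not only the data type.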

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@cgmillette cgmillette reopened this Oct 14, 2025
@cgmillette cgmillette merged commit bde5f26 into ROCm:develop Oct 15, 2025
38 of 47 checks passed
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
Improve the grouped conv kernel name generation in CK Tile.

Remove unnecessary compilations.

Add min blocks per CU to invoker name.

Add more instances.

Better split-K handling in the template instantiation.

Add more instances.

Enable vector loads in grouped conv bwd weight kernels.

Add more instances.

[CK Tile] contraction multi d - kernel & example (#2901)

* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for multi dimensional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for  calculating the offsets from flatten batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

Update include path to break the remod's cyclic dep issue (#2978)

* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860)

* Add valid check batched gemm part1

* [CK_TILE] Add batched gemm kernel IsSupported func checks

* revert broken pre-commit hook changes

* revert broken pre-commit hook changes v2

* Clarify error messages

[CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955)

* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)

* Enable the adapted LDS B layout for Row-Major

* fix formatting

* Implement specialized col-major A LDS block descriptor

* Fix formatting

* Use VecLoadSize for AK1/BK1

* Fix some thread access pattern values

* Use GetVectorSizeA for A

* Fix formatting

* Add extra condition to avoid division by zero

* disable layout for wave32

* remove extra else

* fix formatting

* Fix formatting

* Rename one remaining TileDistributionEncodingPattern2D

* Use integer ceil division

* revert remod.py changes

* also revert utility.hpp

* use getA/BTileAccessPattern everywhere

* use integer_divide_ceil for AK0 too

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

Enable GMock and improve gtest configuration (#2976)

Our current cmake/gtest.cmake file does not enable gmock. Gmock is needed for matchers that are needed for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake:

* Enable BUILD_GMOCK by default (was previously disabled)
* Patch gtest-src/googlemock/CMakeLists.txt for broken include path.
* Add configuration to gmock if the target is used.

No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code.

[CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)

* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formatting issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

Fix: Handle JSON boolean values (pad_m, pad_n, pad_k and persistent) in gemm_instance_builder (#3008)

[CK_TILE] Switch into universal gemms for conv bwds (#2981)

* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments

Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)

This reverts commit d2bbca3.

update s_barrier's logic in gfx12 architecture (#3003)

change s_waitcnt's logic in gfx1250

change s_waitcnt's logic in gfx1250

update comment

fixing group id (#3002)

feat(grouped_gemm_multi_d): add support for bf16

test(grouped_gemm_multi_d): add unit test for bf16 support

Felix/opt sorting (#2902)

* merge felix/sorting
* opt moe sorting  (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

Disable streamk extended regression tests for now (#3016)

re-enable clang-format by default (#3030)

* re-enable clang-format by default

* fix clang format

use branch develop to test hipTensor (#3034)

docs: add quant mode comparison to readme (#3032)

* docs: add quant mode comparison to readme

* Update example/ck_tile/38_block_scale_gemm/README.md

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

Fix compiler noreturn error for ck tile permute test (#3036)

Enable storelse for fmha_fwd_trload kernel (#3023)

Take universal GEMM pipeline into use for grouped convolutions.

Add more instances.

Disqualify benchmarking results from kernels that do not pass validation.

Add more instances.

Add double smem buffer instances.

Rename conv factory.

Add grouped conv fwd direction profiling into CK Tile profiler.

Fix fwd layouts.

Add missing header.

Fix validation.

Improve profiler output.

Create runner script to run CK and CK Tile profilers.

Add back BF16 instances.

Optimize calculation of the CPU reference.

Improve profiler output.

Parallel compilation of the CK Tile instances.

Improve benchmarking and analysis script.

Add new kernel instances.

Small script improvements.

Script to convert MIOpenDriver commands to CK profiler input.

Benchmarking script improvements.

More script improvements.

Fix compilation issues on MI300.

Add more gfx942 instances.

Plot a large set of benchmark results.

Print out aggregated statistics.

tmp save

tmp save

save work

next save
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
Skeleton for the ckTileProfiler.

WIP: CK Tile conv bwd profiler.

Added a placeholder conv bwd instance factory for CK Tile profiler.

Create invoker for the kernel and a factory for creating invokers.

Rename CK Tile grouped conv factory.

Add empty instance factory.

Fully functional CK Tile profiler.

Fix transferring data back to host for validation.