use branch develop to test hipTensor#3034

Merged
illsilin merged 1 commit into develop from lwpck-3977 on Oct 15, 2025

@illsilin illsilin commented Oct 15, 2025

Proposed changes

The hipTensor repo no longer has a "mainline" branch, which broke our daily tests. We need to switch to the "develop" branch. I verified that the tests now pass:

[2025-10-15T20:37:03.282Z] Test project /home/jenkins/workspace/IBS_composable_kernel_lwpck-3977/hipTensor-develop/build
[2025-10-15T20:37:03.282Z] Start 1: logger_test
[2025-10-15T20:37:04.730Z] 1/54 Test #1: logger_test ................................ Passed 1.17 sec
[2025-10-15T20:37:04.730Z] Start 2: yaml_test
[2025-10-15T20:37:05.640Z] 2/54 Test #2: yaml_test .................................. Passed 1.18 sec
[2025-10-15T20:37:05.640Z] Start 3: elementwise_op_test
[2025-10-15T20:37:07.089Z] 3/54 Test #3: elementwise_op_test ........................ Passed 1.46 sec
[2025-10-15T20:37:07.089Z] Start 4: util_test
[2025-10-15T20:37:08.538Z] 4/54 Test #4: util_test .................................. Passed 1.15 sec
[2025-10-15T20:37:08.538Z] Start 5: hiptensor_options_test
[2025-10-15T20:37:09.448Z] 5/54 Test #5: hiptensor_options_test ..................... Passed 1.18 sec
[2025-10-15T20:37:09.448Z] Start 6: bilinear_contraction_test_m1n1k1
[2025-10-15T20:37:26.697Z] 6/54 Test #6: bilinear_contraction_test_m1n1k1 ........... Passed 14.68 sec
[2025-10-15T20:37:26.697Z] Start 7: bilinear_contraction_test_m2n2k2
[2025-10-15T20:37:38.712Z] 7/54 Test #7: bilinear_contraction_test_m2n2k2 ........... Passed 13.45 sec
[2025-10-15T20:37:38.712Z] Start 8: bilinear_contraction_test_m3n3k3
[2025-10-15T20:37:53.124Z] 8/54 Test #8: bilinear_contraction_test_m3n3k3 ........... Passed 13.97 sec
[2025-10-15T20:37:53.124Z] Start 9: bilinear_contraction_test_m4n4k4
[2025-10-15T20:38:07.534Z] 9/54 Test #9: bilinear_contraction_test_m4n4k4 ........... Passed 14.35 sec
[2025-10-15T20:38:07.534Z] Start 10: bilinear_contraction_test_m5n5k5
[2025-10-15T20:38:22.315Z] 10/54 Test #10: bilinear_contraction_test_m5n5k5 ........... Passed 15.68 sec
[2025-10-15T20:38:22.315Z] Start 11: bilinear_contraction_test_m6n6k6
[2025-10-15T20:38:46.953Z] 11/54 Test #11: bilinear_contraction_test_m6n6k6 ........... Passed 25.21 sec
[2025-10-15T20:38:46.953Z] Start 12: complex_bilinear_contraction_test_m1n1k1
[2025-10-15T20:38:55.250Z] 12/54 Test #12: complex_bilinear_contraction_test_m1n1k1 ... Passed 7.16 sec
[2025-10-15T20:38:55.250Z] Start 13: complex_bilinear_contraction_test_m2n2k2
[2025-10-15T20:39:00.790Z] 13/54 Test #13: complex_bilinear_contraction_test_m2n2k2 ... Passed 6.48 sec
[2025-10-15T20:39:00.790Z] Start 14: complex_bilinear_contraction_test_m3n3k3
[2025-10-15T20:39:07.583Z] 14/54 Test #14: complex_bilinear_contraction_test_m3n3k3 ... Passed 6.80 sec
[2025-10-15T20:39:07.583Z] Start 15: complex_bilinear_contraction_test_m4n4k4
[2025-10-15T20:39:14.374Z] 15/54 Test #15: complex_bilinear_contraction_test_m4n4k4 ... Passed 7.25 sec
[2025-10-15T20:39:14.374Z] Start 16: complex_bilinear_contraction_test_m5n5k5
[2025-10-15T20:39:24.340Z] 16/54 Test #16: complex_bilinear_contraction_test_m5n5k5 ... Passed 8.37 sec
[2025-10-15T20:39:24.340Z] Start 17: complex_bilinear_contraction_test_m6n6k6
[2025-10-15T20:39:41.565Z] 17/54 Test #17: complex_bilinear_contraction_test_m6n6k6 ... Passed 17.96 sec
[2025-10-15T20:39:41.565Z] Start 18: scale_contraction_test_m1n1k1
[2025-10-15T20:39:55.959Z] 18/54 Test #18: scale_contraction_test_m1n1k1 .............. Passed 14.19 sec
[2025-10-15T20:39:55.959Z] Start 19: scale_contraction_test_m2n2k2
[2025-10-15T20:40:10.365Z] 19/54 Test #19: scale_contraction_test_m2n2k2 .............. Passed 13.56 sec
[2025-10-15T20:40:10.365Z] Start 20: scale_contraction_test_m3n3k3
[2025-10-15T20:40:22.361Z] 20/54 Test #20: scale_contraction_test_m3n3k3 .............. Passed 13.76 sec
[2025-10-15T20:40:22.361Z] Start 21: scale_contraction_test_m4n4k4
[2025-10-15T20:40:36.749Z] 21/54 Test #21: scale_contraction_test_m4n4k4 .............. Passed 14.26 sec
[2025-10-15T20:40:36.749Z] Start 22: scale_contraction_test_m5n5k5
[2025-10-15T20:40:53.982Z] 22/54 Test #22: scale_contraction_test_m5n5k5 .............. Passed 15.64 sec
[2025-10-15T20:40:53.982Z] Start 23: scale_contraction_test_m6n6k6
[2025-10-15T20:41:18.605Z] 23/54 Test #23: scale_contraction_test_m6n6k6 .............. Passed 24.88 sec
[2025-10-15T20:41:18.605Z] Start 24: complex_scale_contraction_test_m1n1k1
[2025-10-15T20:41:25.395Z] 24/54 Test #24: complex_scale_contraction_test_m1n1k1 ...... Passed 7.27 sec
[2025-10-15T20:41:25.395Z] Start 25: complex_scale_contraction_test_m2n2k2
[2025-10-15T20:41:30.939Z] 25/54 Test #25: complex_scale_contraction_test_m2n2k2 ...... Passed 6.49 sec
[2025-10-15T20:41:30.939Z] Start 26: complex_scale_contraction_test_m3n3k3
[2025-10-15T20:41:37.726Z] 26/54 Test #26: complex_scale_contraction_test_m3n3k3 ...... Passed 6.61 sec
[2025-10-15T20:41:37.726Z] Start 27: complex_scale_contraction_test_m4n4k4
[2025-10-15T20:41:44.656Z] 27/54 Test #27: complex_scale_contraction_test_m4n4k4 ...... Passed 7.02 sec
[2025-10-15T20:41:44.656Z] Start 28: complex_scale_contraction_test_m5n5k5
[2025-10-15T20:41:52.963Z] 28/54 Test #28: complex_scale_contraction_test_m5n5k5 ...... Passed 8.30 sec
[2025-10-15T20:41:52.963Z] Start 29: complex_scale_contraction_test_m6n6k6
[2025-10-15T20:42:04.907Z] 29/54 Test #29: complex_scale_contraction_test_m6n6k6 ...... Passed 11.83 sec
[2025-10-15T20:42:04.907Z] Start 30: contraction_mode_test
[2025-10-15T20:42:06.350Z] 30/54 Test #30: contraction_mode_test ...................... Passed 1.80 sec
[2025-10-15T20:42:06.350Z] Start 31: plan_cache_test
[2025-10-15T20:43:31.874Z] 31/54 Test #31: plan_cache_test ............................ Passed 79.52 sec
[2025-10-15T20:43:31.874Z] Start 32: elementwise_cpu_test
[2025-10-15T20:43:31.874Z] 32/54 Test #32: elementwise_cpu_test ....................... Passed 1.13 sec
[2025-10-15T20:43:31.874Z] Start 33: rank2_elementwise_permute_test
[2025-10-15T20:43:31.874Z] 33/54 Test #33: rank2_elementwise_permute_test ............. Passed 1.55 sec
[2025-10-15T20:43:31.874Z] Start 34: rank2_elementwise_binary_op_test
[2025-10-15T20:43:31.874Z] 34/54 Test #34: rank2_elementwise_binary_op_test ........... Passed 1.52 sec
[2025-10-15T20:43:31.875Z] Start 35: rank2_elementwise_trinary_op_test
[2025-10-15T20:43:31.875Z] 35/54 Test #35: rank2_elementwise_trinary_op_test .......... Passed 1.57 sec
[2025-10-15T20:43:31.875Z] Start 36: rank3_elementwise_permute_test
[2025-10-15T20:43:33.927Z] 36/54 Test #36: rank3_elementwise_permute_test ............. Passed 1.73 sec
[2025-10-15T20:43:33.927Z] Start 37: rank3_elementwise_binary_op_test
[2025-10-15T20:43:35.376Z] 37/54 Test #37: rank3_elementwise_binary_op_test ........... Passed 1.55 sec
[2025-10-15T20:43:35.376Z] Start 38: rank3_elementwise_trinary_op_test
[2025-10-15T20:43:36.822Z] 38/54 Test #38: rank3_elementwise_trinary_op_test .......... Passed 1.63 sec
[2025-10-15T20:43:36.822Z] Start 39: rank4_elementwise_permute_test
[2025-10-15T20:43:39.574Z] 39/54 Test #39: rank4_elementwise_permute_test ............. Passed 2.59 sec
[2025-10-15T20:43:39.574Z] Start 40: rank4_elementwise_binary_op_test
[2025-10-15T20:43:41.630Z] 40/54 Test #40: rank4_elementwise_binary_op_test ........... Passed 2.34 sec
[2025-10-15T20:43:41.630Z] Start 41: rank4_elementwise_trinary_op_test
[2025-10-15T20:43:44.383Z] 41/54 Test #41: rank4_elementwise_trinary_op_test .......... Passed 2.59 sec
[2025-10-15T20:43:44.383Z] Start 42: rank5_elementwise_permute_test
[2025-10-15T20:43:54.372Z] 42/54 Test #42: rank5_elementwise_permute_test ............. Passed 9.20 sec
[2025-10-15T20:43:54.372Z] Start 43: rank5_elementwise_binary_op_test
[2025-10-15T20:44:02.663Z] 43/54 Test #43: rank5_elementwise_binary_op_test ........... Passed 9.12 sec
[2025-10-15T20:44:02.663Z] Start 44: rank5_elementwise_trinary_op_test
[2025-10-15T20:44:14.687Z] 44/54 Test #44: rank5_elementwise_trinary_op_test .......... Passed 12.03 sec
[2025-10-15T20:44:14.687Z] Start 45: rank6_elementwise_permute_test
[2025-10-15T20:45:04.766Z] 45/54 Test #45: rank6_elementwise_permute_test ............. Passed 48.37 sec
[2025-10-15T20:45:04.767Z] Start 46: rank6_elementwise_binary_op_test
[2025-10-15T20:45:54.854Z] 46/54 Test #46: rank6_elementwise_binary_op_test ........... Passed 50.11 sec
[2025-10-15T20:45:54.854Z] Start 47: rank6_elementwise_trinary_op_test
[2025-10-15T20:47:06.299Z] 47/54 Test #47: rank6_elementwise_trinary_op_test .......... Passed 66.57 sec
[2025-10-15T20:47:06.299Z] Start 48: reduction_cpu_impl_test
[2025-10-15T20:47:06.299Z] 48/54 Test #48: reduction_cpu_impl_test .................... Passed 1.17 sec
[2025-10-15T20:47:06.299Z] Start 49: rank1_reduction_test
[2025-10-15T20:47:06.299Z] 49/54 Test #49: rank1_reduction_test ....................... Passed 1.48 sec
[2025-10-15T20:47:06.299Z] Start 50: rank2_reduction_test
[2025-10-15T20:47:06.299Z] 50/54 Test #50: rank2_reduction_test ....................... Passed 2.27 sec
[2025-10-15T20:47:06.299Z] Start 51: rank3_reduction_test
[2025-10-15T20:47:09.078Z] 51/54 Test #51: rank3_reduction_test ....................... Passed 4.23 sec
[2025-10-15T20:47:09.078Z] Start 52: rank4_reduction_test
[2025-10-15T20:47:17.358Z] 52/54 Test #52: rank4_reduction_test ....................... Passed 7.94 sec
[2025-10-15T20:47:17.358Z] Start 53: rank5_reduction_test
[2025-10-15T20:47:34.686Z] 53/54 Test #53: rank5_reduction_test ....................... Passed 15.20 sec
[2025-10-15T20:47:34.686Z] Start 54: rank6_reduction_test
[2025-10-15T20:48:04.103Z] 54/54 Test #54: rank6_reduction_test ....................... Passed 28.28 sec
[2025-10-15T20:48:04.103Z]
[2025-10-15T20:48:04.103Z] 100% tests passed, 0 tests failed out of 54
[2025-10-15T20:48:04.103Z]
[2025-10-15T20:48:04.103Z] Total Test time (real) = 657.05 sec
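
A minimal sketch of how a CI job could guard against this kind of upstream branch rename: it lists the branches on the remote and picks the first preferred one that exists. The helper names `parse_remote_heads`, `pick_branch`, and `resolve_branch`, and the preference order, are hypothetical and are not taken from the Jenkins configuration changed by this PR.

```python
import subprocess


def parse_remote_heads(ls_remote_output: str) -> set[str]:
    """Extract branch names from `git ls-remote --heads` output.

    Each line has the form "<sha>\trefs/heads/<branch>".
    """
    prefix = "refs/heads/"
    heads = set()
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith(prefix):
            heads.add(parts[1][len(prefix):])
    return heads


def pick_branch(available: set[str], preferred: list[str]) -> str:
    """Return the first preferred branch that exists on the remote."""
    for name in preferred:
        if name in available:
            return name
    raise LookupError(f"none of {preferred} exist on the remote")


def resolve_branch(url: str, preferred: list[str]) -> str:
    """Query the remote (needs git and network access) and choose a branch."""
    out = subprocess.run(
        ["git", "ls-remote", "--heads", url],
        check=True, capture_output=True, text=True,
    ).stdout
    return pick_branch(parse_remote_heads(out), preferred)
```

A job could call `resolve_branch(url, ["develop", "mainline"])` with the URL of the hipTensor remote it clones, so that a missing branch fails with a clear message before the build starts.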

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged
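
The 30-second threshold in the checklist could be checked against a CTest log. The sketch below scans result lines in the format shown in this PR's log and reports the tests that exceed a threshold. The sample lines are copied from the hipTensor log above purely as input data; the function name `slow_tests` and the regular expression are illustrative assumptions, not part of the repository's tooling.

```python
import re

# Matches CTest result lines such as:
#   "31/54 Test #31: plan_cache_test ........ Passed 79.52 sec"
# A leading "[timestamp]" from the CI log is ignored because the pattern
# is searched for anywhere in the line. The status is captured as any
# non-space token; only "Passed" appears in the log shown in this PR.
RESULT_RE = re.compile(
    r"Test\s+#(?P<num>\d+):\s+(?P<name>\S+)\s+\.+\s*"
    r"(?P<status>\S+)\s+(?P<secs>\d+(?:\.\d+)?)\s+sec"
)


def slow_tests(log_text: str, threshold_sec: float = 30.0) -> list[tuple[str, float]]:
    """Return (name, seconds) for tests slower than the threshold, slowest first."""
    found = []
    for line in log_text.splitlines():
        m = RESULT_RE.search(line)
        if m and float(m["secs"]) > threshold_sec:
            found.append((m["name"], float(m["secs"])))
    return sorted(found, key=lambda item: item[1], reverse=True)


SAMPLE = """\
[2025-10-15T20:42:06.350Z] 30/54 Test #30: contraction_mode_test ...................... Passed 1.80 sec
[2025-10-15T20:43:31.874Z] 31/54 Test #31: plan_cache_test ............................ Passed 79.52 sec
[2025-10-15T20:45:04.766Z] 45/54 Test #45: rank6_elementwise_permute_test ............. Passed 48.37 sec
"""

for name, secs in slow_tests(SAMPLE):
    print(f"{name}: {secs:.2f} sec")
```

"Start N: ..." lines carry no timing and do not match the pattern, so they are skipped.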

@illsilin illsilin merged commit 87d0a3a into develop Oct 15, 2025
41 of 42 checks passed
@illsilin illsilin deleted the lwpck-3977 branch October 15, 2025 22:40
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
Improve the grouped conv kernel name generation in CK Tile.

Remove unnecessary compilations.

Add min blocks per CU to invoker name.

Add more instances.

Better split-K handling in the template instantiation.

Add more instances.

Enable vector loads in grouped conv bwd weight kernels.

Add more instances.

[CK Tile] contraction multi d - kernel & example (#2901)

* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for multi dimensional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for  calculating the offsets from flatten batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

Update include path to break the remod's cyclic dep issue (#2978)

* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860)

* Add valid check batched gemm part1

* [CK_TILE] Add batched gemm kernel IsSupported func checks

* revert broken pre-commit hook changes

* revert broken pre-commit hook changes v2

* Clarify error messages

[CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955)

* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)

* Enable the adapted LDS B layout for Row-Major

* fix formatting

* Implement specialized col-major A LDS block descriptor

* Fix formatting

* Use VecLoadSize for AK1/BK1

* Fix some thread access pattern values

* Use GetVectorSizeA for A

* Fix formatting

* Add extra condition to avoid division by zero

* disable layout for wave32

* remove extra else

* fix formatting

* Fix formatting

* Rename one remaining TileDistributionEncodingPattern2D

* Use integer ceil division

* revert remod.py changes

* also revert utility.hpp

* use getA/BTileAccessPattern everywhere

* use integer_divide_ceil for AK0 too

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

Enable GMock and improve gtest configuration (#2976)

Our current cmake/gtest.cmake file does not enable gmock. Gmock is needed for matchers that are needed for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake:

* Enable BUILD_GMOCK by default (was previously disabled)
* Patch gtest-src/googlemock/CMakeLists.txt for broken include path.
* Add configuration to gmock if the target is used.

No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code.

[CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)

* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formatting issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

Fix: Handle JSON boolean values (pad_m, pad_n, pad_k and persistent) in gemm_instance_builder (#3008)

[CK_TILE] Switch into universal gemms for conv bwds (#2981)

* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments

Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)

This reverts commit d2bbca3.

update s_barrier's logic in gfx12 architecture (#3003)

change s_waitcnt's logic in gfx1250

change s_waitcnt's logic in gfx1250

update comment

fixing group id (#3002)

feat(grouped_gemm_multi_d): add support for bf16

test(grouped_gemm_multi_d): add unit test for bf16 support

Felix/opt sorting (#2902)

* merge felix/sorting
* opt moe sorting  (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

Disable streamk extended regression tests for now (#3016)

re-enable clang-format by default (#3030)

* re-enable clang-format by default

* fix clang format

use branch develop to test hipTensor (#3034)

docs: add quant mode comparison to readme (#3032)

* docs: add quant mode comparison to readme

* Update example/ck_tile/38_block_scale_gemm/README.md

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

Fix compiler noreturn error for ck tile permute test (#3036)

Enable storelse for fmha_fwd_trload kernel (#3023)

Take universal GEMM pipeline into use for grouped convolutions.

Add more instances.

Disqualify benchmarking results from kernels that do not pass validation.

Add more instances.

Add double smem buffer instances.

Rename conv factory.

Add grouped conv fwd direction profiling into CK Tile profiler.

Fix fwd layouts.

Add missing header.

Fix validation.

Improve profiler output.

Create runner script to run CK and CK Tile profilers.

Add back BF16 instances.

Optimize calculation of the CPU reference.

Improve profiler output.

Parallel compilation of the CK Tile instances.

Improve benchmarking and analysis script.

Add new kernel instances.

Small script improvements.

Script to convert MIOpenDriver commands to CK profiler input.

Benchmarking script improvements.

More script improvements.

Fix compilation issues on MI300.

Add more gfx942 instances.

Plot a large set of benchmark results.

Print out aggregated statistics.

tmp save

tmp save

save work

next save
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
Skeleton for the ckTileProfiler.

WIP: CK Tile conv bwd profiler.

Added a placeholder conv bwd instance factory for CK Tile profiler.

Create invoker for the kernel and a factory for creating invokers.

Rename CK Tile grouped conv factory.

Add empty instance factory.

Fully functional CK Tile profiler.

Fix transferring data back to host for validation.
