Streamk unit tests #3016
Merged
fd36db6 to 6deaaa9
illsilin approved these changes Oct 14, 2025
AviralGoelAMD pushed a commit that referenced this pull request Oct 16, 2025
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
bartekxk pushed a commit that referenced this pull request Dec 5, 2025
Skeleton for the ckTileProfiler. WIP: CK Tile conv bwd profiler. Added a placeholder conv bwd instance factory for CK Tile profiler. Create invoker for the kernel and a factory for creating invokers. Rename CK Tile grouped conv factory. Add empty instance factory. Fully functional CK Tile profiler. Fix transferring data back to host for validation. Improve the grouped conv kernel name generation in CK Tile. Remove unnecessary compilations. Add min blocks per CU to invoker name. Add more instances. Better split-K handling in the template instantiation. Add more instances. Enable vector loads in grouped conv bwd weight kernels. Add more instances. [CK Tile] contraction multi d - kernel & example (#2901) * Initial commit. create batched_contraction_kernel file * initial problem definition * implement initial example to launch kernel * add universal gemm to contraction. initial phase * complete implementation for special case all Dims are 1 and no Ds * clean code * initial changes to support multi dimensional G * more progress in implementing multiple G * tmp commit * manage dynamic NumDimG in kernel * improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit * implement the example for general Multi dimension G M N K and test different reference calculation algorithms * 2 functions for reference using multi dimensional and flat indexing * clean the code for muti dimentional G, M, N, K contraction and add some logs * Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E * some cleaning on kernel * clean the code for calculating the offsets from flatten batch number * Start adding MultiD support to kernel and example * more changes to manage multi D in kernel and example * manage passing multi d to kernel and testing. * complete multi D support in kernel. 
modify example code to support it * Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning * Minor fix * Generalize example code for variable NumD tensors and apply cleanup based on review feedback * Refactored code and addressed review feedback * refactoring, cleaning, add documents, in kernel side and example codes * Optimize batch offset calculation in kernel * Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Update include path to break the remod's cyclic dep issue (#2978) * Update include path to break the cyclic dep issue * Use ck_tile::permute_vectors_i4x4_b in tile engine --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860) * Add valid check batched gemm part1 * [CK_TILE] Add batched gemm kernel IsSupported func checks * revert broken pre-commit hook changes * revert broken pre-commit hook changes v2 * Clarify error messages [CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955) * First checkpoint * Second checkpoint - hot loop scheduler * Third checkpoint - init main operator * Fourth checkpoint - main loop ready * Fifth checkpoint - main loop fix * Sixth checkpoint - ReadWritecompFunc * Seventh checkpoint - Tail finished * [CK_TILE] Blockwise gemm pipeline v5 complete * Working * Working fixes 2 * Rename v5 to v77 temporarily * Data type adjustment * Data type adjustment 2 * [CK_TILE] Blockwise Gemm pipeline v5 add tests * [CK_TILE] Fix calculation error * TEMP: check pipeline * Fix name to V6 * naming and documentation changes * WIP dump * Try fixing v1 * Failing tests v5 * Debugging * Changes v2 * F16 tests working great * Working BlockwiseGemmPipelineV5 as V6 * Cleanup and format * Merging changes part1 * [CK_TILE] Blockwise Gemm 
Pipeline Comp V5/V6 * Remove commented code * Fix gfx950 build issues * Fix file formatting * Review changes, more concat info, add bf16 bf8 tests * Fix formatting * Add bf16 and bf8 tests --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com> [CK_TILE] Non-K Major from old CK to CK-Tile (#2442) * Enable the adapted LDS B layout for Row-Major * fix formatting * Implement specialized col-major A LDS block descriptor * Fix formatting * Use VecLoadSize for AK1/BK1 * Fix some thread access pattern values * Use GetVectorSizeA for A * Fix formatting * Add extra condition to avoid division by zero * disable layout for wave32 * remove extra else * fix formatting * Fix formatting * Rename one remaining TileDistributionEncodingPattern2D * Use integer ceil division * revert remod.py changes * also revert utility.hpp * use getA/BTileAccessPattern everywhere * use integer_divide_ceil for AK0 too --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com> Enable GMock and improve gtest configuration (#2976) Our current cmake/gtest.cmake file does not enable gmock. Gmock is needed for matchers that are needed for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake: * Enable BUILD_GMOCK by default (was previously disabled) * Patch gtest-src/googlemock/CMakeLists.txt for broken include path. * Add configuration to gmock if the target is used. No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code. 
[CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540) * [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm * Update rmsnorm host reference * Update tree reduction of rmsnorm for reference host * Fix cross warp for m > 1 cases * Add RMSNorm model selectable option for host reference * Fix save_unquant cases * Update reference rmsnorm forward function to use enum for model sensitivity * Update reference rmsnorm calculation for model sensitivity * Fix m warp for layernorm * Adjust parameter of reference for twoPass * Fix clang format * Run clang-format-overwrite.sh to fix formating issue * fix clang format --------- Co-authored-by: MHYang <mengyang@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Fix: Handle JSON boolean values (pad_m, pad_n, pad_k and persistent) in gemm_instance_builder (#3008) [CK_TILE] Switch into universal gemms for conv bwds (#2981) * switch into universal gemms for conv bwds * some fixes and support universal gemm in conv fwd * add reviewer comments Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017) This reverts commit d2bbca3. 
update s_barrier's logic in gfx12 architecture (#3003) change s_waitcnt's logic in gfx1250 change s_waitcnt's logic in gfx1250 update comment fixing group id (#3002) feat(grouped_gemm_multi_d): add support for bf16 test(grouped_gemm_multi_d): add unit test for bf16 support Felix/opt sorting (#2902) * merge felix/sorting * opt moe sorting (#2822) * opt moe storing for 2k --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: coderfeli <coderfeli@163.com> Disable streamk extended regression tests for now (#3016) re-enable clang-format by default (#3030) * re-enable clang-format by default * fix clang format use branch develop to test hipTensor (#3034) docs: add quant mode comparison to readme (#3032) * docs: add quant mode comparison to readme * Update example/ck_tile/38_block_scale_gemm/README.md Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> --------- Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Fix compiler noreturn error for ck tile permute test (#3036) Enable storelse for fmha_fwd_trload kernel (#3023) Take universal GEMM pipeline into use for grouped convolutions. Add more instances. Disqualify benchmarking results from kernels that do not pass validation. Add more instances. Add double smem buffer instances. Rename conv factory. Add grouped conv fwd direction profiling into CK Tile profiler. Fix fwd layouts. Add missing header. Fix validation. Improve profiler output. Create runner script to runs CK and CK Tile profilers. Add back BF16 instances. Optimize calculation of the CPU reference. Improve profiler output. Parallel compilation of the CK Tile instances. Improve benchmarking and analysis script. Add new kernel instances. Small script improvements. Script to convert MIOpenDriver commands to CK profiler input. Benchmarking script improvements. More script improvements. Fix compilation issues on MI300. Add more gfx942 instances. 
Plot a large set of benchmark results. Print out aggregated statistics. tmp save tmp save save work next save
Proposed changes
The extended regression tests are failing for bf16, which is a known issue for streamK with atomic reduction. The error threshold calculation needs to be adjusted for this case, so this PR disables the extended tests for now.
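To illustrate why a bf16 atomic reduction can exceed a tolerance sized for a single rounding, here is a minimal sketch. It is not the CK test code: the helper names, the rounding model (one bf16 rounding after every atomic add), and the threshold formula are all assumptions made for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Round a finite float to the nearest bf16 value (ties to even), returned as float.
// NaN and infinity are not handled in this sketch.
inline float round_to_bf16(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const std::uint32_t lsb = (bits >> 16) & 1u; // lowest bit that survives truncation
    bits += 0x7FFFu + lsb;                       // round to nearest, ties to even
    bits &= 0xFFFF0000u;                         // keep the upper 16 bits only
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// Assumed model of an atomic reduction: every partial result is added to the
// bf16 output element and the sum is rounded back to bf16 after each add.
inline float accumulate_bf16_atomic(const std::vector<float>& partials)
{
    float acc = 0.0f;
    for(float p : partials)
        acc = round_to_bf16(acc + round_to_bf16(p));
    return acc;
}

// Reference model: accumulate in fp32 and round to bf16 once at the end.
inline float accumulate_fp32_then_round(const std::vector<float>& partials)
{
    float acc = 0.0f;
    for(float p : partials)
        acc += p;
    return round_to_bf16(acc);
}

inline float relative_error(float value, float reference)
{
    return std::fabs(value - reference) / std::fabs(reference);
}

// Hypothetical threshold: one bf16 unit roundoff (2^-8) per accumulation step.
// This is an illustration, not the formula used by the CK tests.
inline float hypothetical_rel_threshold(int num_partials)
{
    return std::ldexp(1.0f, -8) * static_cast<float>(num_partials);
}
```

With the partials {256, 1, 1, 1, 1}, every `+1` is lost in the bf16 model because the bf16 spacing at 256 is 2, so the atomic model returns 256 while the fp32 reference returns 260. The relative error (about 0.0154) is above a single-rounding tolerance of 2^-8 but below the hypothetical threshold scaled by the five accumulation steps.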
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
clang-format on all changed files
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered.