Merge develop into master by asroy · Pull Request #43 · ROCm/composable_kernel

asroy · 2021-10-21T01:27:21Z

No description provided.

* Squashed 'src/composable_kernel/' content from commit f6edda6
  git-subtree-dir: src/composable_kernel
  git-subtree-split: f6edda6
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* Squashed 'src/composable_kernel/' changes from f6edda6..5781adf
  5781adf Update develop (#5) (#6)
  97e6d51 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
  7b1ec41 refactor
  49c33aa refactor
  54b3e73 rename
  git-subtree-dir: src/composable_kernel
  git-subtree-split: 5781adf
* fix
* refactor
* remove online compilation from CK
* refactor
* fix
* add ctest
* add c-style pointer cast
* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
* fix clang warning suppression
* tidy
* suppress cppcheck
* fix enum issue
* revert changes to hip build
* fix kernel filename
* update CK build script
* rename
* rename
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
  Co-authored-by: JD <Jehandad.Khan@amd.com>
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* refactor
* refactor
* change cmakelist
* change ck common utility
* fix

Co-authored-by: JD <Jehandad.Khan@amd.com>

#1108)

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)
* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction
* Update to generic reduction C++ host interface layer to support dynamic generic reduction
* Update to remove tidy complaints in host interface layer
* Change the unary operator form from void op(T &x) to T op(T x)
* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)
* Use cppcheck-suppress to prevent some strange warnings
* Re-use operator [] and () for DynamicBuffer and update to depending codes
* Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers
* [performance] Remove un-needed local buffer initialization

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

* init StaticBufferV2
* clean
* adopt old output stage for staticBufferV2
* clean
* remove hack
* clean
* clean
* clean code
* move c_buffer alloc into blockwise gemm
* add adaptors for m/n_thread_data_on_grid
* adjust blockwise_gemm_xdlops
* reorder ops in GEMM hot loop

Co-authored-by: Chao Liu <chao.liu2@amd.com>

* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
* Fix with regard to implementing GetZeroVal() in both kernel and host
* Avoid convert to compType from dstDataType before writing the output value
* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
* Add CONSTANT decorator for descriptor read buffer
* Use get_thread_local_1d_id() for thread local Id
* Rename GetZeroVal() to GetReductionZeroVal() in the kernels
* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
* Occasional tiny simplification and update in the kernel files
* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
* Update to remove OpenCL tidy checking failures
* Update for better readability
* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>

* add new algorithm from v4r4r2
* program once issue
* add split k function
* redefine code
* add a matrix unmerge
* add b matrix unmerge k0
* trans a and b to gridegemm
* nhwc init
* no hacks and vector load
* add hacks
* modify some parameter
* fix tuning parameter for fp32
* fix tuning parameter for fp16
* start change gridwise k split
* init ok
* remove a b matrix k0mk1 desc in grid
* rewrite calculate gridsize
* add kbatch to CalculateBottomIndex
* remove some unused function
* add clear data function before call kernel
* out hacks
* in hacks
* rename device convolution file and function name
* modify kBatch value
* fix some tuning code
* start from v4r4 nhwc
* nhwc atomic is able to run
* just for fp32
* enable nchw atomic
* tweak
* tweak
* re-arrange gridwise gemm hot loop for wrw
* add wrw v4r5
* v4r4r5 fp16
* v4r4r4 fp16
* v4r4r2 fp16
* V4R4R4XDLNHWC fp16
* V4R4R2XDLATOMICNCHW fp16
* adjust for fp16
* input gridsize
* change kbatch to gridsize
* testing wrw
* clean up
* k_batch to gridsize
* fix bug
* wrw v4r4r4 kbatch change to grid size
* wrw v4r4r2 kbatch change to grid size
* after merge, change gridwise gemm v2r4
* change MakeCBlockClusterAdaptor
* other method use new gridwise gemm
* clean up
* change pad method to make_right_pad_transform
* kbatch out from transform function
* clean up and fix bug
* fix bug
* using function type reduce template parameters
* using auto replace define function type
* clean up

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

Chao Liu and others added 10 commits August 19, 2021 10:55

Merge remote-tracking branch 'origin/develop' into CK_upstream

df0d681

Merge remote-tracking branch 'origin/develop' into miopen_downstream-…

b725e3f

…dynamic_reduction_pr

Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dy…

f305beb

…namic_reduction_pr [MIOpen Downstream] Dynamic Reduction PR

Tweak GEMM kernel (#38)

b3e8d57

* add parameters
* tweak gemm
* tweak
* update conv
* update script
* adding bwd 1x1
* update script
* adding 1x1 bwd
* debugging bwd 1x1 failure
* update script
* update script
* test
* test v100
* clean up

bug fix (#39)

c301879

asroy merged commit 38a90b6 into master Oct 21, 2021

illsilin mentioned this pull request Oct 15, 2025

use branch develop to test hipTensor #3034

Merged

7 tasks

hyoon1 pushed a commit to hyoon1/composable_kernel that referenced this pull request Mar 19, 2026

Merge pull request ROCm#43 from eric-tc-wong/patch-1

04fb198

Update flash_attention.py

asroy merged 10 commits into master from develop


3 participants
