
[MIOpen Downstream] Fix Reduction Kernel #34

Merged: asroy merged 14 commits into develop from miopen_downstream-reduction_fix_generic on Oct 6, 2021


@qianfengz
Contributor

This PR includes all of the kernel-layer code from MIOpen's dynamic generic reduction implementation:

  1. kernel wrappers
  2. Grid-wise generic reduction kernels in four methods (Direct_ThreadWise, Direct_WarpWise, BlockWise, MultiBlock)
  3. Re-usable reduction functions (ThreadWise, WarpWise, BlockWise)
  4. Some additions and changes to the CK utilities
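
To make the difference between the thread-wise and block-wise reduction functions in item 3 concrete, here is a minimal host-side sketch in plain C++. It is not the code from this PR or from MIOpen: the names `thread_wise_reduce` and `block_wise_reduce` are hypothetical, the "threads" are simulated with ordinary loops, and the block size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model. These names are illustrative only and are
// not taken from MIOpen or the CK utilities.

// "Thread-wise" step: a single worker sequentially folds its own slice.
template <typename T, typename Op>
T thread_wise_reduce(const T* data, std::size_t count, T init, Op op)
{
    T acc = init;
    for(std::size_t i = 0; i < count; ++i)
        acc = op(acc, data[i]);
    return acc;
}

// "Block-wise" step: each of `block_size` simulated threads first folds a
// strided subset of the input, then the per-thread partial results are
// combined by a tree reduction that halves the active threads each round.
template <typename T, typename Op>
T block_wise_reduce(const std::vector<T>& data, std::size_t block_size, T init, Op op)
{
    // Assumption of this sketch: block_size is a non-zero power of two.
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);

    // Stands in for the shared per-block buffer a GPU kernel would use.
    std::vector<T> partial(block_size, init);
    for(std::size_t tid = 0; tid < block_size; ++tid)
        for(std::size_t i = tid; i < data.size(); i += block_size)
            partial[tid] = op(partial[tid], data[i]);

    // Tree reduction over the partial results.
    for(std::size_t stride = block_size / 2; stride > 0; stride /= 2)
        for(std::size_t tid = 0; tid < stride; ++tid)
            partial[tid] = op(partial[tid], partial[tid + stride]);

    return partial[0];
}
```

Both functions take the reduction operator as a parameter, so the same skeleton covers sum, max, and similar operations, provided `init` is the identity element of the operator. On a real GPU the inner loops over `tid` would run in parallel with a barrier between tree rounds; that synchronization is omitted here because the simulation is sequential.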

@asroy asroy changed the title from "Miopen downstream reduction fix generic" to "[MIOpen Downstream] Fix Reduction" on Sep 22, 2021
@asroy asroy changed the title from "[MIOpen Downstream] Fix Reduction" to "[MIOpen Downstream] Fix Reduction Kernel" on Sep 22, 2021
@qianfengz qianfengz requested a review from asroy September 23, 2021 06:20
@asroy
Contributor

asroy commented Sep 29, 2021

This PR syncs with MIOpen PR ROCm/MIOpen#1156.

@qianfengz
Contributor Author

Just pushed two commits from MIOpen/reduction_fix_generic. This should be the final synchronization from reduction_fix_generic.

asroy pushed a commit that referenced this pull request Oct 6, 2021
* create files for xdlops

* working on blockwise_gemm_xdlops

* add KReduction

* add m/n repeats

* add 2x2 pipeline

* added 128x128 wavegemm

* use StaticBuffer of vector_type

* break vector type to blk_size

* add kpack into xdlops_gemm and blockwise_gemm

* abroadcast only

* add fp32 mfma instructions

* adding fp16 mfma

* pack half4_t

* rename kperwave to kpack

* add 32x32x8fp16

* add fp16 mfma

* clean code

* clean code

* V4r4 xdlops kpack (#35)

* add kpack with incorrect results

* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2

* add 1x1 kernel

* add gridwise_gemm_v2 - single_buffer

* enabled dwordx4 for fp16

Co-authored-by: Chao Liu <chao.liu2@amd.com>

* refactor fwd-v4r4-xdlops

* add v4r4-nhwc-xdlops

* improve some perf of nhwc and nchw by tuning parameters, and change scheduling in gridwise-gemm loop

* tweak scheduling in gridwise gemm

* add v4r3 with a single output copy

* init commit: output with slice win

* adding sliceWin

* add multiple repeats pattern

* starting adding bwd-v4r1-xdlops

* use tuple as SrcBuffer

* adding bwd-data v4r1 nhwc xdlops

* fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()

* fix bug in host bwd-data conv

* initial implementation of bwd-data v4r1 nhwc xdlops

* add launch bound flags

* enable launch bound

* add m/nrepeat=4

* tweak bwd-data v4r1 nhwc xdlops

* added bwd-data v4r1 nhwc xdlops with output A and weight B

* add fwd-v4r4 nhwc xdlops, A input, B weight, C output

Co-authored-by: Chao Liu <chao.liu2@amd.com>
@asroy asroy merged commit b2dc55f into develop Oct 6, 2021
@qianfengz qianfengz deleted the miopen_downstream-reduction_fix_generic branch June 13, 2022 09:04