[rocFFT] Add ability to configure kernel per architecture #2450
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #2450 +/- ##
============================================
- Coverage 85.85% 52.93% -32.92%
============================================
Files 303 120 -183
Lines 21742 29438 +7696
Branches 0 3799 +3799
============================================
- Hits 18665 15582 -3083
- Misses 3077 12841 +9764
- Partials 0 1015 +1015
Is there any reason you didn't just use an empty string for the generic arch name?
I thought the generic arch name would make more sense than an empty string for describing the purpose here, but an empty string should also work.
An empty string would be annoying to handle in the getline() loop in stockham_gen.cpp.
malcolmroberts
left a comment
I mentioned device instead of arch, but we can probably just add the relevant data like CU count, or whatever else we want.
Are you thinking of replacing arch with more detailed data like CU count, LDS size, L1/L2/L3 cache, etc., or would this be in addition to arch? And do we want to tune with that level of detail?
…and solution map kernel builds.
…ct StockhamGeneratorSpecs.
…for aot and solution map kernel builds." This reverts commit ea12d45.
… construct StockhamGeneratorSpecs." This reverts commit 8467235.
This reverts commit a429b2b.
…avio-teixeira/rocm-libraries into function_pool_device_arch
* unify pipeline signature with existing example
* iwyu
* move stuff around in load-tile-transpose
* cleanups in batched transpose pipeline
* comments
* use same inputs size
* cleaner printf
* print host args
* use 64 block sides in the 37_transpose example
* roll back grid dimension size adjustment for 37_transpose example
* transpose grid for 37_transpose to unify with 35_batched_transpose
* unify grid computation logic
* make policy methods device only (since they are used only on device from the pipeline)
* more host/device attribute cleanups
* copy over problem
* move over pipeline and policy
* add switch to batched transpose api
* make the lds problem more similar to original problem
* factor out logic into traits
* factor out conditional compilation into trait parameter
* propagate pipeline to args
* unhardcode pipeline dispatch parameter
* refactor vector size
* put warp tile out of dispatch
* rename template parameter for trait
* rewrite vector size in terms of problem
* mark policy-internal struct variable as device
* factor out input distribution and thread access pattern from policies
* reword vector size
* use datatype across batched transpose pipelines, problems and kernel
* remove transpose traits from lds pipeline
* add padding to the lds pipeline *interface*
* add comment
* remove ck_tile example #37
* update cmakelists
* add test for new pipeline
* update batched transpose test
* roll back load_tile_transpose changes
* remove comments
* pack dispatch parameters into a config
* padM can be enabled
* adjust lds vector size to enable padding along N
* update test
* clean up logic
* swap m/n input vector size
* adjust perf test script
* sweep over C/W in perf test
* count both read and written bytes into bandwidth (x2 the number)
* clang-format
* widen size range for perf test
* remove 64k x 64k case; it's too large for index
* remove thread tile from dispatch
* Solve merge conflict
* fix compile
* modify the transpose
* solve the test error and clang format
* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.
* Less casting in AddClamp
* Conv+bias+clamp instances & profiler BF16
* Fix 3D instances & run just 1x for verification.
* :Run just once for verification conv fwd.
* ckProfiler conv fwd clampwq
* Remove exec bit & formatting
* Add support for MultiD for grouped conv fwd v3.
* Enable 2Lds.
* clean
* align instances
* align instances
* profiler fixes
* Fixes
* fix
* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```
solution :
======
Problem occured because config.time_kernel is false by default. if false, then no need to calculate perf, just print proper message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop
* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26]
Motivation
Add the ability to configure kernel parameters (workgroup size, threads-per-transform, length factorization, etc.) per architecture and precision.
Technical Details
Main changes are contained within kernel-generator.py and function_pool.h, where the concept of architecture has been added.
Regular entries in kernel-generator.py that do not specify an architecture are assigned to the new gfx_generic architecture. These generic entries should behave the same way configuration entries did before this change. gfx_generic also supports different LDS size configurations, similar to what is currently implemented.
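The fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in kernel-generator.py or function_pool.h: `KernelConfig`, `CONFIGS`, `select_config`, and all parameter values are hypothetical, and only the `gfx_generic` name comes from this PR.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

GENERIC_ARCH = "gfx_generic"  # name taken from the PR description


@dataclass(frozen=True)
class KernelConfig:
    """Hypothetical container for the tunable parameters named in the PR."""
    workgroup_size: int
    threads_per_transform: int
    factors: Tuple[int, ...]


# Hypothetical table keyed by (length, precision, arch).
# The numbers are made up for illustration, not tuned values.
CONFIGS = {
    (64, "single", GENERIC_ARCH): KernelConfig(128, 8, (8, 8)),
    (64, "single", "gfx942"): KernelConfig(256, 16, (4, 4, 4)),
    (64, "double", GENERIC_ARCH): KernelConfig(64, 8, (8, 8)),
}


def select_config(length: int, precision: str, arch: str) -> Optional[KernelConfig]:
    """Prefer an arch-specific entry; otherwise fall back to the generic one."""
    specific = CONFIGS.get((length, precision, arch))
    if specific is not None:
        return specific
    # Returns None when the length/precision has no entry at all.
    return CONFIGS.get((length, precision, GENERIC_ARCH))
```

With this layout, an architecture without its own entry (for example `select_config(64, "single", "gfx90a")`) resolves to the generic configuration, which matches the intent that untouched entries keep behaving as they did before.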
Test Plan
Current tests should pass without issues and no additional tests are required for now. Performance should also not be affected by the current changes. Once this PR is merged, new kernels will be added with per precision/architecture optimizations.
Test Result
All tests should pass without any issues.