[rocFFT] Add ability to configure kernel per architecture by eng-flavio-teixeira · Pull Request #2450 · ROCm/rocm-libraries

eng-flavio-teixeira · 2025-11-04T15:59:42Z

Motivation

Add the ability to configure kernel parameters (workgroup size, threads-per-transform, length factorization, etc.) per architecture and precision.

Technical Details

The main changes are in kernel-generator.py and function_pool.h, where the concept of architecture has been added.
Entries in kernel-generator.py that do not specify an architecture are assigned to a new gfx_generic architecture. These generic entries should behave the same way configuration entries did before this change. gfx_generic also supports different LDS size configurations, similar to what is currently implemented.
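The fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in function_pool.h or kernel-generator.py: `KernelKey`, `KernelConfig`, `lookup`, and all parameter values are hypothetical, and only the name `gfx_generic` comes from this PR.

```python
from typing import Dict, NamedTuple, Optional, Tuple

GENERIC_ARCH = "gfx_generic"  # name used in this PR; everything else below is hypothetical


class KernelKey(NamedTuple):
    length: int
    precision: str  # "single" or "double"
    arch: str       # e.g. "gfx942", or GENERIC_ARCH


class KernelConfig(NamedTuple):
    workgroup_size: int
    threads_per_transform: int
    factors: Tuple[int, ...]  # length factorization


def lookup(pool: Dict[KernelKey, KernelConfig],
           length: int, precision: str, arch: str) -> Optional[KernelConfig]:
    """Prefer an arch-specific entry and fall back to the generic one."""
    for candidate in (arch, GENERIC_ARCH):
        config = pool.get(KernelKey(length, precision, candidate))
        if config is not None:
            return config
    return None


# Illustrative values only; they are not tuned configurations.
pool = {
    KernelKey(256, "single", GENERIC_ARCH): KernelConfig(256, 64, (4, 4, 4, 4)),
    KernelKey(256, "single", "gfx942"): KernelConfig(128, 32, (16, 16)),
}

tuned = lookup(pool, 256, "single", "gfx942")     # arch-specific entry
fallback = lookup(pool, 256, "single", "gfx90a")  # no gfx90a entry, so generic is used
```

Under this scheme an architecture without a dedicated entry keeps getting the generic configuration, which is consistent with the expectation that existing behaviour and performance are unaffected.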

Test Plan

Current tests should pass without issues and no additional tests are required for now. Performance should also not be affected by the current changes. Once this PR is merged, new kernels will be added with per precision/architecture optimizations.

Test Result

All tests should pass without any issues.


codecov-commenter · 2025-11-05T23:55:34Z

Codecov Report

❌ Patch coverage is 56.47059% with 37 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...rojects/rocfft/library/src/include/function_pool.h	41.38%	15 Missing and 2 partials ⚠️
...ects/rocfft/library/src/include/function_map_key.h	51.61%	15 Missing ⚠️
projects/rocfft/shared/device_properties.h	66.67%	2 Missing and 3 partials ⚠️

❗ There is a different number of reports uploaded between BASE (edc55b2) and HEAD (5210b5c).

HEAD has 1 upload less than BASE

Flag	BASE (edc55b2)	HEAD (5210b5c)
hipSPARSE	1	0

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #2450       +/-   ##
============================================
- Coverage    85.85%   52.93%   -32.92%     
============================================
  Files          303      120      -183     
  Lines        21742    29438     +7696     
  Branches         0     3799     +3799     
============================================
- Hits         18665    15582     -3083     
- Misses        3077    12841     +9764     
- Partials         0     1015     +1015

Flag	Coverage Δ
hipSPARSE	`?`
rocFFT	`52.93% <56.47%> (?)`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...rocfft/library/src/device/generator/stockham_gen.h	`84.62% <100.00%> (ø)`
...rojects/rocfft/library/src/rtc_stockham_kernel.cpp	`81.61% <100.00%> (ø)`
projects/rocfft/shared/device_properties.h	`47.37% <66.67%> (ø)`
...ects/rocfft/library/src/include/function_map_key.h	`38.14% <51.61%> (ø)`
...rojects/rocfft/library/src/include/function_pool.h	`39.81% <41.38%> (ø)`

... and 418 files with indirect coverage changes



evetsso · 2025-12-02T21:57:43Z

Is there any reason you didn't just use empty string for the generic arch name?

eng-flavio-teixeira · 2025-12-02T22:49:00Z

> Is there any reason you didn't just use empty string for the generic arch name?

I thought the generic arch name would make more sense than an empty string for describing the purpose here, but an empty string should also work.

eng-flavio-teixeira · 2025-12-04T21:40:23Z

> Is there any reason you didn't just use empty string for the generic arch name?

An empty string would be annoying to handle in the getline() loop in stockham_gen.cpp.
gfx_generic could be removed from the supported_arch enum in config_arch.py, but parsing the line in stockham_gen is much easier when the field holds something other than an empty string.
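To illustrate the parsing concern, here is a minimal Python sketch. It assumes a whitespace-separated `<length> <precision> <arch>` line, which is an assumption made for illustration: the actual line format read by stockham_gen.cpp is not shown in this thread, and `parse_entry` and `parses` are hypothetical helpers.

```python
from typing import Tuple

GENERIC_ARCH = "gfx_generic"  # placeholder name from this PR


def parse_entry(line: str) -> Tuple[int, str, str]:
    """Parse one '<length> <precision> <arch>' line (hypothetical format)."""
    fields = line.split()  # whitespace splitting silently drops empty fields
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    length, precision, arch = fields
    return int(length), precision, arch


def parses(line: str) -> bool:
    """Return True if the line has the expected shape."""
    try:
        parse_entry(line)
        return True
    except ValueError:
        return False


with_placeholder = parses(f"256 single {GENERIC_ARCH}")  # all three fields present
with_empty_arch = parses("256 single ")                  # the arch field vanishes
```

With an explicit placeholder every line has the same number of fields, so the parser needs no special case for entries that target no particular architecture.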

malcolmroberts

I mentioned device instead of arch, but we can probably just add the relevant data like CU count, or whatever else we want.

eng-flavio-teixeira · 2025-12-05T21:49:51Z

> I mentioned device instead of arch, but we can probably just add the relevant data like CU count, or whatever else we want.

Are you thinking of replacing arch with more detailed data such as CU count, LDS size, and L1/L2/L3 cache sizes, or would this be in addition to arch? And do we want to tune at that level of detail?



- Add ability to have per precision kernel configuration entries.

218573d

github-actions Bot added the project: rocfft label Nov 4, 2025

assistant-librarian Bot added the organization: ROCm label Nov 4, 2025

- Change bytes_per_element calculation to match what was previously implemented.

e64821d

eng-flavio-teixeira added 2 commits November 24, 2025 16:58

- Add support for arch specific entries in the function pool.

2512301

- Further changes for arch support in the function pool.

15761c2

eng-flavio-teixeira marked this pull request as ready for review November 27, 2025 19:08

eng-flavio-teixeira requested a review from a team as a code owner November 27, 2025 19:08

eng-flavio-teixeira changed the title ~~Add ability to configure kernel per architecture~~ [rocFFT] Add ability to configure kernel per architecture Nov 27, 2025

eng-flavio-teixeira added 2 commits November 27, 2025 12:19

Merge commit '85ea1d36a78c3fca8485cbf6833497cfd18596ac' into function_pool_device_arch

91948d0

- CHANGELOG.

7217193

eng-flavio-teixeira requested a review from a team as a code owner November 27, 2025 19:26

github-actions Bot added the documentation label Nov 27, 2025

- Remove suffix from get_curr_gcn_arch_name().

326090e

af-ayala reviewed Dec 2, 2025


Comment thread projects/rocfft/shared/device_properties.h

evetsso self-requested a review December 3, 2025 16:30

- Address review suggestion.

8a9ee5b

malcolmroberts reviewed Dec 5, 2025


Comment thread projects/rocfft/library/src/device/generator/stockham_gen.cpp

malcolmroberts approved these changes Dec 5, 2025


evetsso approved these changes Dec 5, 2025


eng-flavio-teixeira added 3 commits December 8, 2025 11:05

- Add missing gcn_arch_name in kernel_name()

7b0d134

- Fix gcn_arch_name handling for partial-pass kernels.

7055d54

- Fix partial-pass kernel check.

554633e

af-ayala self-requested a review December 9, 2025 09:06

af-ayala approved these changes Dec 9, 2025


regan-amd approved these changes Dec 9, 2025


Comment thread projects/rocfft/library/src/device/kernel-generator.py

eng-flavio-teixeira added 14 commits December 10, 2025 09:59

- Add comment.

24db868

Merge branch 'develop' into function_pool_device_arch

21ba23e

Merge branch 'develop' into function_pool_device_arch

03fef0c

- No need to put device arch name in StockhamGeneratorSpecs for aot and solution map kernel builds.

ea12d45

- Remove no longer needed calls to get_curr_gcn_arch_name to construct StockhamGeneratorSpecs.

8467235

- Remove header include.

a429b2b

- Handle the case where FMKey is constructed without a visible HIP device.

6114b2f

Revert " - No need to put device arch name in StockhamGeneratorSpecs for aot and solution map kernel builds."

0980785

This reverts commit ea12d45.

Revert " - Remove no longer needed calls to get_curr_gcn_arch_name to construct StockhamGeneratorSpecs."

21e66ae

This reverts commit 8467235.

Revert " - Remove header include."

f1e5984

This reverts commit a429b2b.

Merge branch 'develop' into function_pool_device_arch

e22d8fa

- Take hip return into account.

36a30b7

Merge branch 'function_pool_device_arch' of https://github.com/eng-flavio-teixeira/rocm-libraries into function_pool_device_arch

19ca6d6

Merge branch 'develop' into function_pool_device_arch

5210b5c

eng-flavio-teixeira merged commit 3d60cc3 into ROCm:develop Dec 11, 2025
16 checks passed

eng-flavio-teixeira merged 25 commits into ROCm:develop from eng-flavio-teixeira:function_pool_device_arch

6 participants
