Skip to content

[rocFFT] Add ability to configure kernel per architecture#2450

Merged
eng-flavio-teixeira merged 25 commits into
ROCm:developfrom
eng-flavio-teixeira:function_pool_device_arch
Dec 11, 2025
Merged

[rocFFT] Add ability to configure kernel per architecture#2450
eng-flavio-teixeira merged 25 commits into
ROCm:developfrom
eng-flavio-teixeira:function_pool_device_arch

Conversation

@eng-flavio-teixeira
Copy link
Copy Markdown
Contributor

@eng-flavio-teixeira eng-flavio-teixeira commented Nov 4, 2025

Motivation

Add the ability to configure kernel parameters (workgroup size, threads-per-transform, length factorization, etc..) per architecture and precision.

Technical Details

Main changes are contained within kernel-generator.py and function_pool.h, where the concept of architecture has been added.
For regular entries in kernel-generator.py that do not specify the architecture, the concept of gfx_generic is introduced to deal with those. The generic entries should behave similar to configuration entries before the current changes. The gfx_generic concept also supports different lds size configurations similar to what we currently have implemented.

Test Plan

Current tests should pass without issues and no additional tests are required for now. Performance should also not be affected by the current changes. Once this PR is merged, new kernels will be added with per precision/architecture optimizations.

Test Result

All tests should pass without any issues.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 5, 2025

Codecov Report

❌ Patch coverage is 56.47059% with 37 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...rojects/rocfft/library/src/include/function_pool.h 41.38% 15 Missing and 2 partials ⚠️
...ects/rocfft/library/src/include/function_map_key.h 51.61% 15 Missing ⚠️
projects/rocfft/shared/device_properties.h 66.67% 2 Missing and 3 partials ⚠️

❗ There is a different number of reports uploaded between BASE (edc55b2) and HEAD (5210b5c). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (edc55b2) HEAD (5210b5c)
hipSPARSE 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #2450       +/-   ##
============================================
- Coverage    85.85%   52.93%   -32.92%     
============================================
  Files          303      120      -183     
  Lines        21742    29438     +7696     
  Branches         0     3799     +3799     
============================================
- Hits         18665    15582     -3083     
- Misses        3077    12841     +9764     
- Partials         0     1015     +1015     
Flag Coverage Δ
hipSPARSE ?
rocFFT 52.93% <56.47%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...rocfft/library/src/device/generator/stockham_gen.h 84.62% <100.00%> (ø)
...rojects/rocfft/library/src/rtc_stockham_kernel.cpp 81.61% <100.00%> (ø)
projects/rocfft/shared/device_properties.h 47.37% <66.67%> (ø)
...ects/rocfft/library/src/include/function_map_key.h 38.14% <51.61%> (ø)
...rojects/rocfft/library/src/include/function_pool.h 39.81% <41.38%> (ø)

... and 418 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@eng-flavio-teixeira eng-flavio-teixeira marked this pull request as ready for review November 27, 2025 19:08
@eng-flavio-teixeira eng-flavio-teixeira requested a review from a team as a code owner November 27, 2025 19:08
@eng-flavio-teixeira eng-flavio-teixeira changed the title Add ability to configure kernel per architecture [rocFFT] Add ability to configure kernel per architecture Nov 27, 2025
@evetsso
Copy link
Copy Markdown
Contributor

evetsso commented Dec 2, 2025

Is there any reason you didn't just use empty string for the generic arch name?

@eng-flavio-teixeira
Copy link
Copy Markdown
Contributor Author

Is there any reason you didn't just use empty string for the generic arch name?

I thought the generic arch name would make more sense than an empty string for describing the purpose here, but an empty string should also work.

Comment thread projects/rocfft/shared/device_properties.h
@evetsso evetsso self-requested a review December 3, 2025 16:30
@eng-flavio-teixeira
Copy link
Copy Markdown
Contributor Author

Is there any reason you didn't just use empty string for the generic arch name?

An empty string would be annoying to handle in the getline() loop in stockham_gen.cpp.
The gfx_generic could be removed from the supported_arch enum in config_arch.py, but it is much easier if we have something other than an empty string to parse the line in stockham_gen.

Comment thread projects/rocfft/library/src/device/generator/stockham_gen.cpp
Copy link
Copy Markdown
Contributor

@malcolmroberts malcolmroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned device instead of arch, but we can probably just add the relevant data like CU count, or whatever else we want.

@eng-flavio-teixeira
Copy link
Copy Markdown
Contributor Author

eng-flavio-teixeira commented Dec 5, 2025

I mentioned device instead of arch, but we can probably just add the relevant data like CU count, or whatever else we want.

Are you thinking of replacing arch with more detailed data like CU count, LDS size, L1/L2/L3 cache etc..., or this would be in addition to arch? And do we want to tune with that level of detail?

@af-ayala af-ayala self-requested a review December 9, 2025 09:06
Comment thread projects/rocfft/library/src/device/kernel-generator.py
@eng-flavio-teixeira eng-flavio-teixeira merged commit 3d60cc3 into ROCm:develop Dec 11, 2025
16 checks passed
ammallya pushed a commit that referenced this pull request Feb 3, 2026
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants