[rocPRIM] Config modernization by NB4444 · Pull Request #2955 · ROCm/rocm-libraries

NB4444 · 2025-11-27T10:56:08Z

Motivation

Our previous configuration system had become limiting in several ways. Most importantly, it was not able to differentiate between individual GPUs when selecting config parameters. This made proper tuning difficult and prevented future work involving SPIR-V–specific tuning. In addition, the old approach relied heavily on complex template metaprogramming, which had become difficult to maintain. With the move to C++17, we now have cleaner and more expressive language features available, making this a good opportunity to redesign the system.

Technical Details

All changes are internal. There are no API changes for users.

The majority of the diff in this PR consists of the new configuration definitions themselves, so while the PR appears large, the actual code changes are relatively small.

New Configuration Structure

Each algorithm now defines a *_config_picker templated on the target and value type. Below is a simplified example:

template<class Target, class value_type>
constexpr auto <algo_name>_config_picker()
    -> std::enable_if_t<
        std::is_same_v<Target,
                       comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>>,
        <algo_name>_config_params>
{
    // Tuned configuration #1
    if constexpr (/* condition for this combination */)
    {
        return <algo_name>_config_params{ ... };
    }
    // Tuned configuration #2
    if constexpr (/* condition for this combination */)
    {
        return <algo_name>_config_params{ ... };
    }
    // Default for this target
    return <algo_name>_config_params_base<value_type>();
}

Each tuned target provides a similar overload. For untuned or unknown targets, we provide a general fallback:

template<class Target, class value_type>
constexpr auto <algo_name>_config_picker()
    -> std::enable_if_t<
        std::is_same_v<Target,
                       comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>,
        <algo_name>_config_params>
{
    // Fallback: use a commonly tuned target (often MI100)
    return <algo_name>_config_picker<
        comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>,
        value_type>();
}

All available tuned targets are listed in:

using <algo_name>_targets = comp_targets<
    comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>,
    ...,
    comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>;
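
For readers who want to try the overload pattern in isolation, here is a minimal compilable sketch. It is not the rocPRIM code: the enum definitions, the empty comp_target template, demo_config_params, its two fields, and all tuning values are hypothetical stand-ins chosen only to make the SFINAE-based selection concrete.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the enums named in the PR description.
enum class gen { unknown, gcn5, cdna1 };
enum class target_arch { unknown, gfx906, gfx908 };
enum class gpu { generic, mi50, mi100 };
enum class rep { amdgcn, spirv };

// Assumption: a target is a type parameterized on the four enum values.
template<gen Gen, target_arch Arch, gpu Gpu, rep Rep>
struct comp_target
{};

using mi50_target    = comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>;
using mi100_target   = comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>;
using generic_target = comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>;

// Hypothetical parameter bundle; real config params are algorithm-specific.
struct demo_config_params
{
    unsigned int block_size;
    unsigned int items_per_thread;
};

// Untuned default derived from the value type (made-up numbers).
template<class value_type>
constexpr demo_config_params demo_config_params_base()
{
    return demo_config_params{256u, sizeof(value_type) <= 4 ? 8u : 4u};
}

// MI100 overload, declared first so the generic fallback below can find it.
template<class Target, class value_type>
constexpr auto demo_config_picker()
    -> std::enable_if_t<std::is_same_v<Target, mi100_target>, demo_config_params>
{
    // "Tuned" entry for 8-byte values (made-up numbers).
    if constexpr(sizeof(value_type) == 8)
    {
        return demo_config_params{128u, 4u};
    }
    return demo_config_params_base<value_type>();
}

// MI50 overload.
template<class Target, class value_type>
constexpr auto demo_config_picker()
    -> std::enable_if_t<std::is_same_v<Target, mi50_target>, demo_config_params>
{
    // "Tuned" entry for 1-byte values (made-up numbers).
    if constexpr(sizeof(value_type) == 1)
    {
        return demo_config_params{512u, 16u};
    }
    return demo_config_params_base<value_type>();
}

// Fallback for untuned or unknown targets: forward to the MI100 overload.
template<class Target, class value_type>
constexpr auto demo_config_picker()
    -> std::enable_if_t<std::is_same_v<Target, generic_target>, demo_config_params>
{
    return demo_config_picker<mi100_target, value_type>();
}
```

Because every picker is constexpr, the chosen parameters can be checked at compile time, for example with static_assert(demo_config_picker<generic_target, double>().block_size == 128u). The fallback calls the picker without function arguments, so argument-dependent lookup does not apply and the MI100 overload has to be declared before it.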

How Config Selection Works Now

In the new system, kernels are compiled for all tuned targets. At runtime, if the current GPU does not have dedicated tuning, the library uses the most_common_config policy to choose the best matching compiled kernel.

The selection policy (tested in test_config_dispatch.cpp) attempts to match, in decreasing priority:

Exact GPU model
Architecture
Generation

If no match is found, it falls back to the unknown target. If multiple candidates match, the last one listed in the comp_targets type list is chosen, which gives us a controlled and predictable fallback order.
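
The matching order described above can be sketched as a small constexpr function. This is only an illustration under stated assumptions, not the policy implemented in rocPRIM: target_desc, match_rank, select_target, demo_targets, and the enum values are hypothetical, and the sketch assumes the unknown target is listed last.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for the enums named in the PR description.
enum class gen { unknown, gcn5, cdna2 };
enum class target_arch { unknown, gfx906, gfx90a };
enum class gpu { generic, mi50, mi210, mi250, mi250x };

// Runtime-friendly description of a target (hypothetical).
struct target_desc
{
    gen         generation;
    target_arch arch;
    gpu         model;
};

// Rank a candidate against the current device:
// 3 = exact GPU model, 2 = architecture, 1 = generation, 0 = no match.
constexpr int match_rank(target_desc candidate, target_desc current)
{
    if(current.model != gpu::generic && candidate.model == current.model)
        return 3;
    if(current.arch != target_arch::unknown && candidate.arch == current.arch)
        return 2;
    if(current.generation != gen::unknown && candidate.generation == current.generation)
        return 1;
    return 0;
}

// Return the index of the compiled target to use for the current device.
// Assumption: the unknown/generic target is the last entry of the list.
template<std::size_t N>
constexpr std::size_t select_target(const std::array<target_desc, N>& compiled,
                                    target_desc                       current)
{
    static_assert(N > 0, "at least the unknown target must be listed");
    std::size_t best      = N - 1; // fall back to the unknown target
    int         best_rank = 0;
    for(std::size_t i = 0; i < N; ++i)
    {
        const int rank = match_rank(compiled[i], current);
        // ">=" lets a later entry win a tie, so the last listed match is chosen.
        if(rank > 0 && rank >= best_rank)
        {
            best      = i;
            best_rank = rank;
        }
    }
    return best;
}

// Example list of compiled targets (hypothetical).
inline constexpr std::array<target_desc, 4> demo_targets{{
    {gen::gcn5, target_arch::gfx906, gpu::mi50},
    {gen::cdna2, target_arch::gfx90a, gpu::mi210},
    {gen::cdna2, target_arch::gfx90a, gpu::mi250},
    {gen::unknown, target_arch::unknown, gpu::generic},
}};
```

In this sketch a device described as {cdna2, gfx90a, mi250x} has no exact entry, matches indices 1 and 2 on architecture, and resolves to index 2 because later entries win ties. A device with no matching field resolves to index 3, the unknown target.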

We also pass the selected target into kernel compilation, enabling compile-time specialization based on GPU, architecture, and generation.

Target struct

The target struct currently stores only:

GPU generation
Architecture
GPU Name
Representation (rep), which distinguishes SPIR-V from native AMDGCN

The rep field is not yet functional (it requires compiler support), and the dispatch policy does not currently consider it. The target struct also makes it relatively easy to store more data later.
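
As a rough illustration of how a target type carrying these four fields can drive compile-time specialization, consider the sketch below. The member names (generation, arch, model, representation), the enum values, and demo_items_per_thread with its return values are assumptions made for this example and may not match the actual struct in config_types.hpp.

```cpp
// Hypothetical stand-ins for the enums named in the PR description.
enum class gen { unknown, gcn5 };
enum class target_arch { unknown, gfx906 };
enum class gpu { generic, mi50 };
enum class rep { amdgcn, spirv };

// Assumption: the target exposes its four fields as static constants.
template<gen Gen, target_arch Arch, gpu Gpu, rep Rep>
struct comp_target
{
    static constexpr gen         generation     = Gen;
    static constexpr target_arch arch           = Arch;
    static constexpr gpu         model          = Gpu;
    static constexpr rep         representation = Rep;
};

// Code that receives the selected target as a template parameter can branch
// on it at compile time, from most specific to least specific.
// The returned values are made up for illustration.
template<class Target>
constexpr unsigned int demo_items_per_thread()
{
    if constexpr(Target::model == gpu::mi50)
        return 8u; // specialization for one GPU model
    else if constexpr(Target::arch == target_arch::gfx906)
        return 6u; // specialization for the architecture
    else if constexpr(Target::generation == gen::gcn5)
        return 4u; // specialization for the generation
    else
        return 2u; // generic default
}
```

Adding another field would then mean adding one more template parameter and one more static member, without touching code that only reads the existing fields.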

Scripts

The Python script changes in this PR update the scripts that use the configs as input or output.

Summary of Improvements:

Better differentiation and selection across GPUs
Cleaner C++17-based implementation
Easier extension for future SPIR-V tuning
Improved maintainability of config definitions
More flexibility for future features

Test Plan

Some tests were added in test_config_dispatch.cpp; these and all the other tests should pass. Everything also needs to be benchmarked to check that the correct configs are chosen.

Test Result

All tests pass; benchmarks are still WIP.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

codecov-commenter · 2025-11-27T11:38:05Z

Codecov Report

❌ Patch coverage is 17.23313% with 3458 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...clude/rocprim/device/detail/config/device_scan.hpp	13.16%	462 Missing ⚠️
.../device/detail/config/device_run_length_encode.hpp	13.53%	358 Missing ⚠️
...ude/rocprim/device/detail/config/device_reduce.hpp	13.51%	352 Missing ⚠️
...il/config/device_run_length_encode_non_trivial.hpp	13.42%	316 Missing ⚠️
...prim/device/detail/config/device_adjacent_find.hpp	14.41%	196 Missing ⚠️
...evice/detail/config/device_partition_predicate.hpp	16.02%	194 Missing ⚠️
...evice/detail/config/device_partition_three_way.hpp	16.02%	194 Missing ⚠️
...tail/config/device_partition_two_way_predicate.hpp	16.02%	194 Missing ⚠️
...tail/config/device_adjacent_difference_inplace.hpp	15.92%	169 Missing ⚠️
...evice/detail/config/device_adjacent_difference.hpp	16.42%	168 Missing ⚠️
... and 7 more

❗ The number of uploaded reports differs between BASE (82a516d) and HEAD (984c824): HEAD has 2 fewer uploads than BASE.

Flag	BASE (82a516d)	HEAD (984c824)
rocFFT	1	0
hipSPARSE	1	0

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #2955       +/-   ##
============================================
- Coverage    68.07%   42.94%   -25.13%     
============================================
  Files          425      193      -232     
  Lines        51229    28160    -23069     
  Branches      3802      699     -3103     
============================================
- Hits         34872    12093    -22779     
- Misses       15343    15475      +132     
+ Partials      1014      592      -422

Flag	Coverage Δ
hipCUB	`81.76% <85.71%> (?)`
hipSPARSE	`?`
rocFFT	`?`
rocPRIM	`38.96% <16.77%> (?)`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...rojects/rocprim/rocprim/include/rocprim/config.hpp	`100.00% <ø> (ø)`
...prim/device/detail/config/device_binary_search.hpp	`14.43% <ø> (ø)`
.../rocprim/device/detail/config/device_histogram.hpp	`17.37% <ø> (ø)`
...ocprim/device/detail/config/device_lower_bound.hpp	`14.72% <ø> (ø)`
...lude/rocprim/device/detail/config/device_merge.hpp	`13.33% <ø> (ø)`
...ce/detail/config/device_merge_sort_block_merge.hpp	`14.53% <ø> (ø)`
...ice/detail/config/device_merge_sort_block_sort.hpp	`13.54% <ø> (ø)`
...evice/detail/config/device_radix_sort_onesweep.hpp	`7.49% <ø> (ø)`
...prim/device/detail/config/device_reduce_by_key.hpp	`13.13% <ø> (ø)`
...ocprim/device/detail/config/device_scan_by_key.hpp	`14.04% <ø> (ø)`
... and 78 more

... and 530 files with indirect coverage changes


* First checkpoint * Second checkpoint - hot loop scheduler * Third checkpoint - init main operator * Fourth checkpoint - main loop ready * Fifth checkpoint - main loop fix * Sixth checkpoint - ReadWritecompFunc * Seventh checkpoint - Tail finished * [CK_TILE] Blockwise gemm pipeline v5 complete * Working * Working fixes 2 * Rename v5 to v77 temporarily * Data type adjustment * Data type adjustment 2 * [CK_TILE] Blockwise Gemm pipeline v5 add tests * [CK_TILE] Fix calculation error * TEMP: check pipeline * Fix name to V6 * naming and documentation changes * WIP dump * Try fixing v1 * Failing tests v5 * Debugging * Changes v2 * F16 tests working great * Working BlockwiseGemmPipelineV5 as V6 * Cleanup and format * Merging changes part1 * [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6 * Remove commented code * Fix gfx950 build issues * Fix file formatting * Review changes, more concat info, add bf16 bf8 tests * Fix formatting * Add bf16 and bf8 tests --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

NB4444 · 2025-12-15T10:42:10Z

I have also added a fix for generic build types and added support for gfx1101, gfx1152, and gfx1153.

umfranzw

This looks great - thanks @NB4444, I think it's much improved over the old system.

NB4444 · 2025-12-16T10:19:39Z

I have added some more missing architectures.

stanleytsang-amd · 2025-12-17T21:51:27Z

@NB4444 Since the last update on Monday, the device_histogram unit test has been failing on gfx942:

[----------] 1 test from RocprimDeviceHistogramMultiEven/10, where TypeParam = params3<int,4u,3u,2000u,0,2000,int,int,rocprim::ROCPRIM_400200_NS::default_config,true>

[ RUN ] RocprimDeviceHistogramMultiEven/10.MultiEven

../../../../test/rocprim/test_utils_assertions.hpp:86: Failure

Expected equality of these values:

val

Which is: 2

expected

Which is: 1

where index = 1610

Google Test trace:

../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0

../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4

../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200

../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0}

../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0

../../../../test/rocprim/test_utils_assertions.hpp:139: Failure

Expected: protected_assert_eq(result[i], expected[i], i) doesn't generate new fatal failures in the current thread.

Actual: it does.

Google Test trace:

../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0

../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4

../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200

../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0}

../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0

../../../../test/rocprim/test_device_histogram.cpp:772: Failure

Expected: test_utils::assert_eq(histogram[channel], histogram_expected[channel], bins[channel]) doesn't generate new fatal failures in the current thread.

Actual: it does.

Google Test trace:

../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0

../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4

../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200

../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0}

../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0

NB4444 · 2025-12-18T14:57:50Z

I’ve added a temporary workaround for the failure. The change that exposed the issue was adding additional architectures to the string array in commit 85f49bf. The same change on develop also triggers the test failure.

The root cause appears to be in hipgraph, specifically in the private global histogram optimization for gfx942. As a temporary measure, I’ve disabled this optimization when used with hipgraphs.

I’ll investigate further tomorrow, but the underlying issue is unrelated to the config system changes themselves. It’s still unclear why the seemingly unrelated change of adding architectures ended up triggering this problem.

NB4444 · 2025-12-19T11:35:24Z

I opted for a different temporary solution, one that modifies the change in this PR that triggered the issue. There seems to be some kind of overflow: when the number of items in the std::array (or a C-style array) exceeds 16, we start seeing this unrelated test failure. It can be avoided by making the array size one larger than the number of items. I will investigate further, because this is not a satisfactory solution. The issue is unrelated to the PR changes, though; it was already present and simply did not show up because the array never exceeded 16 items.

NB4444 · 2026-01-05T12:57:57Z

I replaced the workaround with something a bit more permanent that does not rely on undefined behavior.

umfranzw · 2026-01-05T16:53:38Z

I've reviewed the updates, and CI is now passing, so I think this is good to merge.



NB4444 self-assigned this Nov 27, 2025

NB4444 added the organization: streamhpc contributors from streamhpc label Nov 27, 2025

github-actions Bot added project: hipcub project: rocprim labels Nov 27, 2025

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch 2 times, most recently from 15bebc3 to 0cf0edf Compare November 27, 2025 14:26

github-actions Bot added the documentation label Nov 27, 2025

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 0cf0edf to 9f96c6f Compare November 28, 2025 09:07

eble-amd mentioned this pull request Dec 4, 2025

gfx1152 & gfx1153 bring up ROCm/TheRock#2310

Open

72 tasks

eble-amd reviewed Dec 4, 2025

Comment thread projects/rocprim/rocprim/include/rocprim/device/config_types.hpp

eble-amd reviewed Dec 4, 2025

Comment thread projects/rocprim/rocprim/include/rocprim/device/config_types.hpp

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from a97c9ec to 5d7ad96 Compare December 8, 2025 14:21

NB4444 marked this pull request as ready for review December 15, 2025 10:39

NB4444 requested review from a team as code owners December 15, 2025 10:39

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 5d7ad96 to 3d5ee81 Compare December 15, 2025 10:39

NB4444 requested a review from eble-amd December 15, 2025 10:42

umfranzw approved these changes Dec 15, 2025

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 3d5ee81 to a76c444 Compare December 16, 2025 10:13

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 8ea8812 to d4628a4 Compare December 16, 2025 10:32

stanleytsang-amd mentioned this pull request Dec 16, 2025

[rocPRIM] Add gfx1151 target arch details #3418

Closed

1 task

amd-mtrifuno mentioned this pull request Dec 17, 2025

[rocPRIM] Add gfx1150 support #3388

Merged

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from f628921 to c187a16 Compare December 19, 2025 13:15

NB4444 and others added 21 commits January 5, 2026 06:50

Resolve "Update configs for new config system part 3"

74d8689

Resolve "Update configs for new config system part 1"

20d93ca

Resolve "Update configs for new config system part 2"

78de57f

Update the config for radix_onesweep based on upstream changes

9cf6496

Resolve "New Config system tests"

43a9915

Resolve "Consistency in config tags"

73b13ac

Resolve "Remove all unused config functions old system"

629896f

Resolve "Update autotune create_optimization script for new config system"

b3639b6

Resolve "Update apply_config_improvement script for new configs"

079fad8

Added to CHANGELOG

fa3b066

Cleanup target_config

0b5a48e

Fix base block methods adjacent_difference_config

293777c

Clear previous caches before current one is created

6448f13

Give device_histogram the same fallback as previous configs system and fix predicate_flag config choosing error.

a6a1cf6

Resolve "Fix generic compile target new config system"

8afe9ae

Manually fixing the worst regression after fixing predicate_flag

79679b6

Add more arch for configs

c9fcc81

Add more supported architectures

f23bea4

Scope the define to rocprim

6dd20f0

Add temp fix for failing test

f4be231

TEMP FIX: instead of disabling optimization for array size one larger

405ded6

NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from c187a16 to 405ded6 Compare January 5, 2026 06:52

Replace workaround with less undefined fix

984c824

umfranzw merged commit 87175b8 into develop Jan 5, 2026
26 checks passed

umfranzw deleted the users/NB4444/config-tuning-modernization branch January 5, 2026 16:54
