[rocPRIM] Config modernization #2955

Merged: umfranzw merged 26 commits into develop from users/NB4444/config-tuning-modernization on Jan 5, 2026

@NB4444 NB4444 commented Nov 27, 2025

Motivation

Our previous configuration system had become limiting in several ways. Most importantly, it was not able to differentiate between individual GPUs when selecting config parameters. This made proper tuning difficult and prevented future work involving SPIR-V–specific tuning. In addition, the old approach relied heavily on complex template metaprogramming, which had become difficult to maintain. With the move to C++17, we now have cleaner and more expressive language features available, making this a good opportunity to redesign the system.

Technical Details

All changes are internal. There are no API changes for users.

The majority of the diff in this PR consists of the new configuration definitions themselves, so while the PR appears large, the actual code changes are relatively small.

New Configuration Structure

Each algorithm now defines a *_config_picker templated on the target and value type. Below is a simplified example:

template<class Target, class value_type>
constexpr auto <algo_name>_config_picker()
    -> std::enable_if_t<
        std::is_same_v<Target,
                       comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>>,
        <algo_name>_config_params>
{
    // Tuned configuration #1
    if constexpr (/* condition for this combination */)
    {
        return <algo_name>_config_params{ ... };
    }
    // Tuned configuration #2
    if constexpr (/* condition for this combination */)
    {
        return <algo_name>_config_params{ ... };
    }
    // Default for this target
    return <algo_name>_config_params_base<value_type>();
}

Each tuned target provides a similar overload. For untuned or unknown targets, we provide a general fallback:

template<class Target, class value_type>
constexpr auto <algo_name>_config_picker()
    -> std::enable_if_t<
        std::is_same_v<Target,
                       comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>,
        <algo_name>_config_params>
{
    // Fallback: use a commonly tuned target (often MI100)
    return <algo_name>_config_picker<
        comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>,
        value_type>();
}
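
The overload pattern above can be sketched as a small self-contained C++17 program. This is only an illustration under stated assumptions: the enum entries, `demo_config_params`, `demo_config_params_base`, `demo_config_picker`, and all tuning numbers are hypothetical stand-ins, not the definitions used in rocPRIM.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the enums named in the description; the real
// definitions live in rocPRIM and have many more entries.
enum class gen { unknown, gcn5, cdna1 };
enum class target_arch { unknown, gfx906, gfx908 };
enum class gpu { generic, mi50, mi100 };
enum class rep { amdgcn, spirv };

template<gen Gen, target_arch Arch, gpu Gpu, rep Rep>
struct comp_target
{};

using mi100_target   = comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>;
using generic_target = comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>;

// Hypothetical parameter bundle; the numbers used below are made up.
struct demo_config_params
{
    unsigned int block_size;
    unsigned int items_per_thread;
};

// Untuned default that depends only on the value type.
template<class value_type>
constexpr demo_config_params demo_config_params_base()
{
    return demo_config_params{256u, sizeof(value_type) <= 4 ? 8u : 4u};
}

// Participates in overload resolution only when Target is the MI100 target.
template<class Target, class value_type>
constexpr auto demo_config_picker()
    -> std::enable_if_t<std::is_same_v<Target, mi100_target>, demo_config_params>
{
    if constexpr(sizeof(value_type) == 8)
    {
        // "Tuned" entry for 8-byte value types.
        return demo_config_params{128u, 16u};
    }
    else
    {
        // No tuned entry: use the default for this target.
        return demo_config_params_base<value_type>();
    }
}

// Generic fallback: forwards to the MI100 overload.
template<class Target, class value_type>
constexpr auto demo_config_picker()
    -> std::enable_if_t<std::is_same_v<Target, generic_target>, demo_config_params>
{
    return demo_config_picker<mi100_target, value_type>();
}
```

Because every overload is `constexpr`, the chosen parameters can be checked at compile time, for example `static_assert(demo_config_picker<generic_target, double>().block_size == 128u);`.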

All available tuned targets are listed in:

using <algo_name>_targets = comp_targets<
    comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>,
    ...,
    comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>;

How Config Selection Works Now

In the new system, kernels are compiled for all tuned targets. At runtime, if the current GPU does not have dedicated tuning, the library uses the most_common_config policy to choose the best matching compiled kernel.

The selection policy (tested in test_config_dispatch.cpp) attempts to match, in decreasing priority:

  1. Exact GPU model
  2. Architecture
  3. Generation

If no match is found, it falls back to the unknown target. If multiple candidates match, the last one listed in the comp_targets type list is chosen, which gives us a controlled and predictable fallback order.
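
The matching order described above can be sketched roughly as follows. This is one reading of the policy, not the code in the PR: `target_desc`, `match_score`, `pick_target`, `demo_targets`, and the extra enum entries (`gfx900`, `gfx902`, `vega10`) are hypothetical, and the sketch assumes that the unknown target is listed last and that "last one wins" applies among candidates matching at the same priority.

```cpp
#include <array>
#include <cstddef>

// Hypothetical enums; only a few illustrative entries are listed.
enum class gen { unknown, gcn5, cdna1 };
enum class target_arch { unknown, gfx900, gfx902, gfx906, gfx908 };
enum class gpu { generic, vega10, mi50, mi100 };

// Value-level description of a target (the PR uses a type-level comp_target).
struct target_desc
{
    gen         generation;
    target_arch arch;
    gpu         model;
};

// 3 = exact GPU model, 2 = architecture, 1 = generation, 0 = no match.
// Unknown/generic fields never count as a match.
constexpr int match_score(const target_desc& candidate, const target_desc& device)
{
    if(candidate.model == device.model && candidate.model != gpu::generic)
        return 3;
    if(candidate.arch == device.arch && candidate.arch != target_arch::unknown)
        return 2;
    if(candidate.generation == device.generation && candidate.generation != gen::unknown)
        return 1;
    return 0;
}

// Returns the index of the selected target. Assumes the unknown target is
// the last entry, so it serves as the fallback when nothing matches.
template<std::size_t N>
constexpr std::size_t pick_target(const std::array<target_desc, N>& targets,
                                  const target_desc&                device)
{
    static_assert(N > 0, "at least the unknown target must be listed");
    std::size_t best       = N - 1;
    int         best_score = 0;
    for(std::size_t i = 0; i < N; ++i)
    {
        const int score = match_score(targets[i], device);
        // ">=" lets a later entry replace an earlier one with the same score.
        if(score > 0 && score >= best_score)
        {
            best       = i;
            best_score = score;
        }
    }
    return best;
}

// Illustrative target list, ending with the unknown target.
inline constexpr std::array<target_desc, 4> demo_targets{{
    {gen::gcn5, target_arch::gfx900, gpu::vega10},
    {gen::gcn5, target_arch::gfx906, gpu::mi50},
    {gen::cdna1, target_arch::gfx908, gpu::mi100},
    {gen::unknown, target_arch::unknown, gpu::generic},
}};
```

With this list, a device that only shares its generation (`gen::gcn5`) with the first two entries resolves to the second one, since it is listed later; a device with no matching field resolves to the final unknown entry.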

We also pass the selected target into kernel compilation, enabling compile-time specialization based on GPU, architecture, and generation.

Target struct

The target struct currently stores only:

  • GPU generation
  • Architecture
  • GPU Name
  • Representation (rep), which distinguishes SPIR-V from native AMDGCN

The rep field is not yet functional (it requires compiler support), and the dispatch policy does not consider it at the moment. The target struct also makes it relatively easy to store more data in the future.
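
As a rough illustration of how such a target type could carry its four fields and drive compile-time specialization, consider the sketch below. The member names (`generation`, `arch`, `model`, `representation`), the function `demo_items_per_thread`, and the returned values are assumptions made for this example; the actual struct in the PR may be laid out differently.

```cpp
// Hypothetical enums with a handful of illustrative entries.
enum class gen { unknown, gcn5, cdna1 };
enum class target_arch { unknown, gfx906, gfx908 };
enum class gpu { generic, mi50, mi100 };
enum class rep { amdgcn, spirv };

// Type-level target: each field is available as a compile-time constant.
template<gen Gen, target_arch Arch, gpu Gpu, rep Rep>
struct comp_target
{
    static constexpr gen         generation     = Gen;
    static constexpr target_arch arch           = Arch;
    static constexpr gpu         model          = Gpu;
    static constexpr rep         representation = Rep; // carried, but not used for dispatch here
};

using mi50_target    = comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>;
using mi100_target   = comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>;
using generic_target = comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>;

// Code that receives the target as a template argument can branch on any
// field at compile time. The returned values are made up.
template<class Target>
constexpr unsigned int demo_items_per_thread()
{
    if constexpr(Target::model == gpu::mi100)
    {
        return 16u;
    }
    else if constexpr(Target::arch == target_arch::gfx906)
    {
        return 8u;
    }
    else
    {
        return 4u;
    }
}
```

Adding another piece of data in this sketch would mean adding one template parameter and one static member, which is in line with the remark that the struct is easy to extend.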

Scripts

The Python script changes in this PR update the scripts that use the configs as input or output.

Summary of Improvements:

  • Better differentiation and selection across GPUs
  • Cleaner C++17-based implementation
  • Easier extension for future SPIR-V tuning
  • Improved maintainability of config definitions
  • More flexibility for future features

Test Plan

Some tests were added in test_config_dispatch.cpp; these and all other tests should pass. Everything also needs to be benchmarked to verify that the correct configs are chosen.

Test Result

All tests pass; benchmarks are still in progress.

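Submission Checklist

  • [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.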

@NB4444 NB4444 self-assigned this Nov 27, 2025
@NB4444 NB4444 added the organization: streamhpc contributors from streamhpc label Nov 27, 2025

codecov-commenter commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 17.23313% with 3458 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...clude/rocprim/device/detail/config/device_scan.hpp 13.16% 462 Missing ⚠️
.../device/detail/config/device_run_length_encode.hpp 13.53% 358 Missing ⚠️
...ude/rocprim/device/detail/config/device_reduce.hpp 13.51% 352 Missing ⚠️
...il/config/device_run_length_encode_non_trivial.hpp 13.42% 316 Missing ⚠️
...prim/device/detail/config/device_adjacent_find.hpp 14.41% 196 Missing ⚠️
...evice/detail/config/device_partition_predicate.hpp 16.02% 194 Missing ⚠️
...evice/detail/config/device_partition_three_way.hpp 16.02% 194 Missing ⚠️
...tail/config/device_partition_two_way_predicate.hpp 16.02% 194 Missing ⚠️
...tail/config/device_adjacent_difference_inplace.hpp 15.92% 169 Missing ⚠️
...evice/detail/config/device_adjacent_difference.hpp 16.42% 168 Missing ⚠️
... and 7 more

❗ There is a different number of reports uploaded between BASE (82a516d) and HEAD (984c824). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (82a516d) HEAD (984c824)
rocFFT 1 0
hipSPARSE 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #2955       +/-   ##
============================================
- Coverage    68.07%   42.94%   -25.13%     
============================================
  Files          425      193      -232     
  Lines        51229    28160    -23069     
  Branches      3802      699     -3103     
============================================
- Hits         34872    12093    -22779     
- Misses       15343    15475      +132     
+ Partials      1014      592      -422     
Flag Coverage Δ
hipCUB 81.76% <85.71%> (?)
hipSPARSE ?
rocFFT ?
rocPRIM 38.96% <16.77%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...rojects/rocprim/rocprim/include/rocprim/config.hpp 100.00% <ø> (ø)
...prim/device/detail/config/device_binary_search.hpp 14.43% <ø> (ø)
.../rocprim/device/detail/config/device_histogram.hpp 17.37% <ø> (ø)
...ocprim/device/detail/config/device_lower_bound.hpp 14.72% <ø> (ø)
...lude/rocprim/device/detail/config/device_merge.hpp 13.33% <ø> (ø)
...ce/detail/config/device_merge_sort_block_merge.hpp 14.53% <ø> (ø)
...ice/detail/config/device_merge_sort_block_sort.hpp 13.54% <ø> (ø)
...evice/detail/config/device_radix_sort_onesweep.hpp 7.49% <ø> (ø)
...prim/device/detail/config/device_reduce_by_key.hpp 13.13% <ø> (ø)
...ocprim/device/detail/config/device_scan_by_key.hpp 14.04% <ø> (ø)
... and 78 more

... and 530 files with indirect coverage changes

@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch 2 times, most recently from 15bebc3 to 0cf0edf Compare November 27, 2025 14:26
john00003 pushed a commit that referenced this pull request Nov 27, 2025
* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 0cf0edf to 9f96c6f Compare November 28, 2025 09:07
Comment thread projects/rocprim/rocprim/include/rocprim/device/config_types.hpp
Comment thread projects/rocprim/rocprim/include/rocprim/device/config_types.hpp
@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from a97c9ec to 5d7ad96 Compare December 8, 2025 14:21
@NB4444 NB4444 marked this pull request as ready for review December 15, 2025 10:39
@NB4444 NB4444 requested review from a team as code owners December 15, 2025 10:39
@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 5d7ad96 to 3d5ee81 Compare December 15, 2025 10:39

NB4444 commented Dec 15, 2025

I have also added a fix for generic build types, and added support for gfx1101, gfx1152, and gfx1153.

@NB4444 NB4444 requested a review from eble-amd December 15, 2025 10:42

@umfranzw umfranzw left a comment

This looks great - thanks @NB4444, I think it's much improved over the old system.

@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 3d5ee81 to a76c444 Compare December 16, 2025 10:13

NB4444 commented Dec 16, 2025

I have added some more missing architectures.

@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from 8ea8812 to d4628a4 Compare December 16, 2025 10:32

stanleytsang-amd commented Dec 17, 2025

@NB4444 Since the last update on Monday, device_histogram unit test is failing on gfx942:

[----------] 1 test from RocprimDeviceHistogramMultiEven/10, where TypeParam = params3<int,4u,3u,2000u,0,2000,int,int,rocprim::ROCPRIM_400200_NS::default_config,true>
[ RUN ] RocprimDeviceHistogramMultiEven/10.MultiEven
../../../../test/rocprim/test_utils_assertions.hpp:86: Failure
Expected equality of these values:
val
Which is: 2
expected
Which is: 1
where index = 1610
Google Test trace:
../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0
../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4
../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200
../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0}
../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0

../../../../test/rocprim/test_utils_assertions.hpp:139: Failure
Expected: protected_assert_eq(result[i], expected[i], i) doesn't generate new fatal failures in the current thread.
Actual: it does.
Google Test trace:
../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0
../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4
../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200
../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0}
../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0

../../../../test/rocprim/test_device_histogram.cpp:772: Failure
Expected: test_utils::assert_eq(histogram[channel], histogram_expected[channel], bins[channel]) doesn't generate new fatal failures in the current thread.
Actual: it does.
Google Test trace:
../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0
../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4
../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200
../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0}
../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0


NB4444 commented Dec 18, 2025

I’ve added a temporary workaround for the failure. The change that exposed the issue was adding additional architectures to the string array in commit 85f49bf. The same change on develop also triggers the test failure.

The root cause appears to be in hipgraph, specifically in the private global histogram optimization for gfx942. As a temporary measure, I’ve disabled this optimization when used with hipgraphs.

I’ll investigate further tomorrow, but the underlying issue is unrelated to the config system changes themselves. It’s still unclear why the seemingly unrelated change of adding architectures ended up triggering this problem.


NB4444 commented Dec 19, 2025

I chose a different temporary solution, one that modifies the change in this PR that triggered the issue. There seems to be some kind of overflow: when the number of items in the std::array (or a C-style array) exceeds 16, this unrelated test starts failing. Setting the array size to one more than the number of items makes the failure go away. I will investigate further, because this is not a satisfactory solution. The underlying issue is unrelated to the PR changes, though; it was already present and only surfaced once the array grew past 16 items.

@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from f628921 to c187a16 Compare December 19, 2025 13:15
@NB4444 NB4444 force-pushed the users/NB4444/config-tuning-modernization branch from c187a16 to 405ded6 Compare January 5, 2026 06:52

NB4444 commented Jan 5, 2026

I replaced the workaround with something a bit more permanent that does not rely on undefined behavior.


umfranzw commented Jan 5, 2026

I've reviewed the updates, and CI is now passing, so I think this is good to merge.

@umfranzw umfranzw merged commit 87175b8 into develop Jan 5, 2026
26 checks passed
@umfranzw umfranzw deleted the users/NB4444/config-tuning-modernization branch January 5, 2026 16:54
assistant-librarian Bot pushed a commit to ROCm/hipCUB that referenced this pull request Jan 5, 2026
assistant-librarian Bot pushed a commit to ROCm/rocPRIM that referenced this pull request Jan 5, 2026
ammallya pushed a commit that referenced this pull request Feb 3, 2026
[ROCm/composable_kernel commit: 634634f]

6 participants