[rocPRIM] Config modernization#2955
Conversation
15bebc3 to
0cf0edf
Compare
* First checkpoint * Second checkpoint - hot loop scheduler * Third checkpoint - init main operator * Fourth checkpoint - main loop ready * Fifth checkpoint - main loop fix * Sixth checkpoint - ReadWritecompFunc * Seventh checkpoint - Tail finished * [CK_TILE] Blockwise gemm pipeline v5 complete * Working * Working fixes 2 * Rename v5 to v77 temporarily * Data type adjustment * Data type adjustment 2 * [CK_TILE] Blockwise Gemm pipeline v5 add tests * [CK_TILE] Fix calculation error * TEMP: check pipeline * Fix name to V6 * naming and documentation changes * WIP dump * Try fixing v1 * Failing tests v5 * Debugging * Changes v2 * F16 tests working great * Working BlockwiseGemmPipelineV5 as V6 * Cleanup and format * Merging changes part1 * [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6 * Remove commented code * Fix gfx950 build issues * Fix file formatting * Review changes, more concat info, add bf16 bf8 tests * Fix formatting * Add bf16 and bf8 tests --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
0cf0edf to
9f96c6f
Compare
a97c9ec to
5d7ad96
Compare
5d7ad96 to
3d5ee81
Compare
|
I have also added a fix for generic build types, and added support for the gfx1101, gfx1152 and gfx1153. |
3d5ee81 to
a76c444
Compare
|
I have added some more missing architectures. |
8ea8812 to
d4628a4
Compare
|
@NB4444 Since the last update on Monday, device_histogram unit test is failing on gfx942: [----------] 1 test from RocprimDeviceHistogramMultiEven/10, where TypeParam = params3<int,4u,3u,2000u,0,2000,int,int,rocprim::ROCPRIM_400200_NS::default_config,true> [ RUN ] RocprimDeviceHistogramMultiEven/10.MultiEven ../../../../test/rocprim/test_utils_assertions.hpp:86: Failure Expected equality of these values: val expected where index = 1610 Google Test trace: ../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0 ../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4 ../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200 ../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0} ../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0 ../../../../test/rocprim/test_utils_assertions.hpp:139: Failure Expected: protected_assert_eq(result[i], expected[i], i) doesn't generate new fatal failures in the current thread. Actual: it does. Google Test trace: ../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0 ../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4 ../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200 ../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0} ../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0 ../../../../test/rocprim/test_device_histogram.cpp:772: Failure Expected: test_utils::assert_eq(histogram[channel], histogram_expected[channel], bins[channel]) doesn't generate new fatal failures in the current thread. Actual: it does. Google Test trace: ../../../../test/rocprim/test_device_histogram.cpp:769: with channel = 0 ../../../../test/rocprim/test_device_histogram.cpp:654: with size = 4 ../../../../test/rocprim/test_device_histogram.cpp:653: with seed = 133108200 ../../../../test/rocprim/test_device_histogram.cpp:641: with dim = {1, 1, 0} ../../../../test/rocprim/test_device_histogram.cpp:600: with device_id = 0 |
|
I’ve added a temporary workaround for the failure. The change that exposed the issue was adding additional architectures to the string array in commit 85f49bf. The same change on develop also triggers the test failure. The root cause appears to be in hipgraph, specifically in the private global histogram optimization for gfx942. As a temporary measure, I’ve disabled this optimization when used with hipgraphs. I’ll investigate further tomorrow, but the underlying issue is unrelated to the config system changes themselves. It’s still unclear why the seemingly unrelated change of adding architectures ended up triggering this problem. |
|
I chose for a different temporary solution that changes the actual change in the PR that caused the issue. There seems some kind of overflow. When the items in the std::array (or other C style array) exceeds 16 items we start seeing this unrelated failing test. This can be fixed by setting the array size one larger then the amount of items. I will investigate this, because this is not really a satisfactory solution, but it is unrelated to the PR changes, the issue was already there it did just not exceed the size of 16. |
f628921 to
c187a16
Compare
fix predicate_flag config choosing error.
c187a16 to
405ded6
Compare
|
I replaced the workaround, with something a bit more permanent. Which does not rely on undefined behavior. |
|
I've reviewed the updates, and CI is now passing, so I think this is good to merge. |
[rocPRIM] Config modernization
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Our previous configuration system had become limiting in several ways.
Most importantly, it was not able to differentiate between individual
GPUs when selecting config parameters. This made proper tuning difficult
and prevented future work involving SPIR-V–specific tuning. In addition,
the old approach relied heavily on complex template metaprogramming,
which had become difficult to maintain. With the move to C++17, we now
have cleaner and more expressive language features available, making
this a good opportunity to redesign the system.
## Technical Details
All changes are internal. **There are no API changes for users.**
The majority of the diff in this PR consists of the new configuration
definitions themselves, so while the PR appears large, the actual code
changes are relatively small.
### New Configuration Structure
Each algorithm now defines a *_config_picker templated on the target and
value type. Below is a simplified example:
```cpp
template<class Target, class value_type>
constexpr <algo_name>_config_picker()
-> std::enable_if_t<
std::is_same_v<Target,
comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>>,
<algo_name>_config_params>
{
// Tuned configuration #1
if constexpr (/* condition for this combination */)
{
return <algo_name>_config_params{ ... };
}
// Tuned configuration #2
if constexpr (/* condition for this combination */)
{
return <algo_name>_config_params{ ... };
}
// Default for this target
return <algo_name>_config_params_base<value_type>();
}
```
Each tuned target provides a similar overload. For untuned or unknown
targets, we provide a general fallback:
```cpp
template<class Target, class value_type>
constexpr auto <algo_name>_config_picker()
-> std::enable_if_t<
std::is_same_v<Target,
comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>,
<algo_name>_config_params>
{
// Fallback: use a commonly tuned target (often MI100)
return <algo_name>_config_picker<
comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>,
key_type, value_type>();
}
```
All available tuned targets are listed in:
```cpp
using <algo_name>_targets = comp_targets<
comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>,
...,
comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>;
```
### How Config Selection Works Now
In the new system, kernels are compiled for all tuned targets. At
runtime, if the current GPU does not have dedicated tuning, the library
uses the most_common_config policy to choose the best matching compiled
kernel.
The selection policy (tested in test_config_dispatch.cpp) attempts to
match, in decreasing priority:
1. Exact GPU model
2. Architecture
3. Generation
If no match is found, it falls back to the unknown target. If multiple
candidates match, the last one listed in the comp_targets type list is
chosen, which gives us a controlled and predictable fallback order.
We also pass the selected target into kernel compilation, enabling
compile-time specialization based on GPU, architecture, and generation.
### Target struct
The target struct currently stores only:
- GPU generation
- Architecture
- GPU Name
- Representation (rep), which distinguishes SPIR-V from native AMDGCN
The rep field is not yet functional (requires compiler support), and the
dispatch policy does not consider it at the moment. Also this target
structs makes it relatively easy to store more data.
### Scripts
The python script changes in this PR are there for scripts that used the
configs as input/output.
### Summary of Improvements:
- Better differentiation and selection across GPUs
- Cleaner C++17-based implementation
- Easier extension for future SPIR-V tuning
- Improved maintainability of config definitions
- Added more flexibility for future features.
## Test Plan
Some tests were added in test_config_dispatch.cpp, these and all the
other tests should pass. Also everything needs to be benchmarked to see
if the correct configs are chosen.
## Test Result
All tests pass, benchmarks are still WIP.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[rocPRIM] Config modernization
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Our previous configuration system had become limiting in several ways.
Most importantly, it was not able to differentiate between individual
GPUs when selecting config parameters. This made proper tuning difficult
and prevented future work involving SPIR-V–specific tuning. In addition,
the old approach relied heavily on complex template metaprogramming,
which had become difficult to maintain. With the move to C++17, we now
have cleaner and more expressive language features available, making
this a good opportunity to redesign the system.
## Technical Details
All changes are internal. **There are no API changes for users.**
The majority of the diff in this PR consists of the new configuration
definitions themselves, so while the PR appears large, the actual code
changes are relatively small.
### New Configuration Structure
Each algorithm now defines a *_config_picker templated on the target and
value type. Below is a simplified example:
```cpp
template<class Target, class value_type>
constexpr <algo_name>_config_picker()
-> std::enable_if_t<
std::is_same_v<Target,
comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>>,
<algo_name>_config_params>
{
// Tuned configuration #1
if constexpr (/* condition for this combination */)
{
return <algo_name>_config_params{ ... };
}
// Tuned configuration #2
if constexpr (/* condition for this combination */)
{
return <algo_name>_config_params{ ... };
}
// Default for this target
return <algo_name>_config_params_base<value_type>();
}
```
Each tuned target provides a similar overload. For untuned or unknown
targets, we provide a general fallback:
```cpp
template<class Target, class value_type>
constexpr auto <algo_name>_config_picker()
-> std::enable_if_t<
std::is_same_v<Target,
comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>,
<algo_name>_config_params>
{
// Fallback: use a commonly tuned target (often MI100)
return <algo_name>_config_picker<
comp_target<gen::cdna1, target_arch::gfx908, gpu::mi100, rep::amdgcn>,
key_type, value_type>();
}
```
All available tuned targets are listed in:
```cpp
using <algo_name>_targets = comp_targets<
comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>,
...,
comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>;
```
### How Config Selection Works Now
In the new system, kernels are compiled for all tuned targets. At
runtime, if the current GPU does not have dedicated tuning, the library
uses the most_common_config policy to choose the best matching compiled
kernel.
The selection policy (tested in test_config_dispatch.cpp) attempts to
match, in decreasing priority:
1. Exact GPU model
2. Architecture
3. Generation
If no match is found, it falls back to the unknown target. If multiple
candidates match, the last one listed in the comp_targets type list is
chosen, which gives us a controlled and predictable fallback order.
We also pass the selected target into kernel compilation, enabling
compile-time specialization based on GPU, architecture, and generation.
### Target struct
The target struct currently stores only:
- GPU generation
- Architecture
- GPU Name
- Representation (rep), which distinguishes SPIR-V from native AMDGCN
The rep field is not yet functional (requires compiler support), and the
dispatch policy does not consider it at the moment. Also this target
structs makes it relatively easy to store more data.
### Scripts
The python script changes in this PR are there for scripts that used the
configs as input/output.
### Summary of Improvements:
- Better differentiation and selection across GPUs
- Cleaner C++17-based implementation
- Easier extension for future SPIR-V tuning
- Improved maintainability of config definitions
- Added more flexibility for future features.
## Test Plan
Some tests were added in test_config_dispatch.cpp, these and all the
other tests should pass. Also everything needs to be benchmarked to see
if the correct configs are chosen.
## Test Result
All tests pass, benchmarks are still WIP.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* First checkpoint * Second checkpoint - hot loop scheduler * Third checkpoint - init main operator * Fourth checkpoint - main loop ready * Fifth checkpoint - main loop fix * Sixth checkpoint - ReadWritecompFunc * Seventh checkpoint - Tail finished * [CK_TILE] Blockwise gemm pipeline v5 complete * Working * Working fixes 2 * Rename v5 to v77 temporarily * Data type adjustment * Data type adjustment 2 * [CK_TILE] Blockwise Gemm pipeline v5 add tests * [CK_TILE] Fix calculation error * TEMP: check pipeline * Fix name to V6 * naming and documentation changes * WIP dump * Try fixing v1 * Failing tests v5 * Debugging * Changes v2 * F16 tests working great * Working BlockwiseGemmPipelineV5 as V6 * Cleanup and format * Merging changes part1 * [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6 * Remove commented code * Fix gfx950 build issues * Fix file formatting * Review changes, more concat info, add bf16 bf8 tests * Fix formatting * Add bf16 and bf8 tests --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com> [ROCm/composable_kernel commit: 634634f]
Motivation
Our previous configuration system had become limiting in several ways. Most importantly, it was not able to differentiate between individual GPUs when selecting config parameters. This made proper tuning difficult and prevented future work involving SPIR-V–specific tuning. In addition, the old approach relied heavily on complex template metaprogramming, which had become difficult to maintain. With the move to C++17, we now have cleaner and more expressive language features available, making this a good opportunity to redesign the system.
Technical Details
All changes are internal. There are no API changes for users.
The majority of the diff in this PR consists of the new configuration definitions themselves, so while the PR appears large, the actual code changes are relatively small.
New Configuration Structure
Each algorithm now defines a *_config_picker templated on the target and value type. Below is a simplified example:
Each tuned target provides a similar overload. For untuned or unknown targets, we provide a general fallback:
All available tuned targets are listed in:
using <algo_name>_targets = comp_targets< comp_target<gen::gcn5, target_arch::gfx906, gpu::mi50, rep::amdgcn>, ..., comp_target<gen::unknown, target_arch::unknown, gpu::generic, rep::amdgcn>>;How Config Selection Works Now
In the new system, kernels are compiled for all tuned targets. At runtime, if the current GPU does not have dedicated tuning, the library uses the most_common_config policy to choose the best matching compiled kernel.
The selection policy (tested in test_config_dispatch.cpp) attempts to match, in decreasing priority:
If no match is found, it falls back to the unknown target. If multiple candidates match, the last one listed in the comp_targets type list is chosen, which gives us a controlled and predictable fallback order.
We also pass the selected target into kernel compilation, enabling compile-time specialization based on GPU, architecture, and generation.
Target struct
The target struct currently stores only:
The rep field is not yet functional (requires compiler support), and the dispatch policy does not consider it at the moment. Also this target structs makes it relatively easy to store more data.
Scripts
The python script changes in this PR are there for scripts that used the configs as input/output.
Summary of Improvements:
Test Plan
Some tests were added in test_config_dispatch.cpp, these and all the other tests should pass. Also everything needs to be benchmarked to see if the correct configs are chosen.
Test Result
All tests pass, benchmarks are still WIP.
Submission Checklist