Develop stream 2025 05 06 #731
…the type trait interface"
spolifroni-amd left a comment
The changelog needs some changes and a few clarifications.
| * past default: `class AccType = detail::input_type_t<InitValueType>>` | ||
| * new default: `class AccType = rocprim::invoke_result_binary_op_t<rocprim::detail::input_type_t<InitValueType>, BinaryFunction>` | ||
| * Added function `is_build_in` into `rocprim::traits::get`. |
Move to the "added" section.
Moved it to the added section.
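For readers comparing the two `AccType` defaults quoted in the hunk above, here is a minimal sketch of the difference using only the C++17 standard library. The names `binary_op_result_t`, `sum_as_int`, and `fold_with` are hypothetical and exist only for illustration; `std::invoke_result_t` stands in for `rocprim::invoke_result_binary_op_t`, whose exact definition is not shown in this thread.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the alias named in the changelog (an assumption,
// not rocPRIM's definition): the type produced by calling the operator on two T.
template <class T, class BinaryFunction>
using binary_op_result_t = std::invoke_result_t<BinaryFunction, T, T>;

// Illustrative operator that returns a wider type than its inputs.
struct sum_as_int
{
    int operator()(int a, int b) const { return a + b; }
};

// Past default (sketch): the accumulator is the input type.
using past_default_acc = unsigned char;
// New default (sketch): the accumulator is the operator's result type.
using new_default_acc = binary_op_result_t<unsigned char, sum_as_int>;

static_assert(std::is_same_v<new_default_acc, int>,
              "the result type of sum_as_int on unsigned char inputs is int");

// Sequential fold that stores every intermediate result in Acc.
template <class Acc, class T, class Op>
Acc fold_with(const std::vector<T>& in, Op op)
{
    Acc acc{};
    for (const T& x : in)
    {
        acc = static_cast<Acc>(op(acc, x));
    }
    return acc;
}
```

With `unsigned char` inputs `{200, 100}`, `fold_with<past_default_acc>` wraps around to 44, while `fold_with<new_default_acc>` keeps the full sum of 300.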
| * Added function `is_build_in` into `rocprim::traits::get`. | ||
| * Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits` respectively. | ||
| * Marked the initialisation constructor of `rocprim::reverse_iterator<Iter>` `explicit`, use `rocprim::make_reverse_iterator`. | ||
| * Dropped `c++14` support for rocprim. |
In the hipCUB changelog, it clarifies that there's only C++17 support now. Do you want to specify that here too? If not, move this to "removed".
| * Dropped `c++14` support for rocprim. | |
| * Dropped `C++14` support for rocprim. |
I both moved it to removed and clarified that we only support C++17 now.
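The `explicit` constructor change for `rocprim::reverse_iterator<Iter>` in the hunk above can be illustrated with the standard library analogue. This sketch uses `std::reverse_iterator` and `std::make_reverse_iterator`, not the rocPRIM types, on the assumption that the rocPRIM versions behave similarly; `last_element` and `reversed` are hypothetical helper names.

```cpp
#include <cassert>
#include <iterator>
#include <type_traits>
#include <vector>

using vec_iter = std::vector<int>::const_iterator;

// An explicit constructor allows direct construction but forbids implicit conversion.
static_assert(std::is_constructible_v<std::reverse_iterator<vec_iter>, vec_iter>,
              "direct construction from the base iterator is allowed");
static_assert(!std::is_convertible_v<vec_iter, std::reverse_iterator<vec_iter>>,
              "implicit conversion from the base iterator is rejected");

// Returns the last element of a non-empty vector through a reverse iterator.
inline int last_element(const std::vector<int>& v)
{
    // std::reverse_iterator<vec_iter> r = v.end(); // would not compile: constructor is explicit
    auto r = std::make_reverse_iterator(v.end());
    return *r;
}

// Builds a reversed copy using the factory function for both ends of the range.
inline std::vector<int> reversed(const std::vector<int>& v)
{
    return std::vector<int>(std::make_reverse_iterator(v.end()),
                            std::make_reverse_iterator(v.begin()));
}
```

Code that relied on an implicit conversion from an iterator would need to switch to the factory function or to direct construction.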
| * Merged `radix_key_codec` into type_traits system. | ||
| * Renamed `type_traits_interface.hpp` to `type_traits.hpp`, rename the original `type_traits.hpp` to `type_traits_functions.hpp`. | ||
| * Removed `radix_sort.hpp` | ||
| * Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. |
Move to the "added" section.
Moved it to the added section.
| * Renamed `type_traits_interface.hpp` to `type_traits.hpp`, rename the original `type_traits.hpp` to `type_traits_functions.hpp`. | ||
| * Removed `radix_sort.hpp` | ||
| * Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. | ||
| * Added initial value support to device level inclusive scans. |
Move to the "added" section.
Moved it to the added section.
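To make the "initial value support" entry for inclusive scans concrete, the following host-side sketch shows the semantics with the C++17 `std::inclusive_scan` overload that takes an initial value. It is an illustration of the concept only and assumes the device-level rocPRIM scan follows the same convention; the wrapper names are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Inclusive scan without an initial value: out[i] = in[0] + ... + in[i].
inline std::vector<int> inclusive_scan_plain(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin());
    return out;
}

// Inclusive scan with an initial value: out[i] = init + in[0] + ... + in[i].
inline std::vector<int> inclusive_scan_with_init(const std::vector<int>& in, int init)
{
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, init);
    return out;
}
```

For the input `{1, 2, 3, 4}`, the plain scan yields `{1, 3, 6, 10}`, and the scan with an initial value of 10 yields `{11, 13, 16, 20}`.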
| * Removed `radix_sort.hpp` | ||
| * Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. | ||
| * Added initial value support to device level inclusive scans. | ||
| * Made new optimization for `device_transform` when the input and output are pointers. |
I think this should be in "added". Not sure though because the sentence isn't clear. Is this a backend change where optimization has been added, or is this something a user would need to make changes for?
Tried to clarify the statement and moved it to added.
| * Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. | ||
| * Added initial value support to device level inclusive scans. | ||
| * Made new optimization for `device_transform` when the input and output are pointers. | ||
| * Added `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers. |
Moved it to the added section.
| ### Resolved issues | ||
| * Fixed `device_batch_memcpy` its reported benchmarking throughput being 2x lower than it was in reality. |
| * Fixed `device_batch_memcpy` its reported benchmarking throughput being 2x lower than it was in reality. | |
| * Fixed an issue where `device_batch_memcpy` reported benchmarking throughput being 2x lower than it was in reality. |
Made the suggested changes.
| ### Resolved issues | ||
| * Fixed `device_batch_memcpy` its reported benchmarking throughput being 2x lower than it was in reality. | ||
| * Fixed `device_segmented_reduce` its reported autotuning throughput being 5x lower than it was in reality. |
| * Fixed `device_segmented_reduce` its reported autotuning throughput being 5x lower than it was in reality. | |
| * Fixed an issue where `device_segmented_reduce` reported autotuning throughput being 5x lower than it was in reality. |
Made the suggested changes.
I am currently looking into the device_scan failure on gfx90a. I have found two issues so far, and I already have a fix for one of them. The …
I have added a fix and a temporary fix for the device_scan failure.
Closing the pull request in this repo. Please refer to the migrated pull request for updates.
The changes we made to the benchmarks:
- Applied new benchmark abstraction to increase maintainability.
- Introduced device_ptr to the benchmarks.
- Output more data in JSON output of the benchmarks.

Deprecation and removal related to major release:
- Remove [[deprecated]]float_bit_mask and all uses of it from the type trait interface, ROCm/rocPRIM@a472e80.
- Remove short_radix_bits in segmented_radix_sort_config_params, ROCm/rocPRIM@38ac5a5.
- Remove already deprecated functions, ROCm/rocPRIM@4d265b7.
- Move rocprim::detail::radix_key_codec_base into traits system, ROCm/rocPRIM@7c664b3.

CCCL 2.7 parity changes:
- Add virtual shared memory fallback to device_merge, ROCm/rocPRIM@ca3b0cd.
- Add device-level inclusive_scan with initial value support, ROCm/rocPRIM@cff2e16.
- Change default scan accumulator type to be in line with (hip) CUB, ROCm/rocPRIM@5f7accb.
- Match CUB's behavior in rocPRIM for device merge, ROCm/rocPRIM@4239ef5.
- Create tests for rocPRIM's bit_cast, ROCm/rocPRIM@668f913.

Performance optimizations:
- Implement tuning for rocprim::search_n, ROCm/rocPRIM@79b4655.
- Make use of vectorized load in rocprim::transform, ROCm/rocPRIM@a755431.

SPIR-V support changes:
- ROCPRIM_WAVEFRONT_SIZE' from architecture defines, ROCm/rocPRIM@68354b7.
- Implement mechanism for wavefront size-based dispatching, ROCm/rocPRIM@ff1b0c5.
- Fix various compile issues when targeting spir-v, ROCm/rocPRIM@53a1bea.
- SPIR-V: warp sort, ROCm/rocPRIM@1ed975a.
- Disable dispatching with macro for usage with spir-v, ROCm/rocPRIM@960461d.
- Additional fix for warp_sort for SPIR-v, ROCm/rocPRIM@c50cd7a.
- fix(intrinsics/atomics.hpp): fix atomics when compiler to spirv, ROCm/rocPRIM@527c24c.
- fix: improve compatibility with spir-v target in algorithms using 'lane_mask_type', ROCm/rocPRIM@5815656.
- Also some additions to our CI for SPIR-V.
- SPIR-V: warp reduce/scan, ROCm/rocPRIM@ab9dc0a.
- SPIR-V: block scan/reduce/RLD, ROCm/rocPRIM@9f0dcf1.
- Temporarily stop running device_partition test for SPIR-V due to hanging, ROCm/rocPRIM@75820ee.
- SPIR-V: warp exchange/load/store, ROCm/rocPRIM@d3a8911.
- SPIR-V: block exchange/load/store (and funcs), ROCm/rocPRIM@cff88f8.
- SPIR-V: block radix rank/sort, ROCm/rocPRIM@cc1c028.
- SPIR-V: lookback_scan_state, ROCm/rocPRIM@1ed863a.
- Added generic pragmas and created fallback for atomics, ROCm/rocPRIM@4784b69.

General bug fixes:
- Fix the warning: explicit specialization cannot have a storage class, ROCm/rocPRIM@0249ce1.
- Fix compilation failure in hipCUB/rocThrust to rocPRIM, ROCm/rocPRIM@6c747d3.
- Fix some compile issues introduced in rocm 6.4, ROCm/rocPRIM@87b473f.
- Fix compile warning in thread_load for the new compiler, ROCm/rocPRIM@a69a9bb.
- Call non-static method properly in data generation utility, ROCm/rocPRIM@6ad6c2d.
- device_merge_sort custom_huge_type failing test, ROCm/rocPRIM@7caf280.
- Fix for device_run_length_encode failing test, ROCm/rocPRIM@b2bb04c.
- fix: skip including the init value in block aggregate for warp and block scan, https://projects.streamhpc.com/amd/libraries/rocPRIM/-/commit/5bad310091a66bf84057ce1bb1e5e11ddf79f40c.

Testing coverage:
Prepare to move 'lookback_scan' to public API, ROCm/rocPRIM@7da3c6a.

---
🔁 Imported from [ROCm/rocPRIM#731](ROCm/rocPRIM#731)
🧑💻 Originally authored by @NB4444

---------
Co-authored-by: Sander Bos <sander@streamhpc.com>
Co-authored-by: Yung-Sheng Tu <yung-sheng@streamhpc.com>
Co-authored-by: Cenxuan Tian <cenxuan@streamhpc.com>
Co-authored-by: Nick Breed <nick@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
Co-authored-by: Balint Csala <balint.csala@streamhpc.com>
Co-authored-by: Jaap Blok <jaap@streamhpc.com>
Co-authored-by: Nara Prasetya <nara@streamhpc.com>
Co-authored-by: Borys Petrov <borys@streamhpc.com>
Co-authored-by: Balint Siklosi <balint.siklosi@streamhpc.com>
Co-authored-by: Saiyang Zhang <saiyang@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Joseph Macaranas <Joseph.Macaranas@amd.com>