
Develop stream 2025 05 06 #731

Closed
NB4444 wants to merge 84 commits into ROCm:develop from StreamHPC:develop_stream_2025_05_06

Conversation

@NB4444
Contributor

@NB4444 NB4444 commented May 6, 2025

The changes we made to the benchmarks:

  • Applied new benchmark abstraction to increase maintainability.
  • Introduced device_ptr to the benchmarks.
  • Output more data in JSON output of the benchmarks.

Deprecation and removal related to major release:

  • Remove [[deprecated]] float_bit_mask and all uses of it from the type trait interface, a472e80.
  • Remove short_radix_bits in segmented_radix_sort_config_params, 38ac5a5.
  • Remove already deprecated functions, 4d265b7.
  • Move rocprim::detail::radix_key_codec_base into traits system, 7c664b3.

CCCL 2.7 parity changes:

  • Add virtual shared memory fallback to device_merge, ca3b0cd.
  • Add device-level inclusive_scan with initial value support, cff2e16.
  • Change default scan accumulator type to be in line with (hip) CUB, 5f7accb.
  • Match CUB's behavior in rocPRIM for device merge, 4239ef5.
  • Create tests for rocPRIM's bit_cast, 668f913.
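For readers unfamiliar with the "inclusive scan with initial value" item above, here is a minimal host-side sketch of the intended semantics using the C++17 standard library. It is not rocPRIM's device implementation, and `scan_with_init` and `check_scan_with_init` are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Host-side reference for an inclusive scan seeded with an initial value:
// out[i] = init + in[0] + ... + in[i].
// scan_with_init is a hypothetical helper name, not a rocPRIM function.
std::vector<int> scan_with_init(const std::vector<int>& in, int init)
{
    std::vector<int> out(in.size());
    // C++17 overload taking (first, last, d_first, binary_op, init).
    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, init);
    return out;
}

// Small self-check: the initial value is folded into every output element.
void check_scan_with_init()
{
    const std::vector<int> out = scan_with_init({1, 2, 3, 4}, 10);
    assert((out == std::vector<int>{11, 13, 16, 20}));
}
```

A device-level version would presumably be checked against a host reference like this one in its tests, but that is an assumption about usage rather than something stated in this pull request.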

Performance optimizations:

  • Implement tuning for rocprim::search_n, 79b4655.
  • Make use of vectorized load in rocprim::transform, a755431.
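The idea behind the vectorized load item above can be sketched on the host as follows. This is only an illustration of the general technique (moving several elements per memory access when input and output are plain pointers), not the code from a755431; `float4_pack` and `transform_wide` are hypothetical names, and the width of four floats is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical wide "load type": four floats moved as one 16-byte unit.
struct alignas(16) float4_pack
{
    float v[4];
};

// Applies op to every element, reading and writing four elements at a time
// where possible and falling back to scalar accesses for the tail.
template<class UnaryOp>
void transform_wide(const float* in, float* out, std::size_t n, UnaryOp op)
{
    assert((in != nullptr && out != nullptr) || n == 0);
    constexpr std::size_t width = 4;
    std::size_t i = 0;
    for(; i + width <= n; i += width)
    {
        float4_pack pack;
        std::memcpy(&pack, in + i, sizeof(pack)); // one wide load
        for(float& x : pack.v)
        {
            x = op(x);
        }
        std::memcpy(out + i, &pack, sizeof(pack)); // one wide store
    }
    for(; i < n; ++i) // scalar tail for the remaining elements
    {
        out[i] = op(in[i]);
    }
}

// Convenience wrapper used to exercise the sketch with std::vector.
std::vector<float> doubled(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    transform_wide(in.data(), out.data(), in.size(), [](float x) { return x * 2.0f; });
    return out;
}
```

The wide path only applies to raw pointers because arbitrary iterators give no guarantee that neighbouring elements are contiguous in memory.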

SPIR-V support changes:

  • ROCPRIM_WAVEFRONT_SIZE' from architecture defines, 68354b7.
  • Implement mechanism for wavefront size-based dispatching, ff1b0c5.
  • Fix various compile issues when targeting SPIR-V, 53a1bea.
  • SPIR-V: warp sort, 1ed975a.
  • Disable dispatching with macro for usage with SPIR-V, 960461d.
  • Additional fix for warp_sort for SPIR-V, c50cd7a.
  • fix(intrinsics/atomics.hpp): fix atomics when compiling to SPIR-V, 527c24c.
  • fix: improve compatibility with SPIR-V target in algorithms using 'lane_mask_type', 5815656.
  • Also some additions to our CI for SPIR-V.
  • SPIR-V: warp reduce/scan, ab9dc0a.
  • SPIR-V: block scan/reduce/RLD, 9f0dcf1.
  • Temporarily stop running device_partition test for SPIR-V due to hanging, 75820ee.
  • SPIR-V: warp exchange/load/store, d3a8911.
  • SPIR-V: block exchange/load/store (and funcs), cff88f8.
  • SPIR-V: block radix rank/sort, cc1c028.
  • SPIR-V: lookback_scan_state, 1ed863a.
  • Added generic pragmas and created fallback for atomics, 4784b69.

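General bug fixes:

  • Fix the warning: explicit specialization cannot have a storage class, 0249ce1.
  • Fix compilation failure in hipCUB/rocThrust to rocPRIM, 6c747d3.
  • Fix some compile issues introduced in ROCm 6.4, 87b473f.
  • Fix compile warning in thread_load for the new compiler, a69a9bb.
  • Call non-static method properly in data generation utility, 6ad6c2d.
  • device_merge_sort custom_huge_type failing test, 7caf280.
  • Fix for device_run_length_encode failing test, b2bb04c.
  • fix: skip including the init value in block aggregate for warp and block scan, https://projects.streamhpc.com/amd/libraries/rocPRIM/-/commit/5bad310091a66bf84057ce1bb1e5e11ddf79f40c.

Testing coverage:

  • Prepare to move 'lookback_scan' to public API, 7da3c6a.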

MyNameIsTrez and others added 30 commits May 6, 2025 13:37
@NB4444 NB4444 marked this pull request as ready for review May 12, 2025 12:50
Contributor

@spolifroni-amd spolifroni-amd left a comment

The changelog needs some changes and a few clarifications.

Comment thread CHANGELOG.md Outdated
* past default: `class AccType = detail::input_type_t<InitValueType>>`
* new default: `class AccType = rocprim::invoke_result_binary_op_t<rocprim::detail::input_type_t<InitValueType>, BinaryFunction>`

* Added function `is_build_in` into `rocprim::traits::get`.
Contributor

Move to the "added" section.

Contributor Author

Moved it to the added section.
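The accumulator default quoted in the CHANGELOG excerpt of this thread changes from the initial value's type to the result type of the binary operator. A minimal sketch of what that means, assuming `rocprim::invoke_result_binary_op_t<T, F>` behaves like `std::invoke_result_t<F, T, T>` (an assumption, not confirmed here); `binary_op_result_t`, `widening_plus`, `past_acc_t` and `new_acc_t` are hypothetical names:

```cpp
#include <type_traits>

// Hypothetical stand-in for rocprim::invoke_result_binary_op_t: the type
// produced by calling the binary function on two values of type T.
template<class T, class BinaryFunction>
using binary_op_result_t = std::invoke_result_t<BinaryFunction, T, T>;

// A binary operator that widens its result to avoid overflow.
struct widening_plus
{
    long long operator()(int a, int b) const
    {
        return static_cast<long long>(a) + b;
    }
};

// Past default (as quoted): the accumulator follows the initial value's type,
// assumed here to be int.
using past_acc_t = int;
// New default (as quoted): the accumulator follows the operator's result type.
using new_acc_t = binary_op_result_t<int, widening_plus>;

static_assert(std::is_same_v<past_acc_t, int>);
static_assert(std::is_same_v<new_acc_t, long long>);
```

With an operator like this, code that relied on the old default may see a wider accumulator unless it passes `AccType` explicitly.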

Comment thread CHANGELOG.md Outdated
* Added function `is_build_in` into `rocprim::traits::get`.
* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits` respectively.
* Marked the initialisation constructor of `rocprim::reverse_iterator<Iter>` `explicit`, use `rocprim::make_reverse_iterator`.
* Dropped `c++14` support for rocprim.
Contributor

In the hipCUB changelog, it clarifies that there's only C++17 support now. Do you want to specify that here too? If not, move this to "removed".

Suggested change
- * Dropped `c++14` support for rocprim.
+ * Dropped `C++14` support for rocprim.

Contributor Author

I both moved it to removed and clarified that we only support C++17 now.
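The `explicit` constructor entry quoted in this thread has a direct analog in the standard library, which may help show the effect on calling code. The sketch below uses `std::reverse_iterator` and `std::make_reverse_iterator` only; it assumes, without confirmation from this pull request, that the rocPRIM counterparts behave the same way. `vec_iter`, `rev_iter` and `last_element` are hypothetical names.

```cpp
#include <cassert>
#include <iterator>
#include <type_traits>
#include <vector>

using vec_iter = std::vector<int>::iterator;
using rev_iter = std::reverse_iterator<vec_iter>;

// An explicit constructor rules out implicit conversion from the base iterator...
static_assert(!std::is_convertible_v<vec_iter, rev_iter>);
// ...while direct construction remains available.
static_assert(std::is_constructible_v<rev_iter, vec_iter>);

// Returns the last element by building a reverse iterator with the factory
// function instead of relying on an implicit conversion.
int last_element(std::vector<int>& v)
{
    assert(!v.empty());
    auto rit = std::make_reverse_iterator(v.end());
    return *rit; // dereferences to the element just before v.end()
}
```

Code that previously relied on an implicit conversion would switch to the factory function or to direct construction.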

Comment thread CHANGELOG.md Outdated
* Merged `radix_key_codec` into type_traits system.
* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, rename the original `type_traits.hpp` to `type_traits_functions.hpp`.
* Removed `radix_sort.hpp`
* Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature.
Contributor

move to the "added" section.

Contributor Author

Moved it to the added section.

Comment thread CHANGELOG.md Outdated
* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, rename the original `type_traits.hpp` to `type_traits_functions.hpp`.
* Removed `radix_sort.hpp`
* Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature.
* Added initial value support to device level inclusive scans.
Contributor

Move to the "added" section

Contributor Author

Moved it to the added section.

Comment thread CHANGELOG.md Outdated
* Removed `radix_sort.hpp`
* Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature.
* Added initial value support to device level inclusive scans.
* Made new optimization for `device_transform` when the input and output are pointers.
Contributor

I think this should be in "added". Not sure though because the sentence isn't clear. Is this a backend change where optimization has been added, or is this something a user would need to make changes for?

Contributor Author

Tried to clarify the statement and moved it to added.

Comment thread CHANGELOG.md Outdated
* Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature.
* Added initial value support to device level inclusive scans.
* Made new optimization for `device_transform` when the input and output are pointers.
* Added `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers.
Contributor

Move to "added"

Contributor Author

Moved it to the added section.

Comment thread CHANGELOG.md Outdated

### Resolved issues

* Fixed `device_batch_memcpy` its reported benchmarking throughput being 2x lower than it was in reality.
Contributor

Suggested change
- * Fixed `device_batch_memcpy` its reported benchmarking throughput being 2x lower than it was in reality.
+ * Fixed an issue where `device_batch_memcpy` reported benchmarking throughput being 2x lower than it was in reality.

Contributor Author

Made the suggested changes.

Comment thread CHANGELOG.md Outdated
### Resolved issues

* Fixed `device_batch_memcpy` its reported benchmarking throughput being 2x lower than it was in reality.
* Fixed `device_segmented_reduce` its reported autotuning throughput being 5x lower than it was in reality.
Contributor

Suggested change
- * Fixed `device_segmented_reduce` its reported autotuning throughput being 5x lower than it was in reality.
+ * Fixed an issue where `device_segmented_reduce` reported autotuning throughput being 5x lower than it was in reality.

Contributor Author

Made the suggested changes.

@NB4444
Contributor Author

NB4444 commented May 14, 2025

I am currently looking into the device_scan failure on gfx90a. I have found two issues so far and already have a fix for one of them: block_scan_reduce_then_scan was missing some wave_barrier calls. The other seems to be an issue in thread_scan_inclusive. It might be a compiler issue, but I have had no luck so far in creating a smaller reproducer.

@NB4444
Contributor Author

NB4444 commented May 19, 2025

I have added a fix and a temp fix for the device_scan failure.

@jayhawk-commits
Contributor

Closing the pull request in this repo. Please refer to the migrated pull request for updates.

stanleytsang-amd pushed a commit to ROCm/rocm-libraries that referenced this pull request May 22, 2025
The changes we made to the benchmarks:
- Applied new benchmark abstraction to increase maintainability.
- Introduced device_ptr to the benchmarks.
- Output more data in JSON output of the benchmarks.

Deprecation and removal related to major release:
- Remove [[deprecated]]float_bit_mask and all uses of it from the type
trait interface,
ROCm/rocPRIM@a472e80.
- Remove short_radix_bits in segmented_radix_sort_config_params,
ROCm/rocPRIM@38ac5a5.
- Remove already deprecated functions,
ROCm/rocPRIM@4d265b7.
- Move rocprim::detail::radix_key_codec_base into traits system,
ROCm/rocPRIM@7c664b3.

CCCL 2.7 parity changes:
- Add virtual shared memory fallback to device_merge,
ROCm/rocPRIM@ca3b0cd.
- Add device-level inclusive_scan with initial value support,
ROCm/rocPRIM@cff2e16.
- Change default scan accumulator type to be in line with (hip) CUB,
ROCm/rocPRIM@5f7accb.
- Match CUB's behavior in rocPRIM for device merge,
ROCm/rocPRIM@4239ef5.
- Create tests for rocPRIM's bit_cast,
ROCm/rocPRIM@668f913.

Performance optimizations:
- Implement tuning for rocprim::search_n,
ROCm/rocPRIM@79b4655.
- Make use of vectorized load in rocprim::transform,
ROCm/rocPRIM@a755431.

SPIR-V support changes:
- ROCPRIM_WAVEFRONT_SIZE' from architecture defines,
ROCm/rocPRIM@68354b7.
- Implement mechanism for wavefront size-based dispatching,
ROCm/rocPRIM@ff1b0c5.
- Fix various compile issues when targeting spir-v,
ROCm/rocPRIM@53a1bea.
- SPIR-V: warp sort,
ROCm/rocPRIM@1ed975a.
- Disable dispatching with macro for usage with spir-v,
ROCm/rocPRIM@960461d.
- Additional fix for warp_sort for SPIR-v,
ROCm/rocPRIM@c50cd7a.
- fix(intrinsics/atomics.hpp): fix atomics when compiler to spirv,
ROCm/rocPRIM@527c24c.
- fix: improve compatibility with spir-v target in algorithms using
'lane_mask_type',
ROCm/rocPRIM@5815656.
- Also some additions to our CI for SPIR-V.
- SPIR-V: warp reduce/scan,
ROCm/rocPRIM@ab9dc0a.
- SPIR-V: block scan/reduce/RLD,
ROCm/rocPRIM@9f0dcf1.
- Temporarily stop running device_partition test for SPIR-V due to
hanging,
ROCm/rocPRIM@75820ee.
- SPIR-V: warp exchange/load/store,
ROCm/rocPRIM@d3a8911.
- SPIR-V: block exchange/load/store (and funcs),
ROCm/rocPRIM@cff88f8.
- SPIR-V: block radix rank/sort,
ROCm/rocPRIM@cc1c028.
- SPIR-V: lookback_scan_state,
ROCm/rocPRIM@1ed863a.
- Added generic pragmas and created fallback for atomics,
ROCm/rocPRIM@4784b69.

General bug fixes:
- Fix the warning: explicit specialization cannot have a storage class,
ROCm/rocPRIM@0249ce1.
- Fix compilation failure in hipCUB/rocThrust to rocPRIM,
ROCm/rocPRIM@6c747d3.
- Fix some compile issues introduced in rocm 6.4,
ROCm/rocPRIM@87b473f.
- Fix compile warning in thread_load for the new compiler,
ROCm/rocPRIM@a69a9bb.
- Call non-static method properly in data generation utility,
ROCm/rocPRIM@6ad6c2d.
- device_merge_sort custom_huge_type failing test,
ROCm/rocPRIM@7caf280.
- Fix for device_run_length_encode failing test,
ROCm/rocPRIM@b2bb04c.
- fix: skip including the init value in block aggregate for warp and
block scan,
https://projects.streamhpc.com/amd/libraries/rocPRIM/-/commit/5bad310091a66bf84057ce1bb1e5e11ddf79f40c.

Testing coverage:
Prepare to move 'lookback_scan' to public API,
ROCm/rocPRIM@7da3c6a.

---
🔁 Imported from
[ROCm/rocPRIM#731](ROCm/rocPRIM#731)
🧑‍💻 Originally authored by @NB4444

---------

Co-authored-by: Sander Bos <sander@streamhpc.com>
Co-authored-by: Yung-Sheng Tu <yung-sheng@streamhpc.com>
Co-authored-by: Cenxuan Tian <cenxuan@streamhpc.com>
Co-authored-by: Nick Breed <nick@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
Co-authored-by: Balint Csala <balint.csala@streamhpc.com>
Co-authored-by: Jaap Blok <jaap@streamhpc.com>
Co-authored-by: Nara Prasetya <nara@streamhpc.com>
Co-authored-by: Borys Petrov <borys@streamhpc.com>
Co-authored-by: Balint Siklosi <balint.siklosi@streamhpc.com>
Co-authored-by: Saiyang Zhang <saiyang@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Joseph Macaranas <Joseph.Macaranas@amd.com>
ammallya pushed a commit that referenced this pull request Oct 28, 2025