Use HIP API shfl functions rather than hc::#28
Conversation
The HIP Platform should have it's own shuffle functions, and when using HIP-Clang, we don't want to use HCC specific functions.
|
Like I said in #23, shuffle functions still do not work correctly in HIP when compiled with |
|
Right, sorry about that. I made this PR to track it. So if shfl functions are merged into ROCm 1.8.x, then this should work. |
|
Yes, I think we have two options:
|
Does the develop-gfx906 get merged into develop later? |
|
The plan was to merge |
|
I'm not sure the timeline for roc-1.9.x to be released, but I think it should include |
|
I just pushed a branch |
|
Do you think we can close this PR and move the discussion to #22? I'll leave a comment in that issue about what, in our opinion, would be the best way to handle adding changes required for hip-clang. |
|
We can use macro |
That's a good point, but Anyway, it's good to know |
|
we could keep develop-hip-clang branch until all PR's are merged. |
|
So in this case we need #if defined(ROCPRIM_HC_API) || (defined(__HIP_PLATFORM_HCC__) && !defined(__HIP__))right? That way we can merge it to |
|
I've added the case as you've suggested. |
Use HIP API shfl functions rather than hc:: [ROCm/rocPRIM commit: eff7d06]
[rocPRIM] Add gfx1152 and gfx1153
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Enable gfx1152 and gfx1153.
## Technical Details
1. combine arrays into tables and use local macros to reduce repetition
(for maintainability)
2. monkey-see-monkey-do wherever `gfx11...` was found
## Test Plan
Build existing ctests for, and run them on, gfx1152 and gfx1153.
## Test Result
###
<details>
<summary>gfx1152 passed (click to see log)</summary>
```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 2 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
Start 1: hip.device_api
Start 2: hip.async_copy
1/73 Test #2: hip.async_copy .............................. Passed 0.01 sec
Start 3: hip.ordered_block_id
2/73 Test #1: hip.device_api .............................. Passed 0.02 sec
Start 4: rocprim.internal_merge_path
3/73 Test #4: rocprim.internal_merge_path ................. Passed 0.01 sec
Start 5: rocprim.basic_test
4/73 Test #3: hip.ordered_block_id ........................ Passed 0.01 sec
Start 6: rocprim.arg_index_iterator
5/73 Test #5: rocprim.basic_test .......................... Passed 0.01 sec
Start 7: rocprim.temporary_storage_partitioning
6/73 Test #6: rocprim.arg_index_iterator .................. Passed 0.01 sec
Start 8: rocprim.block_adjacent_difference
7/73 Test #7: rocprim.temporary_storage_partitioning ...... Passed 0.01 sec
Start 9: rocprim.block_discontinuity
8/73 Test #8: rocprim.block_adjacent_difference ........... Passed 2.34 sec
Start 10: rocprim.bit_cast
9/73 Test #10: rocprim.bit_cast ............................ Passed 0.02 sec
Start 11: rocprim.block_exchange
10/73 Test #11: rocprim.block_exchange ...................... Passed 0.73 sec
Start 12: rocprim.block_histogram
11/73 Test #12: rocprim.block_histogram ..................... Passed 0.54 sec
Start 13: rocprim.block_load_store
12/73 Test #13: rocprim.block_load_store .................... Passed 0.44 sec
Start 14: rocprim.block_sort_merge
13/73 Test #14: rocprim.block_sort_merge .................... Passed 0.02 sec
Start 15: rocprim.block_sort_merge_stable
14/73 Test #15: rocprim.block_sort_merge_stable ............. Passed 0.02 sec
Start 16: rocprim.block_radix_rank
15/73 Test #16: rocprim.block_radix_rank .................... Passed 0.03 sec
Start 17: rocprim.block_radix_sort
16/73 Test #17: rocprim.block_radix_sort .................... Passed 4.79 sec
Start 18: rocprim.block_reduce
17/73 Test #18: rocprim.block_reduce ........................ Passed 0.26 sec
Start 19: rocprim.block_run_length_decode
18/73 Test #19: rocprim.block_run_length_decode ............. Passed 0.54 sec
Start 20: rocprim.block_scan
19/73 Test #20: rocprim.block_scan .......................... Passed 0.04 sec
Start 21: rocprim.block_shuffle
20/73 Test #21: rocprim.block_shuffle ....................... Passed 2.70 sec
Start 22: rocprim.block_sort_bitonic
21/73 Test #9: rocprim.block_discontinuity ................. Passed 17.36 sec
Start 23: rocprim.config_dispatch
22/73 Test #23: rocprim.config_dispatch ..................... Passed 0.09 sec
Start 24: rocprim.constant_iterator
23/73 Test #24: rocprim.constant_iterator ................... Passed 0.07 sec
Start 25: rocprim.counting_iterator
24/73 Test #25: rocprim.counting_iterator ................... Passed 0.07 sec
Start 26: rocprim.device_batch_memcpy
25/73 Test #26: rocprim.device_batch_memcpy ................. Passed 1.20 sec
Start 27: rocprim.device_binary_search
26/73 Test #27: rocprim.device_binary_search ................ Passed 0.02 sec
Start 28: rocprim.device_adjacent_difference
27/73 Test #28: rocprim.device_adjacent_difference .......... Passed 0.01 sec
Start 29: rocprim.device_adjacent_find
28/73 Test #29: rocprim.device_adjacent_find ................ Passed 0.01 sec
Start 30: rocprim.device_find_end
29/73 Test #30: rocprim.device_find_end ..................... Passed 0.01 sec
Start 31: rocprim.device_histogram
30/73 Test #22: rocprim.block_sort_bitonic .................. Passed 13.23 sec
Start 32: rocprim.device_merge
31/73 Test #31: rocprim.device_histogram .................... Passed 7.85 sec
Start 33: rocprim.nth_element
32/73 Test #33: rocprim.nth_element ......................... Passed 0.03 sec
Start 34: rocprim.device_partial_sort
33/73 Test #34: rocprim.device_partial_sort ................. Passed 0.02 sec
Start 35: rocprim.device_reduce
34/73 Test #35: rocprim.device_reduce ....................... Passed 8.94 sec
Start 36: rocprim.device_run_length_encode
35/73 Test #32: rocprim.device_merge ........................ Passed 14.05 sec
Start 37: rocprim.device_search
36/73 Test #37: rocprim.device_search ....................... Passed 0.02 sec
Start 38: rocprim.device_segmented_radix_sort
37/73 Test #36: rocprim.device_run_length_encode ............ Passed 13.92 sec
Start 39: rocprim.device_search_n
38/73 Test #39: rocprim.device_search_n ..................... Passed 0.02 sec
Start 40: rocprim.device_segmented_reduce
39/73 Test #40: rocprim.device_segmented_reduce ............. Passed 5.21 sec
Start 41: rocprim.device_segmented_scan
40/73 Test #41: rocprim.device_segmented_scan ............... Passed 0.02 sec
Start 42: rocprim.device_transform
41/73 Test #42: rocprim.device_transform .................... Passed 13.90 sec
Start 43: rocprim.discard_iterator
42/73 Test #43: rocprim.discard_iterator .................... Passed 0.07 sec
Start 44: rocprim.radix_key_codec
43/73 Test #44: rocprim.radix_key_codec ..................... Pas09:54:11 [55/1943]
Start 45: rocprim.predicate_iterator
44/73 Test #45: rocprim.predicate_iterator .................. Passed 0.07 sec
Start 46: rocprim.reverse_iterator
45/73 Test #46: rocprim.reverse_iterator .................... Passed 0.09 sec
Start 47: rocprim.rocprim_tuple
46/73 Test #47: rocprim.rocprim_tuple ....................... Passed 0.01 sec
Start 48: rocprim.rocprim_types
47/73 Test #48: rocprim.rocprim_types ....................... Passed 0.01 sec
Start 49: rocprim.texture_cache_iterator
48/73 Test #49: rocprim.texture_cache_iterator .............. Passed 0.01 sec
Start 50: rocprim.thread
49/73 Test #50: rocprim.thread .............................. Passed 0.07 sec
Start 51: rocprim.thread_algos
50/73 Test #51: rocprim.thread_algos ........................ Passed 0.35 sec
Start 52: rocprim.tuple
51/73 Test #52: rocprim.tuple ............................... Passed 0.02 sec
Start 53: rocprim.utils_sort_checker
52/73 Test #53: rocprim.utils_sort_checker .................. Passed 0.01 sec
Start 54: rocprim.transform_iterator
53/73 Test #54: rocprim.transform_iterator .................. Passed 0.11 sec
Start 55: rocprim.type_traits_interface_cpp17
54/73 Test #55: rocprim.type_traits_interface_cpp17 ......... Passed 0.01 sec
Start 56: rocprim.type_traits_interface_gnupp17
55/73 Test #56: rocprim.type_traits_interface_gnupp17 ....... Passed 0.01 sec
Start 57: rocprim.type_traits_interface_cpp20
56/73 Test #57: rocprim.type_traits_interface_cpp20 ......... Passed 0.01 sec
Start 58: rocprim.type_traits_interface_gnupp20
57/73 Test #58: rocprim.type_traits_interface_gnupp20 ....... Passed 0.01 sec
Start 59: rocprim.no_half_operators
58/73 Test #59: rocprim.no_half_operators ................... Passed 0.01 sec
Start 60: rocprim.intrinsics
59/73 Test #60: rocprim.intrinsics .......................... Passed 0.21 sec
Start 61: rocprim.intrinsics_atomic
60/73 Test #61: rocprim.intrinsics_atomic ................... Passed 0.02 sec
Start 62: rocprim.invoke_result
61/73 Test #62: rocprim.invoke_result ....................... Passed 0.01 sec
Start 63: rocprim.warp_exchange
62/73 Test #63: rocprim.warp_exchange ....................... Passed 0.08 sec
Start 64: rocprim.warp_load
63/73 Test #64: rocprim.warp_load ........................... Passed 0.08 sec
Start 65: rocprim.warp_reduce
64/73 Test #65: rocprim.warp_reduce ......................... Passed 0.14 sec
Start 66: rocprim.warp_scan
65/73 Test #66: rocprim.warp_scan ........................... Passed 0.20 sec
Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
66/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ... Passed 0.21 sec
Start 68: rocprim.warp_sort
67/73 Test #68: rocprim.warp_sort ........................... Passed 0.09 sec
Start 69: rocprim.warp_store
68/73 Test #69: rocprim.warp_store .......................... Passed 0.02 sec
Start 70: rocprim.zip_iterator
69/73 Test #70: rocprim.zip_iterator ........................ Passed 0.02 sec
Start 71: rocprim.accumulator_t
70/73 Test #71: rocprim.accumulator_t ....................... Passed 0.02 sec
Start 72: hipgraph.basic
71/73 Test #72: hipgraph.basic .............................. Passed 0.02 sec
Start 73: hipgraph.algs
72/73 Test #73: hipgraph.algs ............................... Passed 0.01 sec
73/73 Test #38: rocprim.device_segmented_radix_sort ......... Passed 31.97 sec
100% tests passed, 0 tests failed out of 73
Total Test time (real) = 71.80 sec
✅ test_rocprim.py PASSED
```
</details>
<details>
<summary>gfx1153 passed (click to see log)</summary>
```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 1 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
Start 1: hip.device_api
1/73 Test #1: hip.device_api .............................. Passed 0.01 sec
Start 2: hip.async_copy
2/73 Test #2: hip.async_copy .............................. Passed 0.01 sec
Start 3: hip.ordered_block_id
3/73 Test #3: hip.ordered_block_id ........................ Passed 0.01 sec
Start 4: rocprim.internal_merge_path
4/73 Test #4: rocprim.internal_merge_path ................. Passed 0.01 sec
Start 5: rocprim.basic_test
5/73 Test #5: rocprim.basic_test .......................... Passed 0.01 sec
Start 6: rocprim.arg_index_iterator
6/73 Test #6: rocprim.arg_index_iterator .................. Passed 0.01 sec
Start 7: rocprim.temporary_storage_partitioning
7/73 Test #7: rocprim.temporary_storage_partitioning ...... Passed 0.01 sec
Start 8: rocprim.block_adjacent_difference
8/73 Test #8: rocprim.block_adjacent_difference ........... Passed 2.94 sec
Start 9: rocprim.block_discontinuity
9/73 Test #9: rocprim.block_discontinuity ................. Passed 21.10 sec
Start 10: rocprim.bit_cast
10/73 Test #10: rocprim.bit_cast ............................ Passed 0.01 sec
Start 11: rocprim.block_exchange
11/73 Test #11: rocprim.block_exchange ...................... Passed 2.20 sec
Start 12: rocprim.block_histogram
12/73 Test #12: rocprim.block_histogram ..................... Passed 0.71 sec
Start 13: rocprim.block_load_store
13/73 Test #13: rocprim.block_load_store .................... Passed 0.48 sec
Start 14: rocprim.block_sort_merge
14/73 Test #14: rocprim.block_sort_merge .................... Passed 0.02 sec
Start 15: rocprim.block_sort_merge_stable
15/73 Test #15: rocprim.block_sort_merge_stable ............. Passed 0.02 sec
Start 16: rocprim.block_radix_rank
16/73 Test #16: rocprim.block_radix_rank .................... Passed 0.02 sec
Start 17: rocprim.block_radix_sort
17/73 Test #17: rocprim.block_radix_sort .................... Passed 6.12 sec
Start 18: rocprim.block_reduce
18/73 Test #18: rocprim.block_reduce ........................ Passed 0.31 sec
Start 19: rocprim.block_run_length_decode
19/73 Test #19: rocprim.block_run_length_decode ............. Passed 0.68 sec
Start 20: rocprim.block_scan
20/73 Test #20: rocprim.block_scan .......................... Passed 0.03 sec
Start 21: rocprim.block_shuffle
21/73 Test #21: rocprim.block_shuffle ....................... Passed 3.63 sec
Start 22: rocprim.block_sort_bitonic
22/73 Test #22: rocprim.block_sort_bitonic .................. Passed 19.34 sec
Start 23: rocprim.config_dispatch
23/73 Test #23: rocprim.config_dispatch ..................... Passed 0.10 sec
Start 24: rocprim.constant_iterator
24/73 Test #24: rocprim.constant_iterator ................... Passed 0.09 sec
Start 25: rocprim.counting_iterator
25/73 Test #25: rocprim.counting_iterator ................... Passed 0.09 sec
Start 26: rocprim.device_batch_memcpy
26/73 Test #26: rocprim.device_batch_memcpy ................. Passed 1.42 sec
Start 27: rocprim.device_binary_search
27/73 Test #27: rocprim.device_binary_search ................ Passed 0.01 sec
Start 28: rocprim.device_adjacent_difference
28/73 Test #28: rocprim.device_adjacent_difference .......... Passed 0.01 sec
Start 29: rocprim.device_adjacent_find
29/73 Test #29: rocprim.device_adjacent_find ................ Passed 0.01 sec
Start 30: rocprim.device_find_end
30/73 Test #30: rocprim.device_find_end ..................... Passed 0.01 sec
Start 31: rocprim.device_histogram
31/73 Test #31: rocprim.device_histogram .................... Passed 8.77 sec
Start 32: rocprim.device_merge
32/73 Test #32: rocprim.device_merge ........................ Passed 16.23 sec
Start 33: rocprim.nth_element
33/73 Test #33: rocprim.nth_element ......................... Passed 0.01 sec
Start 34: rocprim.device_partial_sort
34/73 Test #34: rocprim.device_partial_sort ................. Passed 0.02 sec
Start 35: rocprim.device_reduce
35/73 Test #35: rocprim.device_reduce ....................... Passed 10.92 sec
Start 36: rocprim.device_run_length_encode
36/73 Test #36: rocprim.device_run_length_encode ............ Passed 14.34 sec
Start 37: rocprim.device_search
37/73 Test #37: rocprim.device_search ....................... Passed 0.01 sec
Start 38: rocprim.device_segmented_radix_sort
38/73 Test #38: rocprim.device_segmented_radix_sort ......... Passed 38.28 sec
Start 39: rocprim.device_search_n
39/73 Test #39: rocprim.device_search_n ..................... Passed 0.02 sec
Start 40: rocprim.device_segmented_reduce
40/73 Test #40: rocprim.device_segmented_reduce ............. Passed 7.19 sec
Start 41: rocprim.device_segmented_scan
41/73 Test #41: rocprim.device_segmented_scan ............... Passed 0.02 sec
Start 42: rocprim.device_transform
42/73 Test #42: rocprim.device_transform .................... Passed 17.64 sec
Start 43: rocprim.discard_iterator
43/73 Test #43: rocprim.discard_iterator .................... Passed 0.12 sec
Start 44: rocprim.radix_key_codec
44/73 Test #44: rocprim.radix_key_codec ..................... Passed 0.01 sec
Start 45: rocprim.predicate_iterator
45/73 Test #45: rocprim.predicate_iterator .................. Passed 0.08 sec
Start 46: rocprim.reverse_iterator
46/73 Test #46: rocprim.reverse_iterator .................... Pas10:13:26 [46/1844]
Start 47: rocprim.rocprim_tuple
47/73 Test #47: rocprim.rocprim_tuple ....................... Passed 0.01 sec
Start 48: rocprim.rocprim_types
48/73 Test #48: rocprim.rocprim_types ....................... Passed 0.01 sec
Start 49: rocprim.texture_cache_iterator
49/73 Test #49: rocprim.texture_cache_iterator .............. Passed 0.01 sec
Start 50: rocprim.thread
50/73 Test #50: rocprim.thread .............................. Passed 0.08 sec
Start 51: rocprim.thread_algos
51/73 Test #51: rocprim.thread_algos ........................ Passed 0.43 sec
Start 52: rocprim.tuple
52/73 Test #52: rocprim.tuple ............................... Passed 0.01 sec
Start 53: rocprim.utils_sort_checker
53/73 Test #53: rocprim.utils_sort_checker .................. Passed 0.01 sec
Start 54: rocprim.transform_iterator
54/73 Test #54: rocprim.transform_iterator .................. Passed 0.12 sec
Start 55: rocprim.type_traits_interface_cpp17
55/73 Test #55: rocprim.type_traits_interface_cpp17 ......... Passed 0.01 sec
Start 56: rocprim.type_traits_interface_gnupp17
56/73 Test #56: rocprim.type_traits_interface_gnupp17 ....... Passed 0.01 sec
Start 57: rocprim.type_traits_interface_cpp20
57/73 Test #57: rocprim.type_traits_interface_cpp20 ......... Passed 0.01 sec
Start 58: rocprim.type_traits_interface_gnupp20
58/73 Test #58: rocprim.type_traits_interface_gnupp20 ....... Passed 0.01 sec
Start 59: rocprim.no_half_operators
59/73 Test #59: rocprim.no_half_operators ................... Passed 0.01 sec
Start 60: rocprim.intrinsics
60/73 Test #60: rocprim.intrinsics .......................... Passed 0.29 sec
Start 61: rocprim.intrinsics_atomic
61/73 Test #61: rocprim.intrinsics_atomic ................... Pas10:13:27 [16/1844]
Start 62: rocprim.invoke_result
62/73 Test #62: rocprim.invoke_result ....................... Passed 0.01 sec
Start 63: rocprim.warp_exchange
63/73 Test #63: rocprim.warp_exchange ....................... Passed 0.09 sec
Start 64: rocprim.warp_load
64/73 Test #64: rocprim.warp_load ........................... Passed 0.09 sec
Start 65: rocprim.warp_reduce
65/73 Test #65: rocprim.warp_reduce ......................... Passed 0.17 sec
Start 66: rocprim.warp_scan
66/73 Test #66: rocprim.warp_scan ........................... Passed 0.26 sec
Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
67/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ... Passed 0.26 sec
Start 68: rocprim.warp_sort
68/73 Test #68: rocprim.warp_sort ........................... Passed 0.10 sec
Start 69: rocprim.warp_store
69/73 Test #69: rocprim.warp_store .......................... Passed 0.01 sec
Start 70: rocprim.zip_iterator
70/73 Test #70: rocprim.zip_iterator ........................ Passed 0.01 sec
Start 71: rocprim.accumulator_t
71/73 Test #71: rocprim.accumulator_t ....................... Passed 0.01 sec
Start 72: hipgraph.basic
72/73 Test #72: hipgraph.basic .............................. Passed 0.01 sec
Start 73: hipgraph.algs
73/73 Test #73: hipgraph.algs ............................... Passed 0.01 sec
100% tests passed, 0 tests failed out of 73
Total Test time (real) = 175.34 sec
✅ test_rocprim.py PASSED
```
<detail>
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
The HIP Platform should have it's own shuffle functions, and when using HIP-Clang, we don't want to use HCC specific functions.