Skip to content

[hipSPARSELt] monorepo test#5

Closed
jayhawk-commits wants to merge 12 commits into
developfrom
joseph-monorepoTest-PR
Closed

[hipSPARSELt] monorepo test#5
jayhawk-commits wants to merge 12 commits into
developfrom
joseph-monorepoTest-PR

Conversation

@jayhawk-commits
Copy link
Copy Markdown
Collaborator

No description provided.

@jayhawk-commits jayhawk-commits deleted the joseph-monorepoTest-PR branch April 30, 2025 18:02
assistant-librarian Bot pushed a commit that referenced this pull request May 13, 2025
…ion (#5)

* added addition unit tests and implement missing function in match

* updated changelog
@jayhawk-commits jayhawk-commits self-assigned this May 17, 2025
@jayhawk-commits jayhawk-commits added the migration Tasks or issues tied to migration to this monorepo label May 17, 2025
ammallya pushed a commit that referenced this pull request Sep 24, 2025
…th multiple calls to adding the dependency. Add skip tests option (#5)

* move the google test dependency to the root cmakelists so we dont end up with multiple places to update

* add in skip test option to verify that including a dependency will not cause it to build unless the library is included elsewhere

* oops, got these backward

[ROCm/hipDNN commit: 7e3dbbb]
stanleytsang-amd pushed a commit that referenced this pull request Dec 12, 2025
## Motivation

Enable gfx1152 and gfx1153.

## Technical Details

1. combine arrays into tables and use local macros to reduce repetition
(for maintainability)
2. monkey-see-monkey-do wherever `gfx11...` was found

## Test Plan

Build existing ctests for, and run them on, gfx1152 and gfx1153.

## Test Result

### 

<details>
<summary>gfx1152 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 2 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
      Start  2: hip.async_copy
 1/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 2/73 Test  #1: hip.device_api ..............................   Passed    0.02 sec
      Start  4: rocprim.internal_merge_path
 3/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 4/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  9: rocprim.block_discontinuity
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.34 sec
      Start 10: rocprim.bit_cast
 9/73 Test #10: rocprim.bit_cast ............................   Passed    0.02 sec
      Start 11: rocprim.block_exchange
10/73 Test #11: rocprim.block_exchange ......................   Passed    0.73 sec
      Start 12: rocprim.block_histogram
11/73 Test #12: rocprim.block_histogram .....................   Passed    0.54 sec
      Start 13: rocprim.block_load_store
12/73 Test #13: rocprim.block_load_store ....................   Passed    0.44 sec
      Start 14: rocprim.block_sort_merge
13/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
14/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
15/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.03 sec
      Start 17: rocprim.block_radix_sort
16/73 Test #17: rocprim.block_radix_sort ....................   Passed    4.79 sec
      Start 18: rocprim.block_reduce
17/73 Test #18: rocprim.block_reduce ........................   Passed    0.26 sec
      Start 19: rocprim.block_run_length_decode
18/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.54 sec
      Start 20: rocprim.block_scan
19/73 Test #20: rocprim.block_scan ..........................   Passed    0.04 sec
      Start 21: rocprim.block_shuffle
20/73 Test #21: rocprim.block_shuffle .......................   Passed    2.70 sec
      Start 22: rocprim.block_sort_bitonic
21/73 Test  #9: rocprim.block_discontinuity .................   Passed   17.36 sec
      Start 23: rocprim.config_dispatch
22/73 Test #23: rocprim.config_dispatch .....................   Passed    0.09 sec
      Start 24: rocprim.constant_iterator
23/73 Test #24: rocprim.constant_iterator ...................   Passed    0.07 sec
      Start 25: rocprim.counting_iterator
24/73 Test #25: rocprim.counting_iterator ...................   Passed    0.07 sec
      Start 26: rocprim.device_batch_memcpy
25/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.20 sec
      Start 27: rocprim.device_binary_search
26/73 Test #27: rocprim.device_binary_search ................   Passed    0.02 sec
      Start 28: rocprim.device_adjacent_difference
27/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
28/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
29/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
30/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   13.23 sec
      Start 32: rocprim.device_merge
31/73 Test #31: rocprim.device_histogram ....................   Passed    7.85 sec
      Start 33: rocprim.nth_element
32/73 Test #33: rocprim.nth_element .........................   Passed    0.03 sec
      Start 34: rocprim.device_partial_sort
33/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
34/73 Test #35: rocprim.device_reduce .......................   Passed    8.94 sec
      Start 36: rocprim.device_run_length_encode
35/73 Test #32: rocprim.device_merge ........................   Passed   14.05 sec
      Start 37: rocprim.device_search
36/73 Test #37: rocprim.device_search .......................   Passed    0.02 sec
      Start 38: rocprim.device_segmented_radix_sort
37/73 Test #36: rocprim.device_run_length_encode ............   Passed   13.92 sec
      Start 39: rocprim.device_search_n
38/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
39/73 Test #40: rocprim.device_segmented_reduce .............   Passed    5.21 sec
      Start 41: rocprim.device_segmented_scan
40/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
41/73 Test #42: rocprim.device_transform ....................   Passed   13.90 sec
      Start 43: rocprim.discard_iterator
42/73 Test #43: rocprim.discard_iterator ....................   Passed    0.07 sec
      Start 44: rocprim.radix_key_codec
43/73 Test #44: rocprim.radix_key_codec .....................   Pas09:54:11 [55/1943]
      Start 45: rocprim.predicate_iterator
44/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.07 sec
      Start 46: rocprim.reverse_iterator
45/73 Test #46: rocprim.reverse_iterator ....................   Passed    0.09 sec
      Start 47: rocprim.rocprim_tuple
46/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
47/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
48/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
49/73 Test #50: rocprim.thread ..............................   Passed    0.07 sec
      Start 51: rocprim.thread_algos
50/73 Test #51: rocprim.thread_algos ........................   Passed    0.35 sec
      Start 52: rocprim.tuple
51/73 Test #52: rocprim.tuple ...............................   Passed    0.02 sec
      Start 53: rocprim.utils_sort_checker
52/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
53/73 Test #54: rocprim.transform_iterator ..................   Passed    0.11 sec
      Start 55: rocprim.type_traits_interface_cpp17
54/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
55/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
56/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
57/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
58/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
59/73 Test #60: rocprim.intrinsics ..........................   Passed    0.21 sec
      Start 61: rocprim.intrinsics_atomic
60/73 Test #61: rocprim.intrinsics_atomic ...................   Passed    0.02 sec
      Start 62: rocprim.invoke_result
61/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
62/73 Test #63: rocprim.warp_exchange .......................   Passed    0.08 sec
      Start 64: rocprim.warp_load
63/73 Test #64: rocprim.warp_load ...........................   Passed    0.08 sec
      Start 65: rocprim.warp_reduce
64/73 Test #65: rocprim.warp_reduce .........................   Passed    0.14 sec
      Start 66: rocprim.warp_scan
65/73 Test #66: rocprim.warp_scan ...........................   Passed    0.20 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
66/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.21 sec
      Start 68: rocprim.warp_sort
67/73 Test #68: rocprim.warp_sort ...........................   Passed    0.09 sec
      Start 69: rocprim.warp_store
68/73 Test #69: rocprim.warp_store ..........................   Passed    0.02 sec
      Start 70: rocprim.zip_iterator
69/73 Test #70: rocprim.zip_iterator ........................   Passed    0.02 sec
      Start 71: rocprim.accumulator_t
70/73 Test #71: rocprim.accumulator_t .......................   Passed    0.02 sec
      Start 72: hipgraph.basic
71/73 Test #72: hipgraph.basic ..............................   Passed    0.02 sec
      Start 73: hipgraph.algs
72/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec
73/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   31.97 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) =  71.80 sec
✅ test_rocprim.py PASSED
```

</details>



<details>
<summary>gfx1153 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 1 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
 1/73 Test  #1: hip.device_api ..............................   Passed    0.01 sec
      Start  2: hip.async_copy
 2/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 3/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  4: rocprim.internal_merge_path
 4/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.94 sec
      Start  9: rocprim.block_discontinuity
 9/73 Test  #9: rocprim.block_discontinuity .................   Passed   21.10 sec
      Start 10: rocprim.bit_cast
10/73 Test #10: rocprim.bit_cast ............................   Passed    0.01 sec
      Start 11: rocprim.block_exchange
11/73 Test #11: rocprim.block_exchange ......................   Passed    2.20 sec
      Start 12: rocprim.block_histogram
12/73 Test #12: rocprim.block_histogram .....................   Passed    0.71 sec
      Start 13: rocprim.block_load_store
13/73 Test #13: rocprim.block_load_store ....................   Passed    0.48 sec
      Start 14: rocprim.block_sort_merge
14/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
15/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
16/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.02 sec
      Start 17: rocprim.block_radix_sort
17/73 Test #17: rocprim.block_radix_sort ....................   Passed    6.12 sec
      Start 18: rocprim.block_reduce
18/73 Test #18: rocprim.block_reduce ........................   Passed    0.31 sec
      Start 19: rocprim.block_run_length_decode
19/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.68 sec
      Start 20: rocprim.block_scan
20/73 Test #20: rocprim.block_scan ..........................   Passed    0.03 sec
      Start 21: rocprim.block_shuffle
21/73 Test #21: rocprim.block_shuffle .......................   Passed    3.63 sec
      Start 22: rocprim.block_sort_bitonic
22/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   19.34 sec
      Start 23: rocprim.config_dispatch
23/73 Test #23: rocprim.config_dispatch .....................   Passed    0.10 sec
      Start 24: rocprim.constant_iterator
24/73 Test #24: rocprim.constant_iterator ...................   Passed    0.09 sec
      Start 25: rocprim.counting_iterator
25/73 Test #25: rocprim.counting_iterator ...................   Passed    0.09 sec
      Start 26: rocprim.device_batch_memcpy
26/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.42 sec
      Start 27: rocprim.device_binary_search
27/73 Test #27: rocprim.device_binary_search ................   Passed    0.01 sec
      Start 28: rocprim.device_adjacent_difference
28/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
29/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
30/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
31/73 Test #31: rocprim.device_histogram ....................   Passed    8.77 sec
      Start 32: rocprim.device_merge
32/73 Test #32: rocprim.device_merge ........................   Passed   16.23 sec
      Start 33: rocprim.nth_element
33/73 Test #33: rocprim.nth_element .........................   Passed    0.01 sec
      Start 34: rocprim.device_partial_sort
34/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
35/73 Test #35: rocprim.device_reduce .......................   Passed   10.92 sec
      Start 36: rocprim.device_run_length_encode
36/73 Test #36: rocprim.device_run_length_encode ............   Passed   14.34 sec
      Start 37: rocprim.device_search
37/73 Test #37: rocprim.device_search .......................   Passed    0.01 sec
      Start 38: rocprim.device_segmented_radix_sort
38/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   38.28 sec
      Start 39: rocprim.device_search_n
39/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
40/73 Test #40: rocprim.device_segmented_reduce .............   Passed    7.19 sec
      Start 41: rocprim.device_segmented_scan
41/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
42/73 Test #42: rocprim.device_transform ....................   Passed   17.64 sec
      Start 43: rocprim.discard_iterator
43/73 Test #43: rocprim.discard_iterator ....................   Passed    0.12 sec
      Start 44: rocprim.radix_key_codec
44/73 Test #44: rocprim.radix_key_codec .....................   Passed    0.01 sec
      Start 45: rocprim.predicate_iterator
45/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.08 sec
      Start 46: rocprim.reverse_iterator
46/73 Test #46: rocprim.reverse_iterator ....................   Pas10:13:26 [46/1844]
      Start 47: rocprim.rocprim_tuple
47/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
48/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
49/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
50/73 Test #50: rocprim.thread ..............................   Passed    0.08 sec
      Start 51: rocprim.thread_algos
51/73 Test #51: rocprim.thread_algos ........................   Passed    0.43 sec
      Start 52: rocprim.tuple
52/73 Test #52: rocprim.tuple ...............................   Passed    0.01 sec
      Start 53: rocprim.utils_sort_checker
53/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
54/73 Test #54: rocprim.transform_iterator ..................   Passed    0.12 sec
      Start 55: rocprim.type_traits_interface_cpp17
55/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
56/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
57/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
58/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
59/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
60/73 Test #60: rocprim.intrinsics ..........................   Passed    0.29 sec
      Start 61: rocprim.intrinsics_atomic
61/73 Test #61: rocprim.intrinsics_atomic ...................   Pas10:13:27 [16/1844]
      Start 62: rocprim.invoke_result
62/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
63/73 Test #63: rocprim.warp_exchange .......................   Passed    0.09 sec
      Start 64: rocprim.warp_load
64/73 Test #64: rocprim.warp_load ...........................   Passed    0.09 sec
      Start 65: rocprim.warp_reduce
65/73 Test #65: rocprim.warp_reduce .........................   Passed    0.17 sec
      Start 66: rocprim.warp_scan
66/73 Test #66: rocprim.warp_scan ...........................   Passed    0.26 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
67/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.26 sec
      Start 68: rocprim.warp_sort
68/73 Test #68: rocprim.warp_sort ...........................   Passed    0.10 sec
      Start 69: rocprim.warp_store
69/73 Test #69: rocprim.warp_store ..........................   Passed    0.01 sec
      Start 70: rocprim.zip_iterator
70/73 Test #70: rocprim.zip_iterator ........................   Passed    0.01 sec
      Start 71: rocprim.accumulator_t
71/73 Test #71: rocprim.accumulator_t .......................   Passed    0.01 sec
      Start 72: hipgraph.basic
72/73 Test #72: hipgraph.basic ..............................   Passed    0.01 sec
      Start 73: hipgraph.algs
73/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) = 175.34 sec
✅ test_rocprim.py PASSED
```

<detail>


## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
BrianHarrisonAMD added a commit that referenced this pull request Jan 30, 2026
#3710)

## Motivation

Optimizing the tensor filling functions started a discussion about
optimizing tensor iteration in general:
#3471 (comment)

## Technical Details

After some deliberation, the approach taken here (using std::variant
inside the iterator to represent the different types of indexing)
reflects both the desire the improve iteration in the case of packed
tensors while also maintaining the existing API.

A fully templated approach would be more optimal but would require API
changes to the ITensor class itself, whether making it templated or
changing the definition of its iterator-related methods at the very
least.

## Test Plan

Ran ninja check inside the build folder of hipDNN.

## Test Result

```
[185/187] Running all tests via ctest
Test project /therock/output/build/ml-libs/hipDNN/build
    Start 1: hipdnn_data_sdk_tests
1/7 Test #1: hipdnn_data_sdk_tests ............   Passed    1.30 sec
    Start 2: hipdnn_backend_tests
2/7 Test #2: hipdnn_backend_tests .............   Passed    1.29 sec
    Start 3: hipdnn_frontend_tests
3/7 Test #3: hipdnn_frontend_tests ............   Passed    0.03 sec
    Start 4: hipdnn_test_sdk_tests
4/7 Test #4: hipdnn_test_sdk_tests ............   Passed    4.32 sec
    Start 5: hipdnn_plugin_sdk_tests
5/7 Test #5: hipdnn_plugin_sdk_tests ..........   Passed    0.03 sec
    Start 6: public_hipdnn_backend_tests
6/7 Test #6: public_hipdnn_backend_tests ......   Passed    0.33 sec
    Start 7: public_hipdnn_frontend_tests
7/7 Test #7: public_hipdnn_frontend_tests .....   Passed    0.26 sec

100% tests passed, 0 tests failed out of 7

Label Time Summary:
integration_test    =   0.59 sec*proc (2 tests)
unit_test           =   6.96 sec*proc (5 tests)

Total Test time (real) =   7.56 sec
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
SamuelReeder added a commit that referenced this pull request Feb 23, 2026
- Fix bfloat16 data corruption: use torch for proper bfloat16 conversion
  instead of incorrectly mapping to float16 (different binary formats)
- Fix hardcoded DataType.FLOAT: read io/intermediate/compute data types
  from graph JSON instead of forcing everything to FLOAT
- Fix E2E timing sync: always set up torch.cuda.synchronize regardless
  of GPU timer backend, matching the PyTorch executor pattern
- Fix A/B testing: compare all output tensors, not just the first
- Fix buffer size: compute from strides when available to prevent OOB
- Fix std calculation: use sample std (ddof=1) instead of population std
- Warn on unknown data type fallback in tensor_info element_size

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
BrianHarrisonAMD pushed a commit that referenced this pull request Mar 4, 2026
## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
This PR addresses hipDNN issue #4951, which requests adding missing
frontend integration test coverage for the Matmul operation.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->
This PR includes three changes:

### 1. FrontendGraphFactory support for Matmu


Added MATMUL to OperationType.  
Added switch‑case dispatch in FrontendGraphFactory::create().  
Implemented createMatmulGraph() using:

```cpp
graph.matmul(a, b, matmulAttrs);
```

with simple 2×3 and 3×4 matrix inputs for deterministic testing.


### 2. Added new integration test: IntegrationMatmul.cpp

Following the structure of IntegrationConvForward.cpp, the test:

- is parameterized with:
  - good plugin  
  - execute‑fail plugin  
  - no‑engines plugin  
- tests both auto‑assigned and manual UIDs  
- builds a small Matmul graph using float tensors  
- exercises the entire frontend execution pipeline:

```
validate() → build_operation_graph() → create_execution_plans()
→ check_support() → build_plans() → get_workspace_size() → execute()
```

- uses SKIP_IF_NO_DEVICES() for GPU‑dependent execution  
- creates variant packs using device memory from the test tensor bundle
- verifies expected failures for execute‑fail and no‑engines plugins  

### 3. CMake update

Added IntegrationMatmul.cpp to tests/frontend/CMakeLists.txt under
public_hipdnn_frontend_tests.


## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->
All tests were built and executed inside the official TheRock docker
environment.

## Test Result

<!-- Briefly summarize test outcomes. -->

```
[1/2] Validating test names with --gtest_list_tests test collection

Test Name Validation Report
============================================================
Total tests found: 2901
Valid test names: 2901
Invalid test names: 0
```

```
[1/2] Running all tests via ctest
Test project /therock/output/build/ml-libs/hipDNN/build
    Start 1: hipdnn_data_sdk_tests
1/7 Test #1: hipdnn_data_sdk_tests ............   Passed    0.96 sec
    Start 2: hipdnn_backend_tests
2/7 Test #2: hipdnn_backend_tests .............   Passed    1.38 sec
    Start 3: hipdnn_frontend_tests
3/7 Test #3: hipdnn_frontend_tests ............   Passed    0.05 sec
    Start 4: hipdnn_test_sdk_tests
4/7 Test #4: hipdnn_test_sdk_tests ............   Passed    8.19 sec
    Start 5: hipdnn_plugin_sdk_tests
5/7 Test #5: hipdnn_plugin_sdk_tests ..........   Passed    0.03 sec
    Start 6: public_hipdnn_backend_tests
6/7 Test #6: public_hipdnn_backend_tests ......   Passed    0.32 sec
    Start 7: public_hipdnn_frontend_tests
7/7 Test #7: public_hipdnn_frontend_tests .....   Passed    0.35 sec
```

```
100% tests passed, 0 tests failed out of 7

Label Time Summary:
integration_test    =   0.67 sec*proc (2 tests)
unit_test           =  10.61 sec*proc (5 tests)

Total Test time (real) =  11.29 sec
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: jovanau <u.jovana2@gmail.com>
NolanHannaAMD added a commit that referenced this pull request Mar 9, 2026
…being used after being freed. (#5220)

## Motivation

A `heap-use-after-free` error was triggered by AddressSanitizer on test
`CPU_Dump_NAN_FP32.testDump`.

## Technical Details

Root Cause Analysis:
The AddressSanitizer error occurred because the HIPOCProgramImpl
constructor was not storing the binary data passed to it. When
LoadProgram called LoadBinary and created a HIPOCProgram with the
returned vector, the temporary vector would go out of scope, but COMGR
still needed to access the binary data later, causing a use-after-free.

- The fix ensures that the HIPOCProgramImpl object owns the binary data
for its entire lifetime
- Both constructors now consistently store the binary data in the
`binary` member variable (std::vector)
- The uint8_t constructor converts the data to char format using
iterator range construction
- This prevents the use-after-free that occurred when COMGR tried to
access freed memory


## Test Plan

Test output before change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
=================================================================
==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8
READ of size 26088 at 0x7e0f08c50200 thread T0
    #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5
    #1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9
    #2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17
    #3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24
    #4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12
    #5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14
    #6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22
    #7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22
    #8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22
    #9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47
    #10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10
    #11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25
    #12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18
    #13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    #14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    #15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    #16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    #17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    #19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50
    #20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6
    #21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14
    #22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33
    #23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6
    #24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47
    #25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    #27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55
    #28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73
    #29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12
    #30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986)
    #31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d)

0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8)
freed by thread T0 here:
    #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3
    #1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2
    #2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25
    #3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13
    #4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4
    #5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2
    #6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7
    #7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5
    #8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    #9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    #10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    #11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    #12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    #13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    #14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

previously allocated by thread T0 here:
    #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35
    #1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27
    #2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32
    #3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20
    #4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20
    #5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33
    #6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9
    #7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9
    #8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27
    #9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    #10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    #11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    #12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    #13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    #14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    #15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*)
Shadow bytes around the buggy address:
  0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3639==ABORTING
```

## Test Result

Test output after change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
[       OK ] CPU_Dump_NAN_FP32.testDump (51 ms)
[----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (52 ms total)
[  PASSED  ] 1 test.
```

## Cline Analysis

### Test Coverage Analysis:

__1. LoadProgram Code Path (std::vector constructor):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/db_sync.cpp`
- __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file,
program_args, "")`
- __Coverage__: This test extensively exercises the LoadProgram →
LoadBinary → HIPOCProgramImpl constructor path
- __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942,
gfx1030) with different CU counts
- __Frequency__: Runs on thousands of kernel configurations in the
database sync tests

__2. Solution Binary Serialization (std::vector usage):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp`
- __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with
`std::vector<char> solution_binary`
- __Coverage__: Tests the save/load cycle of solution binaries
- __Scope__: Tests all convolution directions (Forward, BackwardData,
BackwardWeights)

__3. Additional Coverage:__

- __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp`
tests compression/decompression with `std::vector<char>`
- __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space
(related pattern)

__Test Quality Assessment:__

✅ __Both constructors are well-tested__:

- The `std::vector<char>` constructor is heavily exercised through
database sync tests
- The `std::vector<uint8_t>` constructor would be tested through any
code paths that use uint8_t binary data

✅ __Real-world scenarios covered__:

- Database synchronization (production kernel loading)
- Solution serialization (runtime binary handling)
- Multi-threaded execution (db_sync uses up to 32 threads)

✅ __Comprehensive architecture coverage__:

- Tests run on multiple GPU architectures
- Different compute unit configurations tested

__Confidence Level__: Very High


### Performance Analysis:

Regarding the performance impact of this fix, it's actually quite
minimal and represents good engineering practice:

__Memory Impact:__

- __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a
copy of the binary data in its `binary` member variable
- __Typical Size__: GPU code objects are usually relatively small
(typically a few KB to a few MB depending on kernel complexity)
- __Lifetime__: The memory is only held for the lifetime of the
HIPOCProgram object, which is typically short-lived during kernel
loading

__Performance Characteristics:__

- __One-time Copy Cost__: There's a single memory copy operation during
construction (std::vector copy or iterator range construction)
- __No Runtime Overhead__: Once constructed, there's no additional
performance cost during kernel execution
- __Memory Safety Benefit__: Eliminates potential crashes and undefined
behavior, which far outweighs the small memory cost

__Context in MIOpen:__

- This occurs during the kernel loading phase, not during actual ML
inference/training
- Kernel loading is already an expensive operation involving
compilation, module creation, etc.
- The additional memory copy is negligible compared to the overall
kernel loading time

__Trade-off Analysis:__

- __Cost__: Small increase in memory usage during kernel loading
- __Benefit__: Eliminates memory safety bugs that could cause crashes or
data corruption
- __Net Result__: Significantly positive - reliability and correctness
are much more valuable than the minimal memory overhead

In practice, this fix follows the RAII (Resource Acquisition Is
Initialization) principle and ensures proper ownership semantics, which
is standard best practice in modern C++. The performance impact should
be unnoticeable in real-world usage.
sebvince added a commit to sebvince/rocm-libraries that referenced this pull request Mar 16, 2026
jovanau added a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
This PR addresses hipDNN issue ROCm#4951, which requests adding missing
frontend integration test coverage for the Matmul operation.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->
This PR includes three changes:

### 1. FrontendGraphFactory support for Matmu


Added MATMUL to OperationType.  
Added switch‑case dispatch in FrontendGraphFactory::create().  
Implemented createMatmulGraph() using:

```cpp
graph.matmul(a, b, matmulAttrs);
```

with simple 2×3 and 3×4 matrix inputs for deterministic testing.


### 2. Added new integration test: IntegrationMatmul.cpp

Following the structure of IntegrationConvForward.cpp, the test:

- is parameterized with:
  - good plugin  
  - execute‑fail plugin  
  - no‑engines plugin  
- tests both auto‑assigned and manual UIDs  
- builds a small Matmul graph using float tensors  
- exercises the entire frontend execution pipeline:

```
validate() → build_operation_graph() → create_execution_plans()
→ check_support() → build_plans() → get_workspace_size() → execute()
```

- uses SKIP_IF_NO_DEVICES() for GPU‑dependent execution  
- creates variant packs using device memory from the test tensor bundle
- verifies expected failures for execute‑fail and no‑engines plugins  

### 3. CMake update

Added IntegrationMatmul.cpp to tests/frontend/CMakeLists.txt under
public_hipdnn_frontend_tests.


## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->
All tests were built and executed inside the official TheRock docker
environment.

## Test Result

<!-- Briefly summarize test outcomes. -->

```
[1/2] Validating test names with --gtest_list_tests test collection

Test Name Validation Report
============================================================
Total tests found: 2901
Valid test names: 2901
Invalid test names: 0
```

```
[1/2] Running all tests via ctest
Test project /therock/output/build/ml-libs/hipDNN/build
    Start 1: hipdnn_data_sdk_tests
1/7 Test ROCm#1: hipdnn_data_sdk_tests ............   Passed    0.96 sec
    Start 2: hipdnn_backend_tests
2/7 Test ROCm#2: hipdnn_backend_tests .............   Passed    1.38 sec
    Start 3: hipdnn_frontend_tests
3/7 Test ROCm#3: hipdnn_frontend_tests ............   Passed    0.05 sec
    Start 4: hipdnn_test_sdk_tests
4/7 Test ROCm#4: hipdnn_test_sdk_tests ............   Passed    8.19 sec
    Start 5: hipdnn_plugin_sdk_tests
5/7 Test ROCm#5: hipdnn_plugin_sdk_tests ..........   Passed    0.03 sec
    Start 6: public_hipdnn_backend_tests
6/7 Test ROCm#6: public_hipdnn_backend_tests ......   Passed    0.32 sec
    Start 7: public_hipdnn_frontend_tests
7/7 Test ROCm#7: public_hipdnn_frontend_tests .....   Passed    0.35 sec
```

```
100% tests passed, 0 tests failed out of 7

Label Time Summary:
integration_test    =   0.67 sec*proc (2 tests)
unit_test           =  10.61 sec*proc (5 tests)

Total Test time (real) =  11.29 sec
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: jovanau <u.jovana2@gmail.com>
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
…being used after being freed. (ROCm#5220)

## Motivation

A `heap-use-after-free` error was triggered by AddressSanitizer on test
`CPU_Dump_NAN_FP32.testDump`.

## Technical Details

Root Cause Analysis:
The AddressSanitizer error occurred because the HIPOCProgramImpl
constructor was not storing the binary data passed to it. When
LoadProgram called LoadBinary and created a HIPOCProgram with the
returned vector, the temporary vector would go out of scope, but COMGR
still needed to access the binary data later, causing a use-after-free.

- The fix ensures that the HIPOCProgramImpl object owns the binary data
for its entire lifetime
- Both constructors now consistently store the binary data in the
`binary` member variable (std::vector)
- The uint8_t constructor converts the data to char format using
iterator range construction
- This prevents the use-after-free that occurred when COMGR tried to
access freed memory


## Test Plan

Test output before change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
=================================================================
==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8
READ of size 26088 at 0x7e0f08c50200 thread T0
    #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5
    ROCm#1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9
    ROCm#2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17
    ROCm#3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24
    ROCm#4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12
    ROCm#5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14
    ROCm#6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22
    ROCm#7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22
    ROCm#8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22
    ROCm#9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47
    ROCm#10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10
    ROCm#11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25
    ROCm#12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18
    ROCm#13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    ROCm#14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    ROCm#15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    ROCm#16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    ROCm#17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    ROCm#19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50
    ROCm#20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6
    ROCm#21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14
    ROCm#22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33
    ROCm#23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6
    ROCm#24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47
    ROCm#25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    ROCm#27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55
    ROCm#28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73
    ROCm#29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12
    ROCm#30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986)
    ROCm#31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d)

0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8)
freed by thread T0 here:
    #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3
    ROCm#1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2
    ROCm#2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25
    ROCm#3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13
    ROCm#4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4
    ROCm#5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2
    ROCm#6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7
    ROCm#7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5
    ROCm#8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    ROCm#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    ROCm#10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    ROCm#11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    ROCm#12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    ROCm#13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    ROCm#14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

previously allocated by thread T0 here:
    #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35
    ROCm#1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27
    ROCm#2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32
    ROCm#3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20
    ROCm#4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20
    ROCm#5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33
    ROCm#6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9
    ROCm#7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9
    ROCm#8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27
    ROCm#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    ROCm#10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    ROCm#11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    ROCm#12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    ROCm#13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    ROCm#14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    ROCm#15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*)
Shadow bytes around the buggy address:
  0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3639==ABORTING
```

## Test Result

Test output after change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
[       OK ] CPU_Dump_NAN_FP32.testDump (51 ms)
[----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (52 ms total)
[  PASSED  ] 1 test.
```

## Cline Analysis

### Test Coverage Analysis:

__1. LoadProgram Code Path (std::vector constructor):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/db_sync.cpp`
- __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file,
program_args, "")`
- __Coverage__: This test extensively exercises the LoadProgram →
LoadBinary → HIPOCProgramImpl constructor path
- __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942,
gfx1030) with different CU counts
- __Frequency__: Runs on thousands of kernel configurations in the
database sync tests

__2. Solution Binary Serialization (std::vector usage):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp`
- __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with
`std::vector<char> solution_binary`
- __Coverage__: Tests the save/load cycle of solution binaries
- __Scope__: Tests all convolution directions (Forward, BackwardData,
BackwardWeights)

__3. Additional Coverage:__

- __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp`
tests compression/decompression with `std::vector<char>`
- __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space
(related pattern)

__Test Quality Assessment:__

✅ __Both constructors are well-tested__:

- The `std::vector<char>` constructor is heavily exercised through
database sync tests
- The `std::vector<uint8_t>` constructor would be tested through any
code paths that use uint8_t binary data

✅ __Real-world scenarios covered__:

- Database synchronization (production kernel loading)
- Solution serialization (runtime binary handling)
- Multi-threaded execution (db_sync uses up to 32 threads)

✅ __Comprehensive architecture coverage__:

- Tests run on multiple GPU architectures
- Different compute unit configurations tested

__Confidence Level__: Very High


### Performance Analysis:

Regarding the performance impact of this fix, it's actually quite
minimal and represents good engineering practice:

__Memory Impact:__

- __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a
copy of the binary data in its `binary` member variable
- __Typical Size__: GPU code objects are usually relatively small
(typically a few KB to a few MB depending on kernel complexity)
- __Lifetime__: The memory is only held for the lifetime of the
HIPOCProgram object, which is typically short-lived during kernel
loading

__Performance Characteristics:__

- __One-time Copy Cost__: There's a single memory copy operation during
construction (std::vector copy or iterator range construction)
- __No Runtime Overhead__: Once constructed, there's no additional
performance cost during kernel execution
- __Memory Safety Benefit__: Eliminates potential crashes and undefined
behavior, which far outweighs the small memory cost

__Context in MIOpen:__

- This occurs during the kernel loading phase, not during actual ML
inference/training
- Kernel loading is already an expensive operation involving
compilation, module creation, etc.
- The additional memory copy is negligible compared to the overall
kernel loading time

__Trade-off Analysis:__

- __Cost__: Small increase in memory usage during kernel loading
- __Benefit__: Eliminates memory safety bugs that could cause crashes or
data corruption
- __Net Result__: Significantly positive - reliability and correctness
are much more valuable than the minimal memory overhead

In practice, this fix follows the RAII (Resource Acquisition Is
Initialization) principle and ensures proper ownership semantics, which
is standard best practice in modern C++. The performance impact should
be unnoticeable in real-world usage.
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
…#3, #5, stacked on #1, #2, #5]

Generalize the fast path to support double-precision accumulation:

- Template ShadowBuffer<AccumT>: storage, pointer, and element access
  are all AccumT. Float/Double inputs zero-copy when AccumT matches;
  sub-float types go through float then widen.
- Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType>
  (renamed from loadToFloat/storeFromFloat).
- Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>:
  all tile registers, inner reduction, epilogue, alpha/beta extraction,
  bias reading, and activation args use AccumT.
- Add Double to isFastPathEligible's supported input/output types.
- SolveGemmCPU dispatch: route Double to solveCPUFast<double>.
- Add 10 f64 fast-path tests (transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jmachado-amd pushed a commit to jmachado-amd/rocm-libraries that referenced this pull request Apr 14, 2026
Initial implementation of GEEV's tests
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
…#3, #5, stacked on #1, #2, #5]

Generalize the fast path to support double-precision accumulation:

- Template ShadowBuffer<AccumT>: storage, pointer, and element access
  are all AccumT. Float/Double inputs zero-copy when AccumT matches;
  sub-float types go through float then widen.
- Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType>
  (renamed from loadToFloat/storeFromFloat).
- Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>:
  all tile registers, inner reduction, epilogue, alpha/beta extraction,
  bias reading, and activation args use AccumT.
- Add Double to isFastPathEligible's supported input/output types.
- SolveGemmCPU dispatch: route Double to solveCPUFast<double>.
- Add 10 f64 fast-path tests (transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 30, 2026
Records the 5 approaches considered for symbolic-vs-numeric register
robustness:
  1. Name-resolution table (brittle, complex)
  2. Symbolic-only normalization (doesn't solve actual problem)
  3. Numeric-only resolution (assembly-time dependency)
  4. Equivalence-class comparison (loses precision)
  5. Render-string identity (matches GPU view, robust)

And the rationale for picking #5.

Documents the known limitation: same logical reg with different
identifiers across captures still differs. Doesn't arise in practice
because both captures consume the same writer state; future work if
needed would add approach #1 (name-resolution table) on top.
Alex-Vasile added a commit that referenced this pull request May 1, 2026
#5)

Completes the coverage-gap inventory for ScheduleCapture._reads. Pins
the LR LDS-address gap (DSLoad src is LocalReadAddrA, modified by LRS
VXorB32 — invisible RAW today) and the LW LDS-address gap (DSStore
dstAddr is LocalWriteAddrA, same VXorB32 producer pattern). Both fail
loudly when Sub-task 10's wrapper-based extractors close the gap.
Alex-Vasile added a commit that referenced this pull request May 5, 2026
Adds _iter_note(producer, consumer) in ScheduleCapture.py: returns
" (of next iteration)" when consumer.position.loop_index ==
producer.position.loop_index + 1. Generalizes the prior MissingWaitFailure
inline check (which hardcoded BODY_LABEL_TO_LOOP_INDEX[ML_PREV] -> [ML])
to any i -> i+1 boundary; loop_index is the canonical cross-body
iteration counter so the numeric +1 test is the right discriminator.

MissingWaitFailure (#2) refactored to use the helper.
WaitTooLateFailure (#4), WaitInsufficientFailure (#5), and
MissingBarrierFailure (#6) now also append the suffix when the
producer/consumer pair crosses an iteration boundary. Suffix attaches
right after the consumer's `@ idx=N` mention so the message reads:

  MFMA[name] @ idx=10 (of next iteration) is guaranteed by an SWaitCnt @ idx=12 ...
  MFMA[name] @ idx=10 (of next iteration)'s producer LRA0 @ idx=5 ...
  ... between the SWaitCnt and GRA @ idx=2 (of next iteration).

3 new cross-iter pinning tests + 3 same-iter regression assertions
(`assert "(of next iteration)" not in msg`) so a future regression that
incorrectly fires the suffix on same-iter pairs is caught.
Alex-Vasile added a commit that referenced this pull request May 5, 2026
Adds _node_with_pos(node, capture) — combines _node_label (per-category-
stream [N] index, plain MFMA omits) with bare '@ idx=M' position render.
Single helper for the canonical Failure node reference shape, replacing
three different prior styles:

  - #1, #2: manual `_node_label + " " + "@ idx=N"` concatenation
  - #4, #5: `category[name] format_position(...)` (rendered the FULL name
    inside brackets, e.g. `LRA0[LRA0[0]]`, plus a cross-category list
    suffix that duplicates [N]'s purpose)
  - #6: `category format_position(...)` (no brackets)

All five formatters now route through _node_with_pos. Plain MFMA stays
bracket-less per _node_label's MFMA discriminator; PackMFMAs (categories
PackA*/PackB*) keep [N] because CMS reschedules them.

Waits in #5 stay as bare '@ idx=N' — the surrounding 'SWaitCnt' word
already names the kind; rendering the SYNC category as `SYNC[N] @ idx=M`
would just duplicate that.

Skipped:
  - #10 SCCConflict: brackets carry rocisa class name (e.g. [SCSelectB32])
    not [N] index; semantic conflict, separate audit.
  - #7 WrongInterleaving / #8 TimingTooClose: use `name` field for Pack
    identity (MiddlePack_a/_b/_c, CVT0_a/_b); replacing with [N] would
    lose the a/b/c discriminator.
  - #13 ConstraintViolation: slated for deletion in bead `pcz`.

3 new pinning tests verifying [N] actually appears when capture is given
(one per Failure: #4 LRA0[1], #5 LRA0[1] + plain MFMA bracket-less,
#6 GRA[1] in trailing reference). Test count: 564 passed (+3).
b-shi added a commit that referenced this pull request May 6, 2026
…edge dispatch (#7064)

## Motivation

Fix correctness failures in the `UseSubtileImpl` NonEdge store path for
gfx950 BF16 and MXFP4 subtile kernels. These failures were caused by
several interrelated bugs in the interleaved (GLS=0) store codegen that
manifested when M-guard branches skipped stores at runtime, and by an
overly strict edge/NonEdge dispatch check that forced subtile-aligned
workgroups into the unoptimized edge path.

## Technical Details

**1. vmcnt hazard in interleaved B1 NonEdge store path**
(`GlobalWriteBatch.py`)

On gfx950 (`SeparateVscnt=False`), loads and stores share the same vmcnt
counter. The interleaved store path computed `vmcnt = vlcnt + vscnt`
where `vscnt = self.storesIssued`. When M-guard branches skip stores at
runtime, the actual vmcnt counter has fewer outstanding operations than
codegen assumed, making `s_waitcnt vmcnt(N)` too permissive — the
hardware doesn't wait long enough for C-loads to complete before fmacs
consume them. Fixed by setting `vscnt = 0` for `UseSubtileImpl` with
`GroupLoadStore=False`. This is more conservative but correct regardless
of how many stores were skipped at runtime.

**2. SrdD increment skipped by M-guard branch** (`GlobalWriteBatch.py`)

The SrdD `incToNextRow` was emitted inside the M-guard-skippable region
of the store loop. When the M-guard branch skipped the last store in an
N-group, the SrdD increment was also skipped, causing subsequent N-group
stores to write to the wrong row address. Fixed by deferring the
increment and emitting it after the N-group end label.

**3. N-group end label placed after next N-group's fmacs**
(`GlobalWriteBatch.py`)

When `GroupLoadStore=False`, fmacs and stores are interleaved in the
same module. The fmacs for N-group K+1 were emitted before the N-group K
end label was placed. The M-guard branch (targeting the end label) would
skip both the last store of N-group K *and* the fmacs for N-group K+1,
leaving the K+1 accumulators at zero. Fixed by flushing the pending
N-group end label and deferred SrdD increment before emitting the next
N-group's fmacs.

**4. Paired store blockIdxM guard mismatch** (`GlobalWriteBatch.py`)

`SubtileMGuard` counts valid M-blocks in `MatrixInstM` (16-row) units,
but `blockIdxM` for paired and orphan stores was computed in 32-row
units (`(tt0-1)//2` or `(tt0*16)//32`). This mismatch caused incorrect
OOB guard decisions. Fixed by using `blockIdxM = tt0` (16-row index) to
match MGuard units.

**5. Scalar fallback for partial paired stores** (`GlobalWriteBatch.py`)

Added a scalar `dwordx2` fallback when only the lower M-block in a pair
is valid (`MGuard > tt0-1` but not `MGuard > tt0`). Previously, the
paired `dwordx4` store would execute for both blocks even when the upper
block was OOB. This case arises when the tile remainder has an odd
number of `MatrixInstM`-sized blocks (e.g., remainder=48 = 3 blocks of
16 rows).

**6. sba=1 orphan store for large macro tiles** (`GlobalWriteBatch.py`)

When `MIWaveTile[0]` is large enough that batch boundaries split an
(sba=0, sba=1) pair, the sba=1 element had no partner and was silently
dropped. Added scalar store handling for this orphan case.

**7. Relaxed edge/NonEdge dispatch alignment**
(`KernelWriterAssembly.py`)

The `checkIsEdgeSubtile` M-dimension alignment was `waveGroupM` (e.g.,
48 for MIWT3), requiring the tile remainder to be a multiple of the full
wave group height. Reduced to `MatrixInstM` (16), so any remainder that
is a multiple of 16 rows takes the optimized NonEdge path. The NonEdge
path's MGuard + scalar fallback (fix #5) handles partial wave groups
correctly.

All changes are guarded by `UseSubtileImpl` — no impact on non-subtile
kernels.

## Test Plan

- Run `subtile_bf16.yaml` (BF16 BBS/BSS, multiple MIWT configs, SK3,
PGR0/PGR2) with tile-aligned sizes (M % 32 == 0, N % 16 == 0)
- Run `subtile_mxfp4.yaml` (MXFP4 F4BS/F4HS/F4SS, bias, activations,
ScaleAlphaVec, PGR0/PGR2) with tile-aligned sizes (M % 32 == 0, N % 32
== 0)
- Verified edge stores are not exercised for tile-aligned sizes by
temporarily disabling edge path stores and confirming all tests still
pass
- Verified with rocgdb breakpoints that MGuard/NGuard values, C-load
data, and accumulator values are correct at N-group boundaries

## Test Result

- `subtile_bf16.yaml` (tile-aligned sizes): all tests PASSED
- `subtile_mxfp4.yaml` (tile-aligned sizes): all tests PASSED
- Edge-stores-disabled verification: all tests PASSED (confirms
tile-aligned sizes use NonEdge path exclusively)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
bghimireamd added a commit that referenced this pull request May 7, 2026
…, move inspection tool

- Remove reviewer name from RFC body (line 543)
- Remove generate_diagrams.py from repo (keep locally)
- Remove dead graph_level_correctness diagram code from script
- Move Bundle Inspection Tool from Detailed Design to Future Work item #6
  (was marked v2/not-v1 but sitting in Detailed Design — confusing)
- Add metadata sidecar to Future Work item #5
- Rewrite infrastructure table with consistent Read/Split/Compare pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bghimireamd added a commit that referenced this pull request May 7, 2026
…o Future Work

- Change "partially working for batchnorm" to "initial infrastructure is in place"
- Add Future Work item #5: external data validation (Python-only comparison
  of client-submitted bundles against golden references)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bghimireamd added a commit that referenced this pull request May 10, 2026
- Generator auto-derives the output path (Operation/Layout/DataType)
  from graph content — developer supplies only tier and bundle name
- Add CLI example: --tier smoke --name Small → full path computed
- Update Folder Convention table: Path is auto-derived, not manual
- Future Work #5: auto-tier classification based on tensor element
  counts, matching getSmall/getMedium/getLargeEdge/getLargeStress

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aledudek pushed a commit that referenced this pull request May 20, 2026
…edge dispatch (#7064)

## Motivation

Fix correctness failures in the `UseSubtileImpl` NonEdge store path for
gfx950 BF16 and MXFP4 subtile kernels. These failures were caused by
several interrelated bugs in the interleaved (GLS=0) store codegen that
manifested when M-guard branches skipped stores at runtime, and by an
overly strict edge/NonEdge dispatch check that forced subtile-aligned
workgroups into the unoptimized edge path.

## Technical Details

**1. vmcnt hazard in interleaved B1 NonEdge store path**
(`GlobalWriteBatch.py`)

On gfx950 (`SeparateVscnt=False`), loads and stores share the same vmcnt
counter. The interleaved store path computed `vmcnt = vlcnt + vscnt`
where `vscnt = self.storesIssued`. When M-guard branches skip stores at
runtime, the actual vmcnt counter has fewer outstanding operations than
codegen assumed, making `s_waitcnt vmcnt(N)` too permissive — the
hardware doesn't wait long enough for C-loads to complete before fmacs
consume them. Fixed by setting `vscnt = 0` for `UseSubtileImpl` with
`GroupLoadStore=False`. This is more conservative but correct regardless
of how many stores were skipped at runtime.

**2. SrdD increment skipped by M-guard branch** (`GlobalWriteBatch.py`)

The SrdD `incToNextRow` was emitted inside the M-guard-skippable region
of the store loop. When the M-guard branch skipped the last store in an
N-group, the SrdD increment was also skipped, causing subsequent N-group
stores to write to the wrong row address. Fixed by deferring the
increment and emitting it after the N-group end label.

**3. N-group end label placed after next N-group's fmacs**
(`GlobalWriteBatch.py`)

When `GroupLoadStore=False`, fmacs and stores are interleaved in the
same module. The fmacs for N-group K+1 were emitted before the N-group K
end label was placed. The M-guard branch (targeting the end label) would
skip both the last store of N-group K *and* the fmacs for N-group K+1,
leaving the K+1 accumulators at zero. Fixed by flushing the pending
N-group end label and deferred SrdD increment before emitting the next
N-group's fmacs.

**4. Paired store blockIdxM guard mismatch** (`GlobalWriteBatch.py`)

`SubtileMGuard` counts valid M-blocks in `MatrixInstM` (16-row) units,
but `blockIdxM` for paired and orphan stores was computed in 32-row
units (`(tt0-1)//2` or `(tt0*16)//32`). This mismatch caused incorrect
OOB guard decisions. Fixed by using `blockIdxM = tt0` (16-row index) to
match MGuard units.

**5. Scalar fallback for partial paired stores** (`GlobalWriteBatch.py`)

Added a scalar `dwordx2` fallback when only the lower M-block in a pair
is valid (`MGuard > tt0-1` but not `MGuard > tt0`). Previously, the
paired `dwordx4` store would execute for both blocks even when the upper
block was OOB. This case arises when the tile remainder has an odd
number of `MatrixInstM`-sized blocks (e.g., remainder=48 = 3 blocks of
16 rows).

**6. sba=1 orphan store for large macro tiles** (`GlobalWriteBatch.py`)

When `MIWaveTile[0]` is large enough that batch boundaries split an
(sba=0, sba=1) pair, the sba=1 element had no partner and was silently
dropped. Added scalar store handling for this orphan case.

**7. Relaxed edge/NonEdge dispatch alignment**
(`KernelWriterAssembly.py`)

The `checkIsEdgeSubtile` M-dimension alignment was `waveGroupM` (e.g.,
48 for MIWT3), requiring the tile remainder to be a multiple of the full
wave group height. Reduced to `MatrixInstM` (16), so any remainder that
is a multiple of 16 rows takes the optimized NonEdge path. The NonEdge
path's MGuard + scalar fallback (fix #5) handles partial wave groups
correctly.

All changes are guarded by `UseSubtileImpl` — no impact on non-subtile
kernels.

## Test Plan

- Run `subtile_bf16.yaml` (BF16 BBS/BSS, multiple MIWT configs, SK3,
PGR0/PGR2) with tile-aligned sizes (M % 32 == 0, N % 16 == 0)
- Run `subtile_mxfp4.yaml` (MXFP4 F4BS/F4HS/F4SS, bias, activations,
ScaleAlphaVec, PGR0/PGR2) with tile-aligned sizes (M % 32 == 0, N % 32
== 0)
- Verified edge stores are not exercised for tile-aligned sizes by
temporarily disabling edge path stores and confirming all tests still
pass
- Verified with rocgdb breakpoints that MGuard/NGuard values, C-load
data, and accumulator values are correct at N-group boundaries

## Test Result

- `subtile_bf16.yaml` (tile-aligned sizes): all tests PASSED
- `subtile_mxfp4.yaml` (tile-aligned sizes): all tests PASSED
- Edge-stores-disabled verification: all tests PASSED (confirms
tile-aligned sizes use NonEdge path exclusively)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

migration Tasks or issues tied to migration to this monorepo project: rocblas

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant