Skip to content

[GitHub Actions] python-based workflows based on files changed in a pull request#8

Merged
jayhawk-commits merged 31 commits into
developfrom
joseph-workflowImprovements
May 9, 2025
Merged

[GitHub Actions] python-based workflows based on files changed in a pull request#8
jayhawk-commits merged 31 commits into
developfrom
joseph-workflowImprovements

Conversation

@jayhawk-commits
Copy link
Copy Markdown
Collaborator

@jayhawk-commits jayhawk-commits commented May 1, 2025

Key Workflows

Pull Request Fanout

  • When a pull request is opened or modified on the monorepo, determine which projects were changed. Fanout and create pull requests on those project's repos.
  • Those fanout pull requests will get the changed files for those individual projects and trigger the existing CICD pipelines on those repos.
  • Copying the results of the fanned out pull requests back to this monorepo pull request will be done in a separate change, as this change is already large to review and test.
  • For interfacing between the monorepo and subrepos, the pull request branches follow the naming convention: monorepo-pr-<monorepo-pr-number>-<subrepo-name>

Auto-Label

  • When a pull request is opened or modified on the monorepo, apply category labels based on files affected.
  • When a label is added to a monorepo pull request, apply the label to the fanned out pull requests in the subrepos if they have the label.
  • This latter mirroring is only one way from monorepo to individual repos. When a label is removed from the monorepo pull request, nothing happens.
  • As some teams use labels to determine what CICD pipelines are run, this mirroring is helpful for migration phase.
  • No new labels are created on the repo-level with this job. Creation of the expected labels are handled elsewhere.
  • Need to learn more about team practices to see if we need to mirror label removal on the monorepo pull request.

Pull Request Close Fanout

  • When a pull request is closed (including merges), close all the pull requests that were fanned out and delete those branches.
  • This is done through the naming convention of the fanned out branches.

Key Considerations

Sparse-Checkouts

  • Sparse-checkouts are leveraged as much as possible to clone as little code as possible.
  • The git subtree operations require fetch-depth of 0.

gh command-line interface and Authentication

python structure

  • Common code is stripped out into separate files.
  • The json defining the projects in the monorepo is now defined by a model.
  • All gh usage is done through a client class.
  • python script file names use underscores to follow PEP.
  • All python scripts are in one folder for now. We can figure out how we want to organize scripts as more get added.

Test Sequences

Environment:

  • A sandbox monorepo with forks of hipcub and rocthrust was created.
  • Simple randomized-result workflow was added to these project forks.
  • Prerequisite labels were added for testing.
  • Follow the sequence linked pull request events and comments describing them.
  • You can view action links here: https://github.com/amd-jmacaran/libs-mono-test/actions
  • Since my personal access token was used, my github account is attached to automated actions.
  • Assume all the operations done on the subrepos were automatically performed.

Test Sequence One:

  • Only one project in the monorepo is changed in a pull request.
  • Wait for workflows after pull request creation. See if a pull request is created on the subrepo.
  • Add label to the monorepo pull request. Monitor the subrepo pull request.
  • Add another label to the monorepo pull request. Monitor the subrepo pull request.
  • Close the monorepo pull request. Monitor the subrepo pull request.

Test Sequence Two:

  • Two projects in the monorepo are changed in a pull request.
  • Have a label added before hitting the create pull request button.
  • Wait for workflows after pull request creation. See if pull requests are created on the subrepo with corresponding labels.
  • Add label specific to one project to the monorepo pull request. Monitor the subrepo pull requests.
  • Remove changes for one project. Monitor the monorepo pull request's category labels. This subrepo is now ignored, so nothing happens on that pull request for now.
  • Close the monorepo pull request. Monitor the two subrepo pull requests.

Test Sequence Three:

  • Only one project in the monorepo is changed in a pull request.
  • Wait for workflows after pull request creation. See if a pull request is created on the subrepo.
  • Change additional files in the project in the monorepo pull request. Monitor the subrepo pull request.
  • Undo some of the changes in the project in the monorepo pull request. Monitor the subrepo pull request.

Two workflows and corresponding python scripts that take a look at files changed in a pull request on the monorepo. Then based on that output, perform tasks tied to the old individual repos. One workflow is applying labels. The other workflow is to fan out and create pull requests. There is some placeholder remaining on the second workflow.
Copy link
Copy Markdown
Collaborator

@geomin12 geomin12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some housekeeping / organizational items! looks good and will re-review upon more work done on the second part

Comment thread .github/workflows/pr-auto-label.yml Outdated
Comment thread .github/workflows/pr-auto-label.yml Outdated
Comment thread .github/scripts/pr-auto-label.py Outdated
Comment thread .github/scripts/pr-auto-label.py Outdated
Comment thread .github/scripts/pr-auto-label.py Outdated
Comment thread .github/scripts/pr-detect-changed-subtrees.py Outdated
Comment thread .github/scripts/pr-detect-changed-subtrees.py Outdated
Comment thread .github/scripts/pr-detect-changed-subtrees.py Outdated
Comment thread .github/scripts/pr-fanout.py Outdated
Comment thread .github/scripts/pr-fanout.py Outdated
Comment thread .github/workflows/pr-auto-label.yml Outdated
Comment thread .github/scripts/pr-auto-label.py Outdated
Comment thread .github/scripts/pr-fanout.py Outdated
Comment thread .github/scripts/pr-auto-label.py Outdated
- Refactored script code into smaller functions.
- Converted GitHub rest api calls to use gh cli.
- Moved out all gh cli usage to a separate class.
- Try to make the scripts as similar to each other, when applicable.
- Leverage dependabot to handle GitHub Actions versioning.
@jayhawk-commits
Copy link
Copy Markdown
Collaborator Author

Applied most of the provided feedback, but still in the middle of fixing things. I am pushing to remote as I plan on moving between work spaces today (home versus office). I will share some test logs when I am ready for further feedback.

- Use pydantic to define the entries in the json file and for validation when the json file is consumed.
- reponame field no longer used by any existing scripts.
- gh cli doesn't support writing GitHub checks, so REST API created to support this.
- Implement the functions that synchronize the custom labels and GitHub Checks between the monorepo PR and the fanned out PRs.
ammallya pushed a commit that referenced this pull request Sep 24, 2025
…#8)

* fix clang-format commands to use find correctly.  Original issue with find was cmake escaping things oddly.

* clang-tidy works with this setup

* clang tidy now work as i expect.  The headers are included using a positive filter, negtive filters dont work.

* updated clang-tidy and copilot instructions

* fix all repo violations, update clang-tidy for things were not going to bother dealing with.

* add option to turn on/off clangtidy.  default is off

* un nesting namespaces, running full format on repo

* ensure all files are checked by clang tidy.  Add one more exclusion for the generated export header that I cant figure out how to exclude.

* add more copilot instructions

* fix incorrect name
ammallya pushed a commit that referenced this pull request Sep 24, 2025
…#8)

* fix clang-format commands to use find correctly.  Original issue with find was cmake escaping things oddly.

* clang-tidy works with this setup

* clang tidy now work as i expect.  The headers are included using a positive filter, negtive filters dont work.

* updated clang-tidy and copilot instructions

* fix all repo violations, update clang-tidy for things were not going to bother dealing with.

* add option to turn on/off clangtidy.  default is off

* un nesting namespaces, running full format on repo

* ensure all files are checked by clang tidy.  Add one more exclusion for the generated export header that I cant figure out how to exclude.

* add more copilot instructions

* fix incorrect name

[ROCm/hipDNN commit: f8dce7f]
stanleytsang-amd pushed a commit that referenced this pull request Dec 12, 2025
## Motivation

Enable gfx1152 and gfx1153.

## Technical Details

1. combine arrays into tables and use local macros to reduce repetition
(for maintainability)
2. monkey-see-monkey-do wherever `gfx11...` was found

## Test Plan

Build existing ctests for, and run them on, gfx1152 and gfx1153.

## Test Result

### 

<details>
<summary>gfx1152 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 2 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
      Start  2: hip.async_copy
 1/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 2/73 Test  #1: hip.device_api ..............................   Passed    0.02 sec
      Start  4: rocprim.internal_merge_path
 3/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 4/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  9: rocprim.block_discontinuity
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.34 sec
      Start 10: rocprim.bit_cast
 9/73 Test #10: rocprim.bit_cast ............................   Passed    0.02 sec
      Start 11: rocprim.block_exchange
10/73 Test #11: rocprim.block_exchange ......................   Passed    0.73 sec
      Start 12: rocprim.block_histogram
11/73 Test #12: rocprim.block_histogram .....................   Passed    0.54 sec
      Start 13: rocprim.block_load_store
12/73 Test #13: rocprim.block_load_store ....................   Passed    0.44 sec
      Start 14: rocprim.block_sort_merge
13/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
14/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
15/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.03 sec
      Start 17: rocprim.block_radix_sort
16/73 Test #17: rocprim.block_radix_sort ....................   Passed    4.79 sec
      Start 18: rocprim.block_reduce
17/73 Test #18: rocprim.block_reduce ........................   Passed    0.26 sec
      Start 19: rocprim.block_run_length_decode
18/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.54 sec
      Start 20: rocprim.block_scan
19/73 Test #20: rocprim.block_scan ..........................   Passed    0.04 sec
      Start 21: rocprim.block_shuffle
20/73 Test #21: rocprim.block_shuffle .......................   Passed    2.70 sec
      Start 22: rocprim.block_sort_bitonic
21/73 Test  #9: rocprim.block_discontinuity .................   Passed   17.36 sec
      Start 23: rocprim.config_dispatch
22/73 Test #23: rocprim.config_dispatch .....................   Passed    0.09 sec
      Start 24: rocprim.constant_iterator
23/73 Test #24: rocprim.constant_iterator ...................   Passed    0.07 sec
      Start 25: rocprim.counting_iterator
24/73 Test #25: rocprim.counting_iterator ...................   Passed    0.07 sec
      Start 26: rocprim.device_batch_memcpy
25/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.20 sec
      Start 27: rocprim.device_binary_search
26/73 Test #27: rocprim.device_binary_search ................   Passed    0.02 sec
      Start 28: rocprim.device_adjacent_difference
27/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
28/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
29/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
30/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   13.23 sec
      Start 32: rocprim.device_merge
31/73 Test #31: rocprim.device_histogram ....................   Passed    7.85 sec
      Start 33: rocprim.nth_element
32/73 Test #33: rocprim.nth_element .........................   Passed    0.03 sec
      Start 34: rocprim.device_partial_sort
33/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
34/73 Test #35: rocprim.device_reduce .......................   Passed    8.94 sec
      Start 36: rocprim.device_run_length_encode
35/73 Test #32: rocprim.device_merge ........................   Passed   14.05 sec
      Start 37: rocprim.device_search
36/73 Test #37: rocprim.device_search .......................   Passed    0.02 sec
      Start 38: rocprim.device_segmented_radix_sort
37/73 Test #36: rocprim.device_run_length_encode ............   Passed   13.92 sec
      Start 39: rocprim.device_search_n
38/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
39/73 Test #40: rocprim.device_segmented_reduce .............   Passed    5.21 sec
      Start 41: rocprim.device_segmented_scan
40/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
41/73 Test #42: rocprim.device_transform ....................   Passed   13.90 sec
      Start 43: rocprim.discard_iterator
42/73 Test #43: rocprim.discard_iterator ....................   Passed    0.07 sec
      Start 44: rocprim.radix_key_codec
43/73 Test #44: rocprim.radix_key_codec .....................   Pas09:54:11 [55/1943]
      Start 45: rocprim.predicate_iterator
44/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.07 sec
      Start 46: rocprim.reverse_iterator
45/73 Test #46: rocprim.reverse_iterator ....................   Passed    0.09 sec
      Start 47: rocprim.rocprim_tuple
46/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
47/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
48/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
49/73 Test #50: rocprim.thread ..............................   Passed    0.07 sec
      Start 51: rocprim.thread_algos
50/73 Test #51: rocprim.thread_algos ........................   Passed    0.35 sec
      Start 52: rocprim.tuple
51/73 Test #52: rocprim.tuple ...............................   Passed    0.02 sec
      Start 53: rocprim.utils_sort_checker
52/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
53/73 Test #54: rocprim.transform_iterator ..................   Passed    0.11 sec
      Start 55: rocprim.type_traits_interface_cpp17
54/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
55/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
56/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
57/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
58/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
59/73 Test #60: rocprim.intrinsics ..........................   Passed    0.21 sec
      Start 61: rocprim.intrinsics_atomic
60/73 Test #61: rocprim.intrinsics_atomic ...................   Passed    0.02 sec
      Start 62: rocprim.invoke_result
61/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
62/73 Test #63: rocprim.warp_exchange .......................   Passed    0.08 sec
      Start 64: rocprim.warp_load
63/73 Test #64: rocprim.warp_load ...........................   Passed    0.08 sec
      Start 65: rocprim.warp_reduce
64/73 Test #65: rocprim.warp_reduce .........................   Passed    0.14 sec
      Start 66: rocprim.warp_scan
65/73 Test #66: rocprim.warp_scan ...........................   Passed    0.20 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
66/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.21 sec
      Start 68: rocprim.warp_sort
67/73 Test #68: rocprim.warp_sort ...........................   Passed    0.09 sec
      Start 69: rocprim.warp_store
68/73 Test #69: rocprim.warp_store ..........................   Passed    0.02 sec
      Start 70: rocprim.zip_iterator
69/73 Test #70: rocprim.zip_iterator ........................   Passed    0.02 sec
      Start 71: rocprim.accumulator_t
70/73 Test #71: rocprim.accumulator_t .......................   Passed    0.02 sec
      Start 72: hipgraph.basic
71/73 Test #72: hipgraph.basic ..............................   Passed    0.02 sec
      Start 73: hipgraph.algs
72/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec
73/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   31.97 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) =  71.80 sec
✅ test_rocprim.py PASSED
```

</details>



<details>
<summary>gfx1153 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 1 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
 1/73 Test  #1: hip.device_api ..............................   Passed    0.01 sec
      Start  2: hip.async_copy
 2/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 3/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  4: rocprim.internal_merge_path
 4/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.94 sec
      Start  9: rocprim.block_discontinuity
 9/73 Test  #9: rocprim.block_discontinuity .................   Passed   21.10 sec
      Start 10: rocprim.bit_cast
10/73 Test #10: rocprim.bit_cast ............................   Passed    0.01 sec
      Start 11: rocprim.block_exchange
11/73 Test #11: rocprim.block_exchange ......................   Passed    2.20 sec
      Start 12: rocprim.block_histogram
12/73 Test #12: rocprim.block_histogram .....................   Passed    0.71 sec
      Start 13: rocprim.block_load_store
13/73 Test #13: rocprim.block_load_store ....................   Passed    0.48 sec
      Start 14: rocprim.block_sort_merge
14/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
15/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
16/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.02 sec
      Start 17: rocprim.block_radix_sort
17/73 Test #17: rocprim.block_radix_sort ....................   Passed    6.12 sec
      Start 18: rocprim.block_reduce
18/73 Test #18: rocprim.block_reduce ........................   Passed    0.31 sec
      Start 19: rocprim.block_run_length_decode
19/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.68 sec
      Start 20: rocprim.block_scan
20/73 Test #20: rocprim.block_scan ..........................   Passed    0.03 sec
      Start 21: rocprim.block_shuffle
21/73 Test #21: rocprim.block_shuffle .......................   Passed    3.63 sec
      Start 22: rocprim.block_sort_bitonic
22/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   19.34 sec
      Start 23: rocprim.config_dispatch
23/73 Test #23: rocprim.config_dispatch .....................   Passed    0.10 sec
      Start 24: rocprim.constant_iterator
24/73 Test #24: rocprim.constant_iterator ...................   Passed    0.09 sec
      Start 25: rocprim.counting_iterator
25/73 Test #25: rocprim.counting_iterator ...................   Passed    0.09 sec
      Start 26: rocprim.device_batch_memcpy
26/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.42 sec
      Start 27: rocprim.device_binary_search
27/73 Test #27: rocprim.device_binary_search ................   Passed    0.01 sec
      Start 28: rocprim.device_adjacent_difference
28/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
29/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
30/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
31/73 Test #31: rocprim.device_histogram ....................   Passed    8.77 sec
      Start 32: rocprim.device_merge
32/73 Test #32: rocprim.device_merge ........................   Passed   16.23 sec
      Start 33: rocprim.nth_element
33/73 Test #33: rocprim.nth_element .........................   Passed    0.01 sec
      Start 34: rocprim.device_partial_sort
34/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
35/73 Test #35: rocprim.device_reduce .......................   Passed   10.92 sec
      Start 36: rocprim.device_run_length_encode
36/73 Test #36: rocprim.device_run_length_encode ............   Passed   14.34 sec
      Start 37: rocprim.device_search
37/73 Test #37: rocprim.device_search .......................   Passed    0.01 sec
      Start 38: rocprim.device_segmented_radix_sort
38/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   38.28 sec
      Start 39: rocprim.device_search_n
39/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
40/73 Test #40: rocprim.device_segmented_reduce .............   Passed    7.19 sec
      Start 41: rocprim.device_segmented_scan
41/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
42/73 Test #42: rocprim.device_transform ....................   Passed   17.64 sec
      Start 43: rocprim.discard_iterator
43/73 Test #43: rocprim.discard_iterator ....................   Passed    0.12 sec
      Start 44: rocprim.radix_key_codec
44/73 Test #44: rocprim.radix_key_codec .....................   Passed    0.01 sec
      Start 45: rocprim.predicate_iterator
45/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.08 sec
      Start 46: rocprim.reverse_iterator
46/73 Test #46: rocprim.reverse_iterator ....................   Pas10:13:26 [46/1844]
      Start 47: rocprim.rocprim_tuple
47/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
48/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
49/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
50/73 Test #50: rocprim.thread ..............................   Passed    0.08 sec
      Start 51: rocprim.thread_algos
51/73 Test #51: rocprim.thread_algos ........................   Passed    0.43 sec
      Start 52: rocprim.tuple
52/73 Test #52: rocprim.tuple ...............................   Passed    0.01 sec
      Start 53: rocprim.utils_sort_checker
53/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
54/73 Test #54: rocprim.transform_iterator ..................   Passed    0.12 sec
      Start 55: rocprim.type_traits_interface_cpp17
55/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
56/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
57/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
58/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
59/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
60/73 Test #60: rocprim.intrinsics ..........................   Passed    0.29 sec
      Start 61: rocprim.intrinsics_atomic
61/73 Test #61: rocprim.intrinsics_atomic ...................   Pas10:13:27 [16/1844]
      Start 62: rocprim.invoke_result
62/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
63/73 Test #63: rocprim.warp_exchange .......................   Passed    0.09 sec
      Start 64: rocprim.warp_load
64/73 Test #64: rocprim.warp_load ...........................   Passed    0.09 sec
      Start 65: rocprim.warp_reduce
65/73 Test #65: rocprim.warp_reduce .........................   Passed    0.17 sec
      Start 66: rocprim.warp_scan
66/73 Test #66: rocprim.warp_scan ...........................   Passed    0.26 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
67/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.26 sec
      Start 68: rocprim.warp_sort
68/73 Test #68: rocprim.warp_sort ...........................   Passed    0.10 sec
      Start 69: rocprim.warp_store
69/73 Test #69: rocprim.warp_store ..........................   Passed    0.01 sec
      Start 70: rocprim.zip_iterator
70/73 Test #70: rocprim.zip_iterator ........................   Passed    0.01 sec
      Start 71: rocprim.accumulator_t
71/73 Test #71: rocprim.accumulator_t .......................   Passed    0.01 sec
      Start 72: hipgraph.basic
72/73 Test #72: hipgraph.basic ..............................   Passed    0.01 sec
      Start 73: hipgraph.algs
73/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) = 175.34 sec
✅ test_rocprim.py PASSED
```

<detail>


## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
evetsso pushed a commit to evetsso/rocm-libraries that referenced this pull request Dec 31, 2025
…EV-536565

SWDEV-536565 Add missing includes to fix build
NolanHannaAMD added a commit that referenced this pull request Mar 9, 2026
…being used after being freed. (#5220)

## Motivation

A `heap-use-after-free` error was triggered by AddressSanitizer on test
`CPU_Dump_NAN_FP32.testDump`.

## Technical Details

Root Cause Analysis:
The AddressSanitizer error occurred because the HIPOCProgramImpl
constructor was not storing the binary data passed to it. When
LoadProgram called LoadBinary and created a HIPOCProgram with the
returned vector, the temporary vector would go out of scope, but COMGR
still needed to access the binary data later, causing a use-after-free.

- The fix ensures that the HIPOCProgramImpl object owns the binary data
for its entire lifetime
- Both constructors now consistently store the binary data in the
`binary` member variable (std::vector)
- The uint8_t constructor converts the data to char format using
iterator range construction
- This prevents the use-after-free that occurred when COMGR tried to
access freed memory


## Test Plan

Test output before change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
=================================================================
==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8
READ of size 26088 at 0x7e0f08c50200 thread T0
    #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5
    #1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9
    #2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17
    #3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24
    #4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12
    #5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14
    #6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22
    #7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22
    #8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22
    #9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47
    #10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10
    #11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25
    #12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18
    #13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    #14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    #15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    #16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    #17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    #19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50
    #20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6
    #21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14
    #22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33
    #23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6
    #24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47
    #25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    #27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55
    #28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73
    #29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12
    #30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986)
    #31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d)

0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8)
freed by thread T0 here:
    #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3
    #1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2
    #2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25
    #3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13
    #4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4
    #5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2
    #6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7
    #7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5
    #8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    #9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    #10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    #11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    #12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    #13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    #14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

previously allocated by thread T0 here:
    #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35
    #1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27
    #2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32
    #3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20
    #4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20
    #5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33
    #6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9
    #7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9
    #8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27
    #9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    #10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    #11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    #12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    #13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    #14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    #15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    #16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*)
Shadow bytes around the buggy address:
  0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3639==ABORTING
```

## Test Result

Test output after change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
[       OK ] CPU_Dump_NAN_FP32.testDump (51 ms)
[----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (52 ms total)
[  PASSED  ] 1 test.
```

## Cline Analysis

### Test Coverage Analysis:

__1. LoadProgram Code Path (std::vector constructor):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/db_sync.cpp`
- __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file,
program_args, "")`
- __Coverage__: This test extensively exercises the LoadProgram →
LoadBinary → HIPOCProgramImpl constructor path
- __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942,
gfx1030) with different CU counts
- __Frequency__: Runs on thousands of kernel configurations in the
database sync tests

__2. Solution Binary Serialization (std::vector usage):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp`
- __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with
`std::vector<char> solution_binary`
- __Coverage__: Tests the save/load cycle of solution binaries
- __Scope__: Tests all convolution directions (Forward, BackwardData,
BackwardWeights)

__3. Additional Coverage:__

- __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp`
tests compression/decompression with `std::vector<char>`
- __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space
(related pattern)

__Test Quality Assessment:__

✅ __Both constructors are well-tested__:

- The `std::vector<char>` constructor is heavily exercised through
database sync tests
- The `std::vector<uint8_t>` constructor would be tested through any
code paths that use uint8_t binary data

✅ __Real-world scenarios covered__:

- Database synchronization (production kernel loading)
- Solution serialization (runtime binary handling)
- Multi-threaded execution (db_sync uses up to 32 threads)

✅ __Comprehensive architecture coverage__:

- Tests run on multiple GPU architectures
- Different compute unit configurations tested

__Confidence Level__: Very High


### Performance Analysis:

Regarding the performance impact of this fix, it's actually quite
minimal and represents good engineering practice:

__Memory Impact:__

- __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a
copy of the binary data in its `binary` member variable
- __Typical Size__: GPU code objects are usually relatively small
(typically a few KB to a few MB depending on kernel complexity)
- __Lifetime__: The memory is only held for the lifetime of the
HIPOCProgram object, which is typically short-lived during kernel
loading

__Performance Characteristics:__

- __One-time Copy Cost__: There's a single memory copy operation during
construction (std::vector copy or iterator range construction)
- __No Runtime Overhead__: Once constructed, there's no additional
performance cost during kernel execution
- __Memory Safety Benefit__: Eliminates potential crashes and undefined
behavior, which far outweighs the small memory cost

__Context in MIOpen:__

- This occurs during the kernel loading phase, not during actual ML
inference/training
- Kernel loading is already an expensive operation involving
compilation, module creation, etc.
- The additional memory copy is negligible compared to the overall
kernel loading time

__Trade-off Analysis:__

- __Cost__: Small increase in memory usage during kernel loading
- __Benefit__: Eliminates memory safety bugs that could cause crashes or
data corruption
- __Net Result__: Significantly positive - reliability and correctness
are much more valuable than the minimal memory overhead

In practice, this fix follows the RAII (Resource Acquisition Is
Initialization) principle and ensures proper ownership semantics, which
is standard best practice in modern C++. The performance impact should
be unnoticeable in real-world usage.
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
…being used after being freed. (ROCm#5220)

## Motivation

A `heap-use-after-free` error was triggered by AddressSanitizer on test
`CPU_Dump_NAN_FP32.testDump`.

## Technical Details

Root Cause Analysis:
The AddressSanitizer error occurred because the HIPOCProgramImpl
constructor was not storing the binary data passed to it. When
LoadProgram called LoadBinary and created a HIPOCProgram with the
returned vector, the temporary vector would go out of scope, but COMGR
still needed to access the binary data later, causing a use-after-free.

- The fix ensures that the HIPOCProgramImpl object owns the binary data
for its entire lifetime
- Both constructors now consistently store the binary data in the
`binary` member variable (std::vector)
- The uint8_t constructor converts the data to char format using
iterator range construction
- This prevents the use-after-free that occurred when COMGR tried to
access freed memory


## Test Plan

Test output before change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
=================================================================
==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8
READ of size 26088 at 0x7e0f08c50200 thread T0
    #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5
    ROCm#1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9
    ROCm#2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17
    ROCm#3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24
    ROCm#4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12
    ROCm#5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14
    ROCm#6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22
    ROCm#7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22
    ROCm#8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22
    ROCm#9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47
    ROCm#10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10
    ROCm#11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25
    ROCm#12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18
    ROCm#13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    ROCm#14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    ROCm#15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    ROCm#16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    ROCm#17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    ROCm#19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50
    ROCm#20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6
    ROCm#21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14
    ROCm#22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33
    ROCm#23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6
    ROCm#24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47
    ROCm#25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    ROCm#27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55
    ROCm#28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73
    ROCm#29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12
    ROCm#30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986)
    ROCm#31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d)

0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8)
freed by thread T0 here:
    #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3
    ROCm#1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2
    ROCm#2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25
    ROCm#3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13
    ROCm#4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4
    ROCm#5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2
    ROCm#6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7
    ROCm#7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5
    ROCm#8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    ROCm#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    ROCm#10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    ROCm#11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    ROCm#12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    ROCm#13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    ROCm#14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

previously allocated by thread T0 here:
    #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35
    ROCm#1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27
    ROCm#2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32
    ROCm#3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20
    ROCm#4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20
    ROCm#5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33
    ROCm#6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9
    ROCm#7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9
    ROCm#8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27
    ROCm#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    ROCm#10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    ROCm#11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    ROCm#12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    ROCm#13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    ROCm#14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    ROCm#15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    ROCm#16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*)
Shadow bytes around the buggy address:
  0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3639==ABORTING
```

## Test Result

Test output after change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
[       OK ] CPU_Dump_NAN_FP32.testDump (51 ms)
[----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (52 ms total)
[  PASSED  ] 1 test.
```

## Cline Analysis

### Test Coverage Analysis:

__1. LoadProgram Code Path (std::vector constructor):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/db_sync.cpp`
- __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file,
program_args, "")`
- __Coverage__: This test extensively exercises the LoadProgram →
LoadBinary → HIPOCProgramImpl constructor path
- __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942,
gfx1030) with different CU counts
- __Frequency__: Runs on thousands of kernel configurations in the
database sync tests

__2. Solution Binary Serialization (std::vector usage):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp`
- __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with
`std::vector<char> solution_binary`
- __Coverage__: Tests the save/load cycle of solution binaries
- __Scope__: Tests all convolution directions (Forward, BackwardData,
BackwardWeights)

__3. Additional Coverage:__

- __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp`
tests compression/decompression with `std::vector<char>`
- __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space
(related pattern)

__Test Quality Assessment:__

✅ __Both constructors are well-tested__:

- The `std::vector<char>` constructor is heavily exercised through
database sync tests
- The `std::vector<uint8_t>` constructor would be tested through any
code paths that use uint8_t binary data

✅ __Real-world scenarios covered__:

- Database synchronization (production kernel loading)
- Solution serialization (runtime binary handling)
- Multi-threaded execution (db_sync uses up to 32 threads)

✅ __Comprehensive architecture coverage__:

- Tests run on multiple GPU architectures
- Different compute unit configurations tested

__Confidence Level__: Very High


### Performance Analysis:

Regarding the performance impact of this fix, it's actually quite
minimal and represents good engineering practice:

__Memory Impact:__

- __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a
copy of the binary data in its `binary` member variable
- __Typical Size__: GPU code objects are usually relatively small
(typically a few KB to a few MB depending on kernel complexity)
- __Lifetime__: The memory is only held for the lifetime of the
HIPOCProgram object, which is typically short-lived during kernel
loading

__Performance Characteristics:__

- __One-time Copy Cost__: There's a single memory copy operation during
construction (std::vector copy or iterator range construction)
- __No Runtime Overhead__: Once constructed, there's no additional
performance cost during kernel execution
- __Memory Safety Benefit__: Eliminates potential crashes and undefined
behavior, which far outweighs the small memory cost

__Context in MIOpen:__

- This occurs during the kernel loading phase, not during actual ML
inference/training
- Kernel loading is already an expensive operation involving
compilation, module creation, etc.
- The additional memory copy is negligible compared to the overall
kernel loading time

__Trade-off Analysis:__

- __Cost__: Small increase in memory usage during kernel loading
- __Benefit__: Eliminates memory safety bugs that could cause crashes or
data corruption
- __Net Result__: Significantly positive - reliability and correctness
are much more valuable than the minimal memory overhead

In practice, this fix follows the RAII (Resource Acquisition Is
Initialization) principle and ensures proper ownership semantics, which
is standard best practice in modern C++. The performance impact should
be unnoticeable in real-world usage.
sebvince added a commit to sebvince/rocm-libraries that referenced this pull request Mar 23, 2026
* Add fp4 mfma support

* Allow using Zeros, Ones and Identity for MX types

* Display scales for MX types

* Fix non subtileImpl path bug

* Fix display issue on MX types (PrintTensor option)
nakajee pushed a commit to nakajee/rocm-libraries that referenced this pull request Mar 31, 2026
* Add fp4 mfma support

* Allow using Zeros, Ones and Identity for MX types

* Display scales for MX types

* Fix non subtileImpl path bug

* Fix display issue on MX types (PrintTensor option)
sebvince added a commit to sebvince/rocm-libraries that referenced this pull request Apr 3, 2026
* Add fp4 mfma support

* Allow using Zeros, Ones and Identity for MX types

* Display scales for MX types

* Fix non subtileImpl path bug

* Fix display issue on MX types (PrintTensor option)
sebvince added a commit to sebvince/rocm-libraries that referenced this pull request Apr 9, 2026
* Add fp4 mfma support

* Allow using Zeros, Ones and Identity for MX types

* Display scales for MX types

* Fix non subtileImpl path bug

* Fix display issue on MX types (PrintTensor option)
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
Add --type tf32 CLI option: uses float storage with XFloat32 math-op
truncation (10-bit mantissa). The slow reference path already handles
f32XdlMathOp == XFloat32 via ReferenceSolution<TypedGemm_S_S_S, float,
XFloat32>.

Changes:
- Add isTF32 parameter to runGemm, sets f32XdlMathOp on contraction
- Golden reference uses columnMajorGemm<float, XFloat32> for TF32
- TF32 validation tolerance set to 1.0f (13 mantissa bits lost)
- Console output shows MathOp=XFloat32 for TF32 runs
- 10 new slow-path tf32 tests (transpose combos, beta, bias, features,
  scaleAB)

Fast path still rejects XFloat32 — fast-path tf32 tests come next.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
#8, stacked on #1, #3]

Template solveCPUFastInF32 on MathOpAccumT so the inner reduction
applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the
XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU
dispatch to branch on f32XdlMathOp.

When MathOpAccumT=float (default), casts are no-ops with identical
codegen to the previous implementation.

Changes:
- solveCPUFastInF32 gains template<MathOpAccumT=float>
- innerReduction uses float(MathOpAccumT(val)) cast chain
- isFastPathEligible no longer rejects XFloat32
- SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32
- 10 new fast-path tf32 tests (20 total tf32 tests now)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
Add --type tf32 CLI option: uses float storage with XFloat32 math-op
truncation (10-bit mantissa). The slow reference path already handles
f32XdlMathOp == XFloat32 via ReferenceSolution<TypedGemm_S_S_S, float,
XFloat32>.

Changes:
- Add isTF32 parameter to runGemm, sets f32XdlMathOp on contraction
- Golden reference uses columnMajorGemm<float, XFloat32> for TF32
- TF32 validation tolerance set to 1.0f (13 mantissa bits lost)
- Console output shows MathOp=XFloat32 for TF32 runs
- 10 new slow-path tf32 tests (transpose combos, beta, bias, features,
  scaleAB)

Fast path still rejects XFloat32 — fast-path tf32 tests come next.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
#8, stacked on #1, #3]

Template solveCPUFastInF32 on MathOpAccumT so the inner reduction
applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the
XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU
dispatch to branch on f32XdlMathOp.

When MathOpAccumT=float (default), casts are no-ops with identical
codegen to the previous implementation.

Changes:
- solveCPUFastInF32 gains template<MathOpAccumT=float>
- innerReduction uses float(MathOpAccumT(val)) cast chain
- isFastPathEligible no longer rejects XFloat32
- SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32
- 10 new fast-path tf32 tests (20 total tf32 tests now)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jmachado-amd pushed a commit to jmachado-amd/rocm-libraries that referenced this pull request Apr 28, 2026
…ternalproject

Link to a locally built OpenBLAS static library
JH-Leon-KIM-AMD pushed a commit that referenced this pull request Apr 29, 2026
This entire device template (DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1)
emits MFMA intrinsics, which exist only on CDNA (gfx9). The shared
is_xdl_wmma_supported() helper returns true for FP16/BF16 with 16x16 on
gfx11/gfx12 because it is also used by WMMA device templates. As a result
the existing 16x16 XDL instances passed IsSupportedArgument on RDNA, were
launched, and produced garbage (no MFMA hardware).

This was exposed by RUN_ALL_TESTS=ON in build #8, which caused all FP16/BF16
TestGroupedConvndBwdData2d cases to fail on gfx11/gfx1201. FP32 silently
"passed" only because is_xdl_wmma_supported<float,...>() returns false.

Fix: in IsSupportedArgument, unconditionally reject the XDL template on
gfx11/gfx12. The corresponding WMMA path lives in the sibling file
device_grouped_conv_bwd_data_multiple_d_wmma_cshuffle.hpp and is unaffected.

This replaces the narrower CDEBlockTransferScalarPerVector_NPerBlock==1
guard added previously, which only covered the new NoShuffle instances
and did not protect the pre-existing 16x16 XDL instances.

Made-with: Cursor
bnemanich added a commit that referenced this pull request May 3, 2026
# Add gfx950 MXFP4 Subtile-based kernel implementation
## Summary
This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950
mxfp4)
and adds the **Subtile-based kernel implementation
(`UseSubtileImpl=1`)**
for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code
generation path optimized for **MXFP4** and **BF16** GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit
tests
needed to make it production-usable.
## Motivation
PR #6499 brought MX data type support online for gfx950, but the
existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
**greatly improves MXFP4 GEMM performance when using
`HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the
hipBLASLt CHANGELOG).
## What's included
### 1. New Subtile-based kernel components (Tensile)
New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`:
* `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of
  the subtile codegen path; replaces large portions of the standard
  prefetch / unroll / store flow when `UseSubtileImpl=1`.
* `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that
  builds the subtile-grained instruction graph (GR loads, LR offsets,
  MFMA tiles, scale loads, stores) from kernel parameters.
* `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the
logical
  schedule to an emit order respecting wave / register / hazard
  constraints.
* `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission
  helpers shared by the subtile components.
### 2. Kernel writer / common changes
* **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration
points
  for the subtile path — prefetch, GR offset calculation, LR offset
  calculation, post-loop, MFMA macro accounting, optimized `storeD`,
  LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr
  allocation / overflow guards, computeLoadSrd fix.
* **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**:
  introduces the `UseSubtileImpl` parameter, MX-related reject
  conditions for non-Subtile paths on gfx950, and additional valid GEMM
  type combinations for MX inputs.
* **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**,
  **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and
  defaults.
* **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import
  union with the `BufferLoadB32` cache-coherence change from #6837).
* **`Components/GlobalWriteBatch.py`**: optimized global write batching
  for the subtile path (~670 LOC of changes).
* **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**,
  **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**,
  **`KernelWriterModules.py`**: minor adjustments needed by the subtile
  pipeline.
### 3. rocisa / host / client
* **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new
  emitter.
* **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**:
  small fixups for the subtile path and gfx950 build.
* **`client/include/DataInitialization.hpp`**,
**`client/src/DataInitialization.cpp`**,
**`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**,
  **`client/include/TypedId.hpp`**: MX scale init and reference paths
  used by the new tests.
* **`clients/common/include/testing_matmul.hpp`**,
  **`clients/common/include/norm.hpp`**,
  **`clients/common/include/hipblaslt_datatype2string.hpp`**,
  **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1)
  testing and MX init.
### 4. Origami / solution selection (gfx950 MXFP4)
New auto-tuned logic yamls under

`projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/`
covering the FP4 SS / HS / BS variants in three layouts:
* `Origami/` (default)
* `Origami/Origami_nta4/` (no-transpose-A FP4)
* `Origami/Origami_ntb4/` (no-transpose-B FP4)
(9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml`
files in total.)
### 5. New tests
**End-to-end gfx950 GEMM yamls** in
`Tensile/Tests/common/gemm/gfx950/`:
* `subtile_bf16.yaml`, `subtile_mxfp4.yaml`
* `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
* `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
**StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
`sk_mx32f8_quick.yaml`.
**New unit tests** (`Tensile/Tests/unit/`):
* `test_SubtileBasedLogicalScheduler.py` (~1735 LOC)
* `test_SubtileBasedSchedulerRef.py` (~596 LOC)
* `test_gr_lr_roundtrip.py` (~571 LOC)
* `test_storeD_roundtrip.py` (~2420 LOC)
* `test_graTileAssignment.py` (~354 LOC)
* `test_lraTileAssignment.py` (~360 LOC)
* `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC)
**New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`.
### 6. Misc / hardening
* Reject conditions: gfx950 MX + non-Subtile, DepthU constraints,
GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
  for subtile MXFP4, missing-mxblock check for non-MX types.
* Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42).
* `forceDenorm=False` in `generateMXInput` (#11).
* Several rebase fixes, copyright/year header updates, and
review-comment
  fixes to `KernelWriter` / `KernelWriterAssembly`.
### 7. CHANGELOG
Greatly improved MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

## How to use
Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new
`subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path
is
opt-in — non-MX and non-gfx950 kernels are unaffected.
## Backwards compatibility / risk
* All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing
  solutions on other architectures or non-MX paths are unchanged.
* `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO).
* New Origami yamls only add solutions; nothing existing is modified.
## Test plan
* New gtests + unit tests run automatically in CI (Tensilelite Python
  unit suite, `MXDataGen_test`, `MXScalePadding_test`).
* New end-to-end gfx950 GEMM and StreamK yamls are added to the common
  test buckets.
* Manual: run the gfx950 MXFP4 subtile suites
  (`pytest -k gfx950` after building Tensile, plus
  `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity).
## Notes for reviewers
* This branch was rebased onto current `develop` (post-#6499) by
skipping
  the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499
squash-merged) and replaying only the subtile-specific work as a single
  squashed commit. The actual code changes in this PR are limited to the
  files listed above (24 added, 56 modified; ~+170k / −2.6k including
  generated logic yamls).
* The largest reviewable diffs are:
*
`Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py`
(new files)
  * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py`
  * `Tensile/SolutionStructs/{Problem,Solution}.py`
  * `Tensile/Components/{GlobalWriteBatch,StreamK}.py`
  * `clients/common/include/testing_matmul.hpp`
  * `client/src/DataInitialization.cpp`

* Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi <brianshi@amd.com>

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- be69c1d: Enable post-loop code generation, and add some
subroutines ---
Author: b-shi <brianshi@amd.com>

--- 646d102: LR offset calculation (#2) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi <brianshi@amd.com>

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi <brianshi@amd.com>

--- b6127bc: Update GR/LR offset calculation to fully support 2x2,
1x4, 4x1 waveConfigs (#7) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi <brianshi@amd.com>

--- 6edf53d: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for
subtile-based kernel (#6) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi <brianshi@amd.com>

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi <brianshi@amd.com>

--- a45c20c: Fix MX scale tensor initialization: set
forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the
subtile-based kernel (#10) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16)
---
Author: b-shi <brianshi@amd.com>

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi <brianshi@amd.com>

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi <brianshi@amd.com>

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi <brianshi@amd.com>

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi <brianshi@amd.com>

* Enable FixSrd2 for A/B

* Address comments from PR

---------

--- e4c64a7: Add nt libs ---
Author: b-shi <brianshi@amd.com>

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for
unaligned problem sizes (#21) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Add scale padding

* Add tests

* Remove redundant pre-swizzle path

* Remove code from
conflict

* Fix reverted mxdatagen path for tensile tests

* Add diverse test cases for scale padding in MXScalePadding_test and
subtile.yaml
- Expanded test cases to include non-multiple-of-32, even
non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and
subtile_mxfp4.yaml (#22) ---
Author: James Newling <james.newling@gmail.com>

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

  # | Description        | Dest | MIs | PGR | DU      | SK  | Sizes
  --+--------------------+------+-----+-----+---------+-----+------
  0 | BF16 TN main       | b    |  19 |   0 | 64      | 0,3 |  11
  1 | BF16 TN large DU   | b    |   4 |   0 | 128,192 | 0,3 |   7
  2 | BSS (f32 output)   | s    |   6 |   0 | 64      | 0,3 |   9
  3 | BF16 bias          | b    |   2 |   0 | 64      | 0   |   1

FP4 coverage (subtile_mxfp4.yaml):

  # | Description        | Dest | MIs | PGR | DU  | SK  | Sizes | Status
--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
  4 | FP4 PGR=2          | b    |  13 |   2 | 256 | 0   |   5   | new
  5 | FP4 expanded MIWT  | b    |  24 |   0 | 256 | 0   |   5   | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures
(commented)

Run times on gfx950 (8x MI350X):

  File               | NEV=-1 | NEV=0
  -------------------+--------+------
  subtile_bf16.yaml  |    23s |   23s
  subtile_mxfp4.yaml |    37s |   40s

Where NEV is number of elements to validate. I (James) have checked
these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Revert to single partition

* Start using dependencies

* as is

* start using separate EmittedModules

* remove reduntant wait

* Add _extractPathsFromBeforeDeps

* Continue simplification

* Simplifying

* Add more rules

* cleanup

* Add fp4 test

* fix test

* Add tests

* Remove after field on emittedmodule

* Refactoring instructionSchedule

* Add comments

* cleanup modules vs ops

* Refactoring print functions

* Test cleanup

* Add more tests

* Replace subgroup by partition

* Remove unused unroll param

* Add high level notes

* Simplify NLL and NGLL GR removal

* Add some comments

* Force instruction insertion if no slots available

* Fix test after rebase

* Move scale before A/B and track inflight count

* Fine-grain vmcnt calculation

* Separate counts for scaleA and B

* Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar
instruction serialization

* Fix test

* Add vmcnt test

* Fix duplicated loads for 1x4 and 4x1

* Fix placement in reverse order

* Fix regression on PGR0

* add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi <brianshi@amd.com>

--- c5000d3: Fix typo ---
Author: b-shi <brianshi@amd.com>

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2
(#30) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check
for store path (#29) ---
Author: b-shi <brianshi@amd.com>

* [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard,
and refactoring

- Replace Size%MT edge check with subtile-aligned check: NonEdge paired
  store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take
NonEdge.
- Add per-wave OOB guard (subtileM32ValidBlocksSgpr /
subtileN16ValidBlocksSgpr)
  to skip stores outside valid M/N tile bounds in the NonEdge path.
- Refactor duplicated OOB guard into _emitSubtileOobGuard helper;
refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
- Fix orphan scalar store blockIdxM (was tt0, now
(tt0*MatrixInstM)//mBlockSize).
- Add quick-exit and edge/non-edge header comments to generated ASM.

* Add some bias tests, combine M/N guard to single routine

* Add OOB check for C loads, update storeD unit tests to check OOB,
simplify quick exit checks

* Address more PR comments: add M group skip, and skip to store end.
simplified loadC OOB mask

---------

--- 637881a: Fix unit tests & remove legacy code for subtile
interleaving (#33) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix gr_lr_roundtrip test

* Use non-interleaved version as ref code

* Fix scheduler test

* Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization
for batched GEMM (#25) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Fix bacth count issue

* Add batch count tests

* Fix bacth count issue

* Address PR review: clarify FP4-specific byte stride and add
non-aligned batched tests

- Updated comments on dataBatchBytes computation to clarify FP4 packing
  assumption (2 elements/byte) and flag that non-FP4 block-scaling types
  would require updating this conversion.
- Added batched test cases with non-multiple-of-32 M/N dimensions:
  FP4 DU=256: [48,48,2] and [33,65,2]
  FP4 DU=512: [63,63,2]
  BF16: [50,100,2]

---------

--- a43247b: Update some test yamls (#31) ---
Author: b-shi <brianshi@amd.com>

--- e2f69c8: Add f4bs origami library with activation function
support. Refactor sgpr allocation to reduce sgpr usage in post loop.
Store code-path reorganization (#32) ---
Author: b-shi <brianshi@amd.com>

* Free swap/localwritebase sgprs before post-loop

* Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32
(base, nta4, ntb4 variants).

* Remove uneeded alignment and comment

* Add more epilogue tests

* Remove older origami library for f4bs

* Reorder post-loop code blocks to after persistant loop Misc fixes

* Fix build issues, relax longjump sgpr requirements

* Fix GSU0 branch logic

---------

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and
FP4→F32 GEMM (#35) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

* Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

- Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4,
Origami_ntb4
- Update F4BS yaml files: AssertSummationElementMultiple 32→256 for
K%256 enforcement
- Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in
Problem.py

* Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):
- No-bias block: same wavetile and problem size coverage as F4SS
- Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

* Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

---------

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very
large MT (#36) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Initial support for DU>256

* Renaming

* add option to do DU=512 in the tests

* blocked K-major for scale

* Change scaleSet swap logic

* Update print functions

* Put scales after values for avoid race conditions

* Fix tests

* more test

* tweak printschedule display

* Add PGR2 in the yaml tests

* Add new scaleGROp

* comment out failing tests

* Revert "comment out failing tests"

This reverts commit 1f5802c.

* Draft new logical scheduler

* Refactoring

* Add more test on step1

* Add more tests on step1

* add bf16 320x320 test

* reduce step1 code

* Simplify step1 logic

* validate some step1 test

* Fix partition 2x2 test

* more step1 test

* 320x320 BF16 test

* Add test DU512 + partition2x2

* Simplify step1 code

* Add step2 tests

* Fix multi-partition step2

* Add step2 du512, 2x2 partition test

* Use common algo for all numPartitions

* Draft for step3 tests

* remove useless tests

* New GR algo (draft)

* [Step3] Add more test

* Iteration on GR

* Display ordered GR list with granularities

* More test

* Add some comments

* Disable by default debug logs

* Getting rid of step naming

* Start remove AnnotatedOp (still there in group pass)

* Split dependency Ops

* Add todo on place_GRs pass

* Valid test_annotate_deps_1x1_partition_DU256

* Test output looking better (still WIP)

* single dep for LR tooo

* Add remove_cross_deps pass

* Fix bugs in dependency pass

* insert_gr_lr_inc pass

* Add group_lr_gr pass

* Add emit pass

* Quick port of instruction Emit code

* Move emit function to separate file

* Refactoring instructionEmitter

* Port vgprTile tracking

* Reworking second pass (WIP)

* Display unrolling requirement

* Unrolling check on 2nd pass

* Generic validation for assign_vgpr pass

* Fix unroll

* Add inst schedule in standalone mode

* Use lrGran for vgprTile size calculation

* Fix bug in emit pass (missing depencency)

* PreMFMA path + non-duplication scale load

* missing globalReadLDSBufferSwap for GR_INC scales

* add wairlr_sync on all LR->GR dep

* add waitgr_sync op

* remove_unnecessary_gr_deps

* Change LR dispatch algo a bit to avoid too many waitgr_sync

* Avoid duplicated loads in emitter

* Fix bug on gr_emit code

* GrInc pass. fix duplicated insertion for B

* Fix missing LR_inc for SA/SB

* preloop, NLL, NGLL

* Simplify preloop

* minor changes

* Move unroll logic to scheduler

* minor changes

* Fix unroll id bug on NLL / NGLL

* Disable post GRINC for now

* Remove commented code

* Handle 1x4, 4x1 gr read gran

* Fix vmcnt computation

* Use correct grCount mapping

* Revert in emit logic on buffer_load for PGR0 needs

* Add bf16 version in standalone test

* Fix LR_Inc insertion on DU>64

* Add subIterK/Partition comment to codegen

* Fix issue in GrInc placement

* Remove last_mt

* Fix LR MT index bug with muli-partition

* Disable early LDS size check when subtileImpl is on

* Add pass to remove redundant LR deps + fixed issue on dependency
annotation pass

* Remove more LR redundant deps

* Only insert wait_lr_sync on deps

* Simple algo to select partition config

* Remove HC value for partitions...

* Take into account all inflight GR (all tensors)

* Fix tests and regressions on gr counts

* Fix grCount merge calculation

* Better display of dependencies

* Add remove_wait_lr_sync after grouping

* Add temporary non reg file

* Change merge logic on GR grouping pass

* Fix non necessary wait_lr_sync

* Downgrade some waitlr_sync to sync + added 384x256 no reg test

* non reg test 320x320

* Add larger MT

* non reg test for fp4 256x256

* Moving out instructionScheduler

* Remove old scheduler

* Renaming scheduler

* Re-work test

* Add larger MT test cases

* Rename non-ref test

* Re-add standalone mode

* Refactor DepOp

* Remove dead code

* Remove MFMATileSize class

* Remove from_til_info

* Avoid redundant tensor list creation

* Remove hardcode granularities in vgrpTile allocation pass. Simplify
code.

* Re-enable  # PGR=2 WG 4x1/1x4, K > DU tests

* Remove unused GRScaleOp

* DepRef renaming

* Get rid of MT string representation

* Remove TODO

* EmmitedModule simplication

* Use explicit pass dependencies

* Renaming LogicalScheduler

* Remove old test_InterleavingScheduler.py file

* Commenting failing test for now

* Remove debug logs

* Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix and simplify logic for remove_unnecessary_lr_deps

* Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale
layout (#42) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 9c74998: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 842b149: Addressed review comments for KernelWriter and
KernelWriterAssembly ---
Author: Koji Nakajima <knakajim@amd.com>

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- c075bbf: Fix preSolution CPU re-sync regressing
subtile_mxfp4.yaml ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich <brad.nemanich@amd.com>

--- bc2f6dd: Small update for gfx950 mx tests + more - enable
UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950
mxfp8 - use MXScaleFormat=1 as default - set
AssertSummationElementMultiple=256 for subtile mxfp4 - fix
isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi <brianshi@amd.com>

--- 727f8db: tensilelite: add solution reject conditions for
UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi <brianshi@amd.com>

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus,
update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- bbbc553: Update change log ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting -
skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima <knakajim@amd.com>

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi <brianshi@amd.com>

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima <knakajim@amd.com>

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima <knakajim@amd.com>

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi <brianshi@amd.com>

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima <knakajim@amd.com>

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
---
Author: Koji Nakajima <knakajim@amd.com>

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
(nta4,ntb4) ---
Author: Koji Nakajima <knakajim@amd.com>

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>
Alex-Vasile added a commit that referenced this pull request May 5, 2026
Adds _node_with_pos(node, capture) — combines _node_label (per-category-
stream [N] index, plain MFMA omits) with bare '@ idx=M' position render.
Single helper for the canonical Failure node reference shape, replacing
three different prior styles:

  - #1, #2: manual `_node_label + " " + "@ idx=N"` concatenation
  - #4, #5: `category[name] format_position(...)` (rendered the FULL name
    inside brackets, e.g. `LRA0[LRA0[0]]`, plus a cross-category list
    suffix that duplicates [N]'s purpose)
  - #6: `category format_position(...)` (no brackets)

All five formatters now route through _node_with_pos. Plain MFMA stays
bracket-less per _node_label's MFMA discriminator; PackMFMAs (categories
PackA*/PackB*) keep [N] because CMS reschedules them.

Waits in #5 stay as bare '@ idx=N' — the surrounding 'SWaitCnt' word
already names the kind; rendering the SYNC category as `SYNC[N] @ idx=M`
would just duplicate that.

Skipped:
  - #10 SCCConflict: brackets carry rocisa class name (e.g. [SCSelectB32])
    not [N] index; semantic conflict, separate audit.
  - #7 WrongInterleaving / #8 TimingTooClose: use `name` field for Pack
    identity (MiddlePack_a/_b/_c, CVT0_a/_b); replacing with [N] would
    lose the a/b/c discriminator.
  - #13 ConstraintViolation: slated for deletion in bead `pcz`.

3 new pinning tests verifying [N] actually appears when capture is given
(one per Failure: #4 LRA0[1], #5 LRA0[1] + plain MFMA bracket-less,
#6 GRA[1] in trailing reference). Test count: 564 passed (+3).
Alex-Vasile added a commit that referenced this pull request May 5, 2026
…index

Routes #7 WrongInterleavingFailure (pack/expected_next/actual_next) and
#8 TimingTooCloseFailure (producer/consumer) through _node_with_pos for
the canonical 'category[N] @ idx=M' rendering, dropping the per-formatter
_idx() closures that read node.name + issued_at directly.

Removes two redundant artifacts:
  1. The duplicated vmfma_index in the prior message — GraphNode.name is
     `f"{category}@{vmfma}.{sub}"` so today's output renders as
     `PackA0@5.2 @ idx=5 has wrong interleaving...` (vmfma printed twice).
     Now: `PackA0[N] @ idx=5 has wrong interleaving...` (single source).
  2. The dual-shape (`issued_at` vs `position`) detection. The structural-
     side ValidatorInstruction emitter was deleted in `ola.4` phase 2
     (Pack rule); both Failures are now graph-side only. _node_with_pos's
     own getattr(node, 'position', None) or node.issued_at fallback
     handles any residual ValidatorInstruction caller without per-formatter
     closures.

Tests rewritten:
  - test_wrong_interleaving_failure_format: drops the MiddlePack_a/b/c
    name-string assertions (vestigial — name was the OLD GraphNode label,
    no longer in the message). Pins the exact new rendering with empty
    capture.
  - test_timing_too_close_failure_format: same shape; pins exact new
    output.
  - 2 new tests added: test_wrong_interleaving_failure_format_with_capture_brackets
    and test_timing_too_close_failure_format_with_capture_brackets — verify
    [N] indexing kicks in when capture is real.

Suite: 565 passed (was 563, +2 new bracket tests) / 2 skipped / 1 xfailed.
Alex-Vasile added a commit that referenced this pull request May 7, 2026
…rows in 8t9 audit

DZL_SCOPE_REASSESSMENT.md §1.1 SCC: adds source-level confirmation
that CommonInstruction (parent of every SCC-touching SOPx class) has
only dst/dst1/srcs fields, with getDstParams()/getSrcParams() bodies
shown verbatim from instruction.hpp:382. Rules out the q9j-style
"binding gap" hypothesis at the source level, not just by Python
dir() probing.

ROCISA_WORKAROUNDS_AUDIT.md: adds a "Status update (2026-05-07)"
preamble that flags which rows have been resolved or reframed by
subsequent investigations:
- Row #3 (_OPERAND_RULES, ~600 LoC estimate) → reassessed by q9j
  reassessment to ~12 lines C++ binding + ~260 LoC validator
  migration (the C++ API already exists).
- Rows #7, #8, #19 (VCC machinery) → REMOVED by uraq; no rocisa-side
  replacement planned. VCC dataflow tracking is permanently dropped
  from the validator's scope.
- MFMA acc special-casing → RECLASSIFIED to q9j Category A (acc and
  acc2 are already in the C++ getDstParams/getSrcParams partition).
- Category C narrowed from SCC/VCC/m0/acc to SCC/m0 only.

The audit's table and prose are preserved as the original artifact;
the preamble lets a reader find the current canonical reframing for
each affected row without rewriting the historical record.
aledudek pushed a commit that referenced this pull request May 20, 2026
# Add gfx950 MXFP4 Subtile-based kernel implementation
## Summary
This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950
mxfp4)
and adds the **Subtile-based kernel implementation
(`UseSubtileImpl=1`)**
for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code
generation path optimized for **MXFP4** and **BF16** GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit
tests
needed to make it production-usable.
## Motivation
PR #6499 brought MX data type support online for gfx950, but the
existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
**greatly improves MXFP4 GEMM performance when using
`HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the
hipBLASLt CHANGELOG).
## What's included
### 1. New Subtile-based kernel components (Tensile)
New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`:
* `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of
  the subtile codegen path; replaces large portions of the standard
  prefetch / unroll / store flow when `UseSubtileImpl=1`.
* `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that
  builds the subtile-grained instruction graph (GR loads, LR offsets,
  MFMA tiles, scale loads, stores) from kernel parameters.
* `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the
logical
  schedule to an emit order respecting wave / register / hazard
  constraints.
* `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission
  helpers shared by the subtile components.
### 2. Kernel writer / common changes
* **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration
points
  for the subtile path — prefetch, GR offset calculation, LR offset
  calculation, post-loop, MFMA macro accounting, optimized `storeD`,
  LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr
  allocation / overflow guards, computeLoadSrd fix.
* **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**:
  introduces the `UseSubtileImpl` parameter, MX-related reject
  conditions for non-Subtile paths on gfx950, and additional valid GEMM
  type combinations for MX inputs.
* **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**,
  **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and
  defaults.
* **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import
  union with the `BufferLoadB32` cache-coherence change from #6837).
* **`Components/GlobalWriteBatch.py`**: optimized global write batching
  for the subtile path (~670 LOC of changes).
* **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**,
  **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**,
  **`KernelWriterModules.py`**: minor adjustments needed by the subtile
  pipeline.
### 3. rocisa / host / client
* **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new
  emitter.
* **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**:
  small fixups for the subtile path and gfx950 build.
* **`client/include/DataInitialization.hpp`**,
**`client/src/DataInitialization.cpp`**,
**`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**,
  **`client/include/TypedId.hpp`**: MX scale init and reference paths
  used by the new tests.
* **`clients/common/include/testing_matmul.hpp`**,
  **`clients/common/include/norm.hpp`**,
  **`clients/common/include/hipblaslt_datatype2string.hpp`**,
  **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1)
  testing and MX init.
### 4. Origami / solution selection (gfx950 MXFP4)
New auto-tuned logic yamls under

`projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/`
covering the FP4 SS / HS / BS variants in three layouts:
* `Origami/` (default)
* `Origami/Origami_nta4/` (no-transpose-A FP4)
* `Origami/Origami_ntb4/` (no-transpose-B FP4)
(9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml`
files in total.)
### 5. New tests
**End-to-end gfx950 GEMM yamls** in
`Tensile/Tests/common/gemm/gfx950/`:
* `subtile_bf16.yaml`, `subtile_mxfp4.yaml`
* `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
* `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
**StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
`sk_mx32f8_quick.yaml`.
**New unit tests** (`Tensile/Tests/unit/`):
* `test_SubtileBasedLogicalScheduler.py` (~1735 LOC)
* `test_SubtileBasedSchedulerRef.py` (~596 LOC)
* `test_gr_lr_roundtrip.py` (~571 LOC)
* `test_storeD_roundtrip.py` (~2420 LOC)
* `test_graTileAssignment.py` (~354 LOC)
* `test_lraTileAssignment.py` (~360 LOC)
* `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC)
**New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`.
### 6. Misc / hardening
* Reject conditions: gfx950 MX + non-Subtile, DepthU constraints,
GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
  for subtile MXFP4, missing-mxblock check for non-MX types.
* Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42).
* `forceDenorm=False` in `generateMXInput` (#11).
* Several rebase fixes, copyright/year header updates, and
review-comment
  fixes to `KernelWriter` / `KernelWriterAssembly`.
### 7. CHANGELOG
Greatly improved MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

## How to use
Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new
`subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path
is
opt-in — non-MX and non-gfx950 kernels are unaffected.
## Backwards compatibility / risk
* All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing
  solutions on other architectures or non-MX paths are unchanged.
* `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO).
* New Origami yamls only add solutions; nothing existing is modified.
## Test plan
* New gtests + unit tests run automatically in CI (Tensilelite Python
  unit suite, `MXDataGen_test`, `MXScalePadding_test`).
* New end-to-end gfx950 GEMM and StreamK yamls are added to the common
  test buckets.
* Manual: run the gfx950 MXFP4 subtile suites
  (`pytest -k gfx950` after building Tensile, plus
  `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity).
## Notes for reviewers
* This branch was rebased onto current `develop` (post-#6499) by
skipping
  the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499
squash-merged) and replaying only the subtile-specific work as a single
  squashed commit. The actual code changes in this PR are limited to the
  files listed above (24 added, 56 modified; ~+170k / −2.6k including
  generated logic yamls).
* The largest reviewable diffs are:
*
`Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py`
(new files)
  * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py`
  * `Tensile/SolutionStructs/{Problem,Solution}.py`
  * `Tensile/Components/{GlobalWriteBatch,StreamK}.py`
  * `clients/common/include/testing_matmul.hpp`
  * `client/src/DataInitialization.cpp`

* Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi <brianshi@amd.com>

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- be69c1d: Enable post-loop code generation, and add some
subroutines ---
Author: b-shi <brianshi@amd.com>

--- 646d102: LR offset calculation (#2) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi <brianshi@amd.com>

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi <brianshi@amd.com>

--- b6127bc: Update GR/LR offset calculation to fully support 2x2,
1x4, 4x1 waveConfigs (#7) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi <brianshi@amd.com>

--- 6edf53d: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for
subtile-based kernel (#6) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi <brianshi@amd.com>

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi <brianshi@amd.com>

--- a45c20c: Fix MX scale tensor initialization: set
forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the
subtile-based kernel (#10) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16)
---
Author: b-shi <brianshi@amd.com>

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi <brianshi@amd.com>

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi <brianshi@amd.com>

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi <brianshi@amd.com>

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi <brianshi@amd.com>

* Enable FixSrd2 for A/B

* Address comments from PR

---------

--- e4c64a7: Add nt libs ---
Author: b-shi <brianshi@amd.com>

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for
unaligned problem sizes (#21) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Add scale padding

* Add tests

* Remove redundant pre-swizzle path

* Remove code from
conflict

* Fix reverted mxdatagen path for tensile tests

* Add diverse test cases for scale padding in MXScalePadding_test and
subtile.yaml
- Expanded test cases to include non-multiple-of-32, even
non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and
subtile_mxfp4.yaml (#22) ---
Author: James Newling <james.newling@gmail.com>

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

  # | Description        | Dest | MIs | PGR | DU      | SK  | Sizes
  --+--------------------+------+-----+-----+---------+-----+------
  0 | BF16 TN main       | b    |  19 |   0 | 64      | 0,3 |  11
  1 | BF16 TN large DU   | b    |   4 |   0 | 128,192 | 0,3 |   7
  2 | BSS (f32 output)   | s    |   6 |   0 | 64      | 0,3 |   9
  3 | BF16 bias          | b    |   2 |   0 | 64      | 0   |   1

FP4 coverage (subtile_mxfp4.yaml):

  # | Description        | Dest | MIs | PGR | DU  | SK  | Sizes | Status
--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
  4 | FP4 PGR=2          | b    |  13 |   2 | 256 | 0   |   5   | new
  5 | FP4 expanded MIWT  | b    |  24 |   0 | 256 | 0   |   5   | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures
(commented)

Run times on gfx950 (8x MI350X):

  File               | NEV=-1 | NEV=0
  -------------------+--------+------
  subtile_bf16.yaml  |    23s |   23s
  subtile_mxfp4.yaml |    37s |   40s

Where NEV is number of elements to validate. I (James) have checked
these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Revert to single partition

* Start using dependencies

* as is

* start using separate EmittedModules

* remove reduntant wait

* Add _extractPathsFromBeforeDeps

* Continue simplification

* Simplifying

* Add more rules

* cleanup

* Add fp4 test

* fix test

* Add tests

* Remove after field on emittedmodule

* Refactoring instructionSchedule

* Add comments

* cleanup modules vs ops

* Refactoring print functions

* Test cleanup

* Add more tests

* Replace subgroup by partition

* Remove unused unroll param

* Add high level notes

* Simplify NLL and NGLL GR removal

* Add some comments

* Force instruction insertion if no slots available

* Fix test after rebase

* Move scale before A/B and track inflight count

* Fine-grain vmcnt calculation

* Separate counts for scaleA and B

* Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar
instruction serialization

* Fix test

* Add vmcnt test

* Fix duplicated loads for 1x4 and 4x1

* Fix placement in reverse order

* Fix regression on PGR0

* add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi <brianshi@amd.com>

--- c5000d3: Fix typo ---
Author: b-shi <brianshi@amd.com>

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2
(#30) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check
for store path (#29) ---
Author: b-shi <brianshi@amd.com>

* [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard,
and refactoring

- Replace Size%MT edge check with subtile-aligned check: NonEdge paired
  store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take
NonEdge.
- Add per-wave OOB guard (subtileM32ValidBlocksSgpr /
subtileN16ValidBlocksSgpr)
  to skip stores outside valid M/N tile bounds in the NonEdge path.
- Refactor duplicated OOB guard into _emitSubtileOobGuard helper;
refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
- Fix orphan scalar store blockIdxM (was tt0, now
(tt0*MatrixInstM)//mBlockSize).
- Add quick-exit and edge/non-edge header comments to generated ASM.

* Add some bias tests, combine M/N guard to single routine

* Add OOB check for C loads, update storeD unit tests to check OOB,
simplify quick exit checks

* Address more PR comments: add M group skip, and skip to store end.
simplified loadC OOB mask

---------

--- 637881a: Fix unit tests & remove legacy code for subtile
interleaving (#33) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix gr_lr_roundtrip test

* Use non-interleaved version as ref code

* Fix scheduler test

* Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization
for batched GEMM (#25) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Fix bacth count issue

* Add batch count tests

* Fix bacth count issue

* Address PR review: clarify FP4-specific byte stride and add
non-aligned batched tests

- Updated comments on dataBatchBytes computation to clarify FP4 packing
  assumption (2 elements/byte) and flag that non-FP4 block-scaling types
  would require updating this conversion.
- Added batched test cases with non-multiple-of-32 M/N dimensions:
  FP4 DU=256: [48,48,2] and [33,65,2]
  FP4 DU=512: [63,63,2]
  BF16: [50,100,2]

---------

--- a43247b: Update some test yamls (#31) ---
Author: b-shi <brianshi@amd.com>

--- e2f69c8: Add f4bs origami library with activation function
support. Refactor sgpr allocation to reduce sgpr usage in post loop.
Store code-path reorganization (#32) ---
Author: b-shi <brianshi@amd.com>

* Free swap/localwritebase sgprs before post-loop

* Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32
(base, nta4, ntb4 variants).

* Remove uneeded alignment and comment

* Add more epilogue tests

* Remove older origami library for f4bs

* Reorder post-loop code blocks to after persistant loop Misc fixes

* Fix build issues, relax longjump sgpr requirements

* Fix GSU0 branch logic

---------

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and
FP4→F32 GEMM (#35) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

* Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

- Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4,
Origami_ntb4
- Update F4BS yaml files: AssertSummationElementMultiple 32→256 for
K%256 enforcement
- Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in
Problem.py

* Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):
- No-bias block: same wavetile and problem size coverage as F4SS
- Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

* Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

---------

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very
large MT (#36) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Initial support for DU>256

* Renaming

* add option to do DU=512 in the tests

* blocked K-major for scale

* Change scaleSet swap logic

* Update print functions

* Put scales after values for avoid race conditions

* Fix tests

* more test

* tweak printschedule display

* Add PGR2 in the yaml tests

* Add new scaleGROp

* comment out failing tests

* Revert "comment out failing tests"

This reverts commit 1f5802c.

* Draft new logical scheduler

* Refactoring

* Add more test on step1

* Add more tests on step1

* add bf16 320x320 test

* reduce step1 code

* Simplify step1 logic

* validate some step1 test

* Fix partition 2x2 test

* more step1 test

* 320x320 BF16 test

* Add test DU512 + partition2x2

* Simplify step1 code

* Add step2 tests

* Fix multi-partition step2

* Add step2 du512, 2x2 partition test

* Use common algo for all numPartitions

* Draft for step3 tests

* remove useless tests

* New GR algo (draft)

* [Step3] Add more test

* Iteration on GR

* Display ordered GR list with granularities

* More test

* Add some comments

* Disable by default debug logs

* Getting rid of step naming

* Start remove AnnotatedOp (still there in group pass)

* Split dependency Ops

* Add todo on place_GRs pass

* Valid test_annotate_deps_1x1_partition_DU256

* Test output looking better (still WIP)

* single dep for LR tooo

* Add remove_cross_deps pass

* Fix bugs in dependency pass

* insert_gr_lr_inc pass

* Add group_lr_gr pass

* Add emit pass

* Quick port of instruction Emit code

* Move emit function to separate file

* Refactoring instructionEmitter

* Port vgprTile tracking

* Reworking second pass (WIP)

* Display unrolling requirement

* Unrolling check on 2nd pass

* Generic validation for assign_vgpr pass

* Fix unroll

* Add inst schedule in standalone mode

* Use lrGran for vgprTile size calculation

* Fix bug in emit pass (missing depencency)

* PreMFMA path + non-duplication scale load

* missing globalReadLDSBufferSwap for GR_INC scales

* add wairlr_sync on all LR->GR dep

* add waitgr_sync op

* remove_unnecessary_gr_deps

* Change LR dispatch algo a bit to avoid too many waitgr_sync

* Avoid duplicated loads in emitter

* Fix bug on gr_emit code

* GrInc pass. fix duplicated insertion for B

* Fix missing LR_inc for SA/SB

* preloop, NLL, NGLL

* Simplify preloop

* minor changes

* Move unroll logic to scheduler

* minor changes

* Fix unroll id bug on NLL / NGLL

* Disable post GRINC for now

* Remove commented code

* Handle 1x4, 4x1 gr read gran

* Fix vmcnt computation

* Use correct grCount mapping

* Revert in emit logic on buffer_load for PGR0 needs

* Add bf16 version in standalone test

* Fix LR_Inc insertion on DU>64

* Add subIterK/Partition comment to codegen

* Fix issue in GrInc placement

* Remove last_mt

* Fix LR MT index bug with muli-partition

* Disable early LDS size check when subtileImpl is on

* Add pass to remove redundant LR deps + fixed issue on dependency
annotation pass

* Remove more LR redundant deps

* Only insert wait_lr_sync on deps

* Simple algo to select partition config

* Remove HC value for partitions...

* Take into account all inflight GR (all tensors)

* Fix tests and regressions on gr counts

* Fix grCount merge calculation

* Better display of dependencies

* Add remove_wait_lr_sync after grouping

* Add temporary non reg file

* Change merge logic on GR grouping pass

* Fix non necessary wait_lr_sync

* Downgrade some waitlr_sync to sync + added 384x256 no reg test

* non reg test 320x320

* Add larger MT

* non reg test for fp4 256x256

* Moving out instructionScheduler

* Remove old scheduler

* Renaming scheduler

* Re-work test

* Add larger MT test cases

* Rename non-ref test

* Re-add standalone mode

* Refactor DepOp

* Remove dead code

* Remove MFMATileSize class

* Remove from_til_info

* Avoid redundant tensor list creation

* Remove hardcode granularities in vgrpTile allocation pass. Simplify
code.

* Re-enable  # PGR=2 WG 4x1/1x4, K > DU tests

* Remove unused GRScaleOp

* DepRef renaming

* Get rid of MT string representation

* Remove TODO

* EmmitedModule simplication

* Use explicit pass dependencies

* Renaming LogicalScheduler

* Remove old test_InterleavingScheduler.py file

* Commenting failing test for now

* Remove debug logs

* Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix and simplify logic for remove_unnecessary_lr_deps

* Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale
layout (#42) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 9c74998: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 842b149: Addressed review comments for KernelWriter and
KernelWriterAssembly ---
Author: Koji Nakajima <knakajim@amd.com>

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- c075bbf: Fix preSolution CPU re-sync regressing
subtile_mxfp4.yaml ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich <brad.nemanich@amd.com>

--- bc2f6dd: Small update for gfx950 mx tests + more - enable
UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950
mxfp8 - use MXScaleFormat=1 as default - set
AssertSummationElementMultiple=256 for subtile mxfp4 - fix
isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi <brianshi@amd.com>

--- 727f8db: tensilelite: add solution reject conditions for
UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi <brianshi@amd.com>

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus,
update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- bbbc553: Update change log ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting -
skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima <knakajim@amd.com>

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi <brianshi@amd.com>

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima <knakajim@amd.com>

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima <knakajim@amd.com>

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi <brianshi@amd.com>

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima <knakajim@amd.com>

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
---
Author: Koji Nakajima <knakajim@amd.com>

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
(nta4,ntb4) ---
Author: Koji Nakajima <knakajim@amd.com>

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

github actions migration Tasks or issues tied to migration to this monorepo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants