Skip to content

Change shared libs name#3

Closed
jayhawk-commits wants to merge 1 commit into
mockupfrom
miopenCI
Closed

Change shared libs name#3
jayhawk-commits wants to merge 1 commit into
mockupfrom
miopenCI

Conversation

@jayhawk-commits
Copy link
Copy Markdown
Collaborator

For testing Jenkins workflows with monorepo

@jayhawk-commits jayhawk-commits deleted the miopenCI branch May 2, 2025 19:59
assistant-librarian Bot pushed a commit that referenced this pull request May 13, 2025
* finished sort keys no size tests

* completed sort pair no size functions

* completed sort pair with size functions

* updated changelog
@jayhawk-commits jayhawk-commits self-assigned this May 17, 2025
@jayhawk-commits jayhawk-commits added the migration Tasks or issues tied to migration to this monorepo label May 17, 2025
assistant-librarian Bot pushed a commit that referenced this pull request May 28, 2025
…/rocrand.h (#3)

* aded float4 out tests

* added double4 tests

* added double2 out tests

* reformated numeric test to only use one structure

* moved new tests to eof

* started implementation on UD with states

* added memory deallocation

* updated to use only one thread

* implemented rest of states UD test

* updated changelog
jayhawk-commits pushed a commit that referenced this pull request Jun 8, 2025
* Enable gfx12 support

[ROCm/hipRAND commit: 881a6bf]
idass1990 pushed a commit that referenced this pull request Jun 13, 2025
Removes the cmake dependency that was added for hip. This was preventing cuda builds of hipBLAS from building.

[ROCm/hipBLAS-common commit: eb8d0fa]
jayhawk-commits pushed a commit that referenced this pull request Jun 20, 2025
* feat: no global working path with asserts

* fix: bad import statement

* fix: different build_tmp dir in cmake than tensile

* fix: incorrect pathing in Tensile benchmarking

* fix: string wrap paths for assert checks

* fix: ensure all paths are built

* fix: missing global parameter warning

* fix: ensure library logic path is built

* fix: client writer paths

* fix: use child path for client libraries

* remove PrintTiming from global params

* remove another assert

* restore PrintTiming

* fix: reviewer comments

* refactor: use env var for @timing decorator

* refactor: remove library print debug

* refactor: remove exit after kernel gen option

* style: remove commented code

* refactor: move common to a module

* feat: remove 'CustomKernelDirectory' global param

* fix: amd clang version global param

* fix: second pass on amd clang version

* fix: imports for bolted on ops scripts

* chore: remove safety asserts

* chore: create more separation in Common

* fix: missing line at eof

* fix: basic reviewer comments

* chore: move global dependenct funcs to GlobalParameters.py

* fix: pass global params to splitArchs

* style: don't format caps funcs

* fix: bad import

* Add TensileLogic program to verify matrix instructions (#3)

* Remove unused and static global parameters (#5)

* fix: purge library-print-debug
* feat: make build paths static
* feat: remove 'SortProblems' global param
* feat: remove 'ExpandRanges' global param
* feat: remove 'WavefrontWidth' global param
* feat: remove 'ValidateLibrary' global param
* feat: remove 'EnableHalf' and 'ClientArgs' global params
* feat: prefer profile decorator over 'Profiler' global param
* chore: remove 'LibraryPrintDebug' from build_client.yaml
* feat: remove 'MaxFileName' from global params

* Update preferred search paths (#4)

* fix: remove files updated by precommit in merge

* fix: remove log file

* Copyright update

---------

Co-authored-by: David Dixon <david.dixon@amd.com>

[ROCm/hipBLASLt commit: f45e5b9]
ammallya pushed a commit that referenced this pull request Sep 24, 2025
* add in cmake targets to run clang check

* add ability to do clang format check via make target

* add CONFIGURE_DEPENDS

* modify target names to suggested ones

* yeet my debugging line
ammallya pushed a commit that referenced this pull request Sep 24, 2025
* add in cmake targets to run clang check

* add ability to do clang format check via make target

* add CONFIGURE_DEPENDS

* modify target names to suggested ones

* yeet my debugging line

[ROCm/hipDNN commit: 069fe9d]
JoeLiuAMD added a commit that referenced this pull request Oct 8, 2025
…nel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

>** 📝 Note**: This PR has follow up PR #1993
## Motivation

MIOpen's generic search algorithm suffers from a **race condition** that
causes optimal kernels to be randomly rejected, leading to 3-4x
performance degradation in some cases.

### Problem Description

When running the same convolution workload multiple times as sample
below:
```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**
- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per
operation (**3.35x slower**)


[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)

[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines
559-564):

```cpp
// Original buggy code
if(n_current == 0)  // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first
configuration tested:
- **First kernel** (n_current == 0): Gets warm-up → Fair performance
measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start
penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the
optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal
kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which
configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:

Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
Kernel #1 (64,64,64,32,8,8...):   1 sample  → 1.219 ms         → rejected (cold-start!)
Kernel #2 (64,64,16,32,8,8...):   1 sample  → 3.0267 ms        → rejected (cold-start!)
Kernel #3 (64,16,64,32,8,8...):   1 sample  → 0.482 ms         → rejected (cold-start!)

Final execution: 0.332 ms (using Kernel #0)

Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
       With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):
-
`DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
- Sample 1: **3.027 ms** ← Cold start! (30x slower than true
performance)
- Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359,
0.347, 0.352 ms
- **True mean**: 0.354 ms (excluding cold-start outlier)
- **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
- **Wrong outcome**: Best kernel discarded due to unfair cold-start
penalty

---

## Technical Details

### Changes

This PR contains **only the bug fix** - removing the unfair warm-up
condition:

```diff
-                // Warm-up run for first time invoker is used
-                if(n_current == 0)
-                {
-                    invoker(profile_h, invoke_ctx);
-                    profile_h.ResetKernelTime();
-                }
+                // Warm-up run for every configuration to eliminate cold-start bias
+                invoker(profile_h, invoke_ctx);
+                profile_h.ResetKernelTime();
```

**File modified:**
`projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564)

**Change summary:**
- 3 insertions(+), 6 deletions(-)
- Removes `if(n_current == 0)` condition
- Ensures every configuration receives one warm-up run before
measurement

### Why This is Low Risk

1. **Minimal code change**: Only 4 lines changed
2. **No algorithm change**: Same sampling strategy, same early-stop
logic
3. **Only ensures fairness**: All configs now receive identical warm-up
treatment
4. **No performance regression**: Adds one extra kernel call per config
(~0.3ms overhead per config)
5. **Negligible overhead**: For typical 4-config search, adds 1.2ms
total (kernel compilation takes 10-30 seconds, so overhead is <0.01%)

---

## Test Plan

### Test Environment
- **Hardware**: MI355X (gfx950)
- **ROCm Version**: 7.0.2 (HIP 7.0.51831)
- **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel
configurations)

### Test Command
```bash
export MIOPEN_LOG_LEVEL=5
export MIOPEN_FIND_MODE=1
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

### Test Results

#### Before Fix (with bug):
- **Success rate**: 6/10 runs selected optimal kernel (40% error rate)
- **Failure pattern**: Optimal kernel rejected when its cold-start time
triggered early-stop
- **Performance impact**: Up to 4x slower when wrong kernel selected
(0.332ms vs 0.099ms)

#### After Fix:
- **Success rate**: 10/10 runs selected optimal kernel (0% error rate)
- **Consistency**: All configurations receive fair warm-up
- **Performance**: Optimal kernel always selected, no degradation
- **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation
time)

---

## Performance Impact

### Search Time Overhead
- **Additional cost**: 1 warm-up run per configuration (only for configs
beyond the first)

### Accuracy Improvement
- **Before**: 60% success rate (6/10 runs correct)
- **After**: 100% success rate (10/10 runs correct)
- **Performance gain**: Eliminates 4x slowdown from selecting wrong
kernel

---

## Backward Compatibility

✅ **Fully compatible** 
- No API changes
- No behavior changes except for fixing the bug
- All existing tests pass
- No impact on already-cached kernels (find database not affected)

---

## Why Split into Two PRs?

Following reviewer feedback, this work has been split into two separate
PRs:

### **PR1 (This PR) - Bug Fix: Warm-up Bias**
- **Risk**: Low (4 lines changed)
- **Impact**: Fixes root cause of unfair kernel comparison
- **Decision**: Ready for immediate merge
- **Rationale**: Without fair warm-up, no amount of threshold tuning can
fix the problem. Cold-start penalties (30-100x slower) make any single
threshold value inadequate.

### **PR2 (Separate PR) - Optimization: Early-Stop Strategy**
- **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993
- **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging
- **Risk**: Medium (affects benchmark timing)
- **Decision**: Needs more validation and benchmarking
- **Rationale**: These optimizations improve accuracy further (40% → 0%
error rate) but add ~2 kernel executions per config. The performance
impact needs separate evaluation.

### Why This Approach?
- **PR1 can merge immediately**: Fixes the critical bug with minimal
risk
- **PR2 can be validated thoroughly**: Performance trade-offs can be
evaluated independently
- **Easier to isolate regressions**: If issues arise, we know which
change caused them
- **Progressive improvement**: Get the bug fix deployed while
optimizations are being validated


---

## Submission Checklist

- [x] I have read and agreed with the [contributing
guidelines](CONTRIBUTING.md)
- [x] The changes are minimal and focused on the bug fix
- [x] All existing tests pass
- [x] The fix has been verified on target hardware (MI355X/gfx950)
- [x] The fix eliminates the non-deterministic kernel selection issue
- [x] No performance regression introduced
- [x] Documentation (commit message) clearly explains the problem and
solution
- [x] Backward compatibility confirmed
- [x] Test data and logs provided for verification
ibrahimw1 added a commit that referenced this pull request Nov 25, 2025
stanleytsang-amd pushed a commit that referenced this pull request Dec 12, 2025
## Motivation

Enable gfx1152 and gfx1153.

## Technical Details

1. combine arrays into tables and use local macros to reduce repetition
(for maintainability)
2. monkey-see-monkey-do wherever `gfx11...` was found

## Test Plan

Build existing ctests for, and run them on, gfx1152 and gfx1153.

## Test Result

### 

<details>
<summary>gfx1152 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 2 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
      Start  2: hip.async_copy
 1/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 2/73 Test  #1: hip.device_api ..............................   Passed    0.02 sec
      Start  4: rocprim.internal_merge_path
 3/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 4/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  9: rocprim.block_discontinuity
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.34 sec
      Start 10: rocprim.bit_cast
 9/73 Test #10: rocprim.bit_cast ............................   Passed    0.02 sec
      Start 11: rocprim.block_exchange
10/73 Test #11: rocprim.block_exchange ......................   Passed    0.73 sec
      Start 12: rocprim.block_histogram
11/73 Test #12: rocprim.block_histogram .....................   Passed    0.54 sec
      Start 13: rocprim.block_load_store
12/73 Test #13: rocprim.block_load_store ....................   Passed    0.44 sec
      Start 14: rocprim.block_sort_merge
13/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
14/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
15/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.03 sec
      Start 17: rocprim.block_radix_sort
16/73 Test #17: rocprim.block_radix_sort ....................   Passed    4.79 sec
      Start 18: rocprim.block_reduce
17/73 Test #18: rocprim.block_reduce ........................   Passed    0.26 sec
      Start 19: rocprim.block_run_length_decode
18/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.54 sec
      Start 20: rocprim.block_scan
19/73 Test #20: rocprim.block_scan ..........................   Passed    0.04 sec
      Start 21: rocprim.block_shuffle
20/73 Test #21: rocprim.block_shuffle .......................   Passed    2.70 sec
      Start 22: rocprim.block_sort_bitonic
21/73 Test  #9: rocprim.block_discontinuity .................   Passed   17.36 sec
      Start 23: rocprim.config_dispatch
22/73 Test #23: rocprim.config_dispatch .....................   Passed    0.09 sec
      Start 24: rocprim.constant_iterator
23/73 Test #24: rocprim.constant_iterator ...................   Passed    0.07 sec
      Start 25: rocprim.counting_iterator
24/73 Test #25: rocprim.counting_iterator ...................   Passed    0.07 sec
      Start 26: rocprim.device_batch_memcpy
25/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.20 sec
      Start 27: rocprim.device_binary_search
26/73 Test #27: rocprim.device_binary_search ................   Passed    0.02 sec
      Start 28: rocprim.device_adjacent_difference
27/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
28/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
29/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
30/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   13.23 sec
      Start 32: rocprim.device_merge
31/73 Test #31: rocprim.device_histogram ....................   Passed    7.85 sec
      Start 33: rocprim.nth_element
32/73 Test #33: rocprim.nth_element .........................   Passed    0.03 sec
      Start 34: rocprim.device_partial_sort
33/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
34/73 Test #35: rocprim.device_reduce .......................   Passed    8.94 sec
      Start 36: rocprim.device_run_length_encode
35/73 Test #32: rocprim.device_merge ........................   Passed   14.05 sec
      Start 37: rocprim.device_search
36/73 Test #37: rocprim.device_search .......................   Passed    0.02 sec
      Start 38: rocprim.device_segmented_radix_sort
37/73 Test #36: rocprim.device_run_length_encode ............   Passed   13.92 sec
      Start 39: rocprim.device_search_n
38/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
39/73 Test #40: rocprim.device_segmented_reduce .............   Passed    5.21 sec
      Start 41: rocprim.device_segmented_scan
40/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
41/73 Test #42: rocprim.device_transform ....................   Passed   13.90 sec
      Start 43: rocprim.discard_iterator
42/73 Test #43: rocprim.discard_iterator ....................   Passed    0.07 sec
      Start 44: rocprim.radix_key_codec
43/73 Test #44: rocprim.radix_key_codec .....................   Pas09:54:11 [55/1943]
      Start 45: rocprim.predicate_iterator
44/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.07 sec
      Start 46: rocprim.reverse_iterator
45/73 Test #46: rocprim.reverse_iterator ....................   Passed    0.09 sec
      Start 47: rocprim.rocprim_tuple
46/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
47/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
48/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
49/73 Test #50: rocprim.thread ..............................   Passed    0.07 sec
      Start 51: rocprim.thread_algos
50/73 Test #51: rocprim.thread_algos ........................   Passed    0.35 sec
      Start 52: rocprim.tuple
51/73 Test #52: rocprim.tuple ...............................   Passed    0.02 sec
      Start 53: rocprim.utils_sort_checker
52/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
53/73 Test #54: rocprim.transform_iterator ..................   Passed    0.11 sec
      Start 55: rocprim.type_traits_interface_cpp17
54/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
55/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
56/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
57/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
58/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
59/73 Test #60: rocprim.intrinsics ..........................   Passed    0.21 sec
      Start 61: rocprim.intrinsics_atomic
60/73 Test #61: rocprim.intrinsics_atomic ...................   Passed    0.02 sec
      Start 62: rocprim.invoke_result
61/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
62/73 Test #63: rocprim.warp_exchange .......................   Passed    0.08 sec
      Start 64: rocprim.warp_load
63/73 Test #64: rocprim.warp_load ...........................   Passed    0.08 sec
      Start 65: rocprim.warp_reduce
64/73 Test #65: rocprim.warp_reduce .........................   Passed    0.14 sec
      Start 66: rocprim.warp_scan
65/73 Test #66: rocprim.warp_scan ...........................   Passed    0.20 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
66/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.21 sec
      Start 68: rocprim.warp_sort
67/73 Test #68: rocprim.warp_sort ...........................   Passed    0.09 sec
      Start 69: rocprim.warp_store
68/73 Test #69: rocprim.warp_store ..........................   Passed    0.02 sec
      Start 70: rocprim.zip_iterator
69/73 Test #70: rocprim.zip_iterator ........................   Passed    0.02 sec
      Start 71: rocprim.accumulator_t
70/73 Test #71: rocprim.accumulator_t .......................   Passed    0.02 sec
      Start 72: hipgraph.basic
71/73 Test #72: hipgraph.basic ..............................   Passed    0.02 sec
      Start 73: hipgraph.algs
72/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec
73/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   31.97 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) =  71.80 sec
✅ test_rocprim.py PASSED
```

</details>



<details>
<summary>gfx1153 passed (click to see log)</summary>

```
INFO:root:++ Exec [/tmp/eble]$ ctest --test-dir /tmp/eble/rocm/bin/rocprim --output-o
n-failure --parallel 1 --exclude-regex 'rocprim.lookback_reproducibility|rocprim.link
ing|rocprim.device_merge_inplace|rocprim.device_merge_sort|rocprim.device_partition|r
ocprim.device_radix_sort|rocprim.device_scan|rocprim.device_select|rocprim.device_fin
d_first_of|rocprim.device_reduce_by_key' --timeout 60
Test project /tmp/eble/rocm/bin/rocprim
      Start  1: hip.device_api
 1/73 Test  #1: hip.device_api ..............................   Passed    0.01 sec
      Start  2: hip.async_copy
 2/73 Test  #2: hip.async_copy ..............................   Passed    0.01 sec
      Start  3: hip.ordered_block_id
 3/73 Test  #3: hip.ordered_block_id ........................   Passed    0.01 sec
      Start  4: rocprim.internal_merge_path
 4/73 Test  #4: rocprim.internal_merge_path .................   Passed    0.01 sec
      Start  5: rocprim.basic_test
 5/73 Test  #5: rocprim.basic_test ..........................   Passed    0.01 sec
      Start  6: rocprim.arg_index_iterator
 6/73 Test  #6: rocprim.arg_index_iterator ..................   Passed    0.01 sec
      Start  7: rocprim.temporary_storage_partitioning
 7/73 Test  #7: rocprim.temporary_storage_partitioning ......   Passed    0.01 sec
      Start  8: rocprim.block_adjacent_difference
 8/73 Test  #8: rocprim.block_adjacent_difference ...........   Passed    2.94 sec
      Start  9: rocprim.block_discontinuity
 9/73 Test  #9: rocprim.block_discontinuity .................   Passed   21.10 sec
      Start 10: rocprim.bit_cast
10/73 Test #10: rocprim.bit_cast ............................   Passed    0.01 sec
      Start 11: rocprim.block_exchange
11/73 Test #11: rocprim.block_exchange ......................   Passed    2.20 sec
      Start 12: rocprim.block_histogram
12/73 Test #12: rocprim.block_histogram .....................   Passed    0.71 sec
      Start 13: rocprim.block_load_store
13/73 Test #13: rocprim.block_load_store ....................   Passed    0.48 sec
      Start 14: rocprim.block_sort_merge
14/73 Test #14: rocprim.block_sort_merge ....................   Passed    0.02 sec
      Start 15: rocprim.block_sort_merge_stable
15/73 Test #15: rocprim.block_sort_merge_stable .............   Passed    0.02 sec
      Start 16: rocprim.block_radix_rank
16/73 Test #16: rocprim.block_radix_rank ....................   Passed    0.02 sec
      Start 17: rocprim.block_radix_sort
17/73 Test #17: rocprim.block_radix_sort ....................   Passed    6.12 sec
      Start 18: rocprim.block_reduce
18/73 Test #18: rocprim.block_reduce ........................   Passed    0.31 sec
      Start 19: rocprim.block_run_length_decode
19/73 Test #19: rocprim.block_run_length_decode .............   Passed    0.68 sec
      Start 20: rocprim.block_scan
20/73 Test #20: rocprim.block_scan ..........................   Passed    0.03 sec
      Start 21: rocprim.block_shuffle
21/73 Test #21: rocprim.block_shuffle .......................   Passed    3.63 sec
      Start 22: rocprim.block_sort_bitonic
22/73 Test #22: rocprim.block_sort_bitonic ..................   Passed   19.34 sec
      Start 23: rocprim.config_dispatch
23/73 Test #23: rocprim.config_dispatch .....................   Passed    0.10 sec
      Start 24: rocprim.constant_iterator
24/73 Test #24: rocprim.constant_iterator ...................   Passed    0.09 sec
      Start 25: rocprim.counting_iterator
25/73 Test #25: rocprim.counting_iterator ...................   Passed    0.09 sec
      Start 26: rocprim.device_batch_memcpy
26/73 Test #26: rocprim.device_batch_memcpy .................   Passed    1.42 sec
      Start 27: rocprim.device_binary_search
27/73 Test #27: rocprim.device_binary_search ................   Passed    0.01 sec
      Start 28: rocprim.device_adjacent_difference
28/73 Test #28: rocprim.device_adjacent_difference ..........   Passed    0.01 sec
      Start 29: rocprim.device_adjacent_find
29/73 Test #29: rocprim.device_adjacent_find ................   Passed    0.01 sec
      Start 30: rocprim.device_find_end
30/73 Test #30: rocprim.device_find_end .....................   Passed    0.01 sec
      Start 31: rocprim.device_histogram
31/73 Test #31: rocprim.device_histogram ....................   Passed    8.77 sec
      Start 32: rocprim.device_merge
32/73 Test #32: rocprim.device_merge ........................   Passed   16.23 sec
      Start 33: rocprim.nth_element
33/73 Test #33: rocprim.nth_element .........................   Passed    0.01 sec
      Start 34: rocprim.device_partial_sort
34/73 Test #34: rocprim.device_partial_sort .................   Passed    0.02 sec
      Start 35: rocprim.device_reduce
35/73 Test #35: rocprim.device_reduce .......................   Passed   10.92 sec
      Start 36: rocprim.device_run_length_encode
36/73 Test #36: rocprim.device_run_length_encode ............   Passed   14.34 sec
      Start 37: rocprim.device_search
37/73 Test #37: rocprim.device_search .......................   Passed    0.01 sec
      Start 38: rocprim.device_segmented_radix_sort
38/73 Test #38: rocprim.device_segmented_radix_sort .........   Passed   38.28 sec
      Start 39: rocprim.device_search_n
39/73 Test #39: rocprim.device_search_n .....................   Passed    0.02 sec
      Start 40: rocprim.device_segmented_reduce
40/73 Test #40: rocprim.device_segmented_reduce .............   Passed    7.19 sec
      Start 41: rocprim.device_segmented_scan
41/73 Test #41: rocprim.device_segmented_scan ...............   Passed    0.02 sec
      Start 42: rocprim.device_transform
42/73 Test #42: rocprim.device_transform ....................   Passed   17.64 sec
      Start 43: rocprim.discard_iterator
43/73 Test #43: rocprim.discard_iterator ....................   Passed    0.12 sec
      Start 44: rocprim.radix_key_codec
44/73 Test #44: rocprim.radix_key_codec .....................   Passed    0.01 sec
      Start 45: rocprim.predicate_iterator
45/73 Test #45: rocprim.predicate_iterator ..................   Passed    0.08 sec
      Start 46: rocprim.reverse_iterator
46/73 Test #46: rocprim.reverse_iterator ....................   Pas10:13:26 [46/1844]
      Start 47: rocprim.rocprim_tuple
47/73 Test #47: rocprim.rocprim_tuple .......................   Passed    0.01 sec
      Start 48: rocprim.rocprim_types
48/73 Test #48: rocprim.rocprim_types .......................   Passed    0.01 sec
      Start 49: rocprim.texture_cache_iterator
49/73 Test #49: rocprim.texture_cache_iterator ..............   Passed    0.01 sec
      Start 50: rocprim.thread
50/73 Test #50: rocprim.thread ..............................   Passed    0.08 sec
      Start 51: rocprim.thread_algos
51/73 Test #51: rocprim.thread_algos ........................   Passed    0.43 sec
      Start 52: rocprim.tuple
52/73 Test #52: rocprim.tuple ...............................   Passed    0.01 sec
      Start 53: rocprim.utils_sort_checker
53/73 Test #53: rocprim.utils_sort_checker ..................   Passed    0.01 sec
      Start 54: rocprim.transform_iterator
54/73 Test #54: rocprim.transform_iterator ..................   Passed    0.12 sec
      Start 55: rocprim.type_traits_interface_cpp17
55/73 Test #55: rocprim.type_traits_interface_cpp17 .........   Passed    0.01 sec
      Start 56: rocprim.type_traits_interface_gnupp17
56/73 Test #56: rocprim.type_traits_interface_gnupp17 .......   Passed    0.01 sec
      Start 57: rocprim.type_traits_interface_cpp20
57/73 Test #57: rocprim.type_traits_interface_cpp20 .........   Passed    0.01 sec
      Start 58: rocprim.type_traits_interface_gnupp20
58/73 Test #58: rocprim.type_traits_interface_gnupp20 .......   Passed    0.01 sec
      Start 59: rocprim.no_half_operators
59/73 Test #59: rocprim.no_half_operators ...................   Passed    0.01 sec
      Start 60: rocprim.intrinsics
60/73 Test #60: rocprim.intrinsics ..........................   Passed    0.29 sec
      Start 61: rocprim.intrinsics_atomic
61/73 Test #61: rocprim.intrinsics_atomic ...................   Pas10:13:27 [16/1844]
      Start 62: rocprim.invoke_result
62/73 Test #62: rocprim.invoke_result .......................   Passed    0.01 sec
      Start 63: rocprim.warp_exchange
63/73 Test #63: rocprim.warp_exchange .......................   Passed    0.09 sec
      Start 64: rocprim.warp_load
64/73 Test #64: rocprim.warp_load ...........................   Passed    0.09 sec
      Start 65: rocprim.warp_reduce
65/73 Test #65: rocprim.warp_reduce .........................   Passed    0.17 sec
      Start 66: rocprim.warp_scan
66/73 Test #66: rocprim.warp_scan ...........................   Passed    0.26 sec
      Start 67: rocprim.warp_scan_disable_dpp_disable_dpp
67/73 Test #67: rocprim.warp_scan_disable_dpp_disable_dpp ...   Passed    0.26 sec
      Start 68: rocprim.warp_sort
68/73 Test #68: rocprim.warp_sort ...........................   Passed    0.10 sec
      Start 69: rocprim.warp_store
69/73 Test #69: rocprim.warp_store ..........................   Passed    0.01 sec
      Start 70: rocprim.zip_iterator
70/73 Test #70: rocprim.zip_iterator ........................   Passed    0.01 sec
      Start 71: rocprim.accumulator_t
71/73 Test #71: rocprim.accumulator_t .......................   Passed    0.01 sec
      Start 72: hipgraph.basic
72/73 Test #72: hipgraph.basic ..............................   Passed    0.01 sec
      Start 73: hipgraph.algs
73/73 Test #73: hipgraph.algs ...............................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 73

Total Test time (real) = 175.34 sec
✅ test_rocprim.py PASSED
```

<detail>


## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
evetsso pushed a commit to evetsso/rocm-libraries that referenced this pull request Dec 31, 2025
amcamd pushed a commit that referenced this pull request Jan 9, 2026
## Summary

Improved memory management in `host_alloc.cpp` by refactoring the
deallocation logic for better efficiency and adding diagnostic warnings
for untracked pointer frees.

## Changes

1. **Refactored `free_ptr_use()` to use iterator-based access** (reduces
redundant map lookups)
2. **Added warning for freeing untracked pointers** (helps detect
potential memory corruption issues)

## Motivation

The current implementation of `free_ptr_use()` in `host_alloc.cpp` had
two issues:

1. **Performance:** Redundant map lookups when deallocating tracked
memory
2. **Correctness:** Using `operator[]` on a map can unintentionally
insert entries for non-existent keys
3. **Diagnostics:** Silent failures when attempting to free untracked
pointers made debugging difficult

This PR addresses all three issues by refactoring the lookup logic and
adding diagnostic warnings.

## Technical Details

**File:** `projects/rocblas/clients/common/host_alloc.cpp`

### Change 1: Iterator-based memory deallocation

**Before:**
```cpp
if(ptr && mem_allocated[ptr]) {           // Lookup #1 (may insert!)
    mem_used -= mem_allocated[ptr];       // Lookup #2
    mem_allocated.erase(ptr);             // Lookup #3
}
```

**After:**
```cpp
auto it = mem_allocated.find(ptr);        // Single lookup
if(ptr && it != mem_allocated.end()) {
    mem_used -= it->second;               // Use cached iterator
    mem_allocated.erase(it);              // Use cached iterator
}
```

**Benefits:**
- Reduces map lookups from 3 to 1
- Prevents unintended map insertions via `operator[]`
- More efficient memory tracking

### Change 2: Diagnostic warning for untracked pointers

**Added:**
```cpp
else if(ptr && call_free)
{
    rocblas_cerr << "Warning: Freeing untracked pointer " << ptr
                 << " - untracked memory released (potential double-free or memory corruption)"
                 << std::endl;
}
```

**Benefits:**
- Helps detect double-free bugs
- Identifies potential memory corruption issues
- Provides actionable diagnostic information during testing

## Test Plan

- Built rocBLAS with no compilation errors
- Existing client test suite exercises memory allocation/deallocation
paths
- No functional behavior changes to normal operation
- Warning messages help identify issues during debugging

## Test Result

- All existing tests continue to pass
- Memory tracking functionality unchanged
- Warning properly triggers for untracked pointer frees
- No behavioral differences for correctly tracked memory

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
BrianHarrisonAMD added a commit that referenced this pull request Jan 30, 2026
#3710)

## Motivation

Optimizing the tensor filling functions started a discussion about
optimizing tensor iteration in general:
#3471 (comment)

## Technical Details

After some deliberation, the approach taken here (using std::variant
inside the iterator to represent the different types of indexing)
reflects both the desire the improve iteration in the case of packed
tensors while also maintaining the existing API.

A fully templated approach would be more optimal but would require API
changes to the ITensor class itself, whether making it templated or
changing the definition of its iterator-related methods at the very
least.

## Test Plan

Ran ninja check inside the build folder of hipDNN.

## Test Result

```
[185/187] Running all tests via ctest
Test project /therock/output/build/ml-libs/hipDNN/build
    Start 1: hipdnn_data_sdk_tests
1/7 Test #1: hipdnn_data_sdk_tests ............   Passed    1.30 sec
    Start 2: hipdnn_backend_tests
2/7 Test #2: hipdnn_backend_tests .............   Passed    1.29 sec
    Start 3: hipdnn_frontend_tests
3/7 Test #3: hipdnn_frontend_tests ............   Passed    0.03 sec
    Start 4: hipdnn_test_sdk_tests
4/7 Test #4: hipdnn_test_sdk_tests ............   Passed    4.32 sec
    Start 5: hipdnn_plugin_sdk_tests
5/7 Test #5: hipdnn_plugin_sdk_tests ..........   Passed    0.03 sec
    Start 6: public_hipdnn_backend_tests
6/7 Test #6: public_hipdnn_backend_tests ......   Passed    0.33 sec
    Start 7: public_hipdnn_frontend_tests
7/7 Test #7: public_hipdnn_frontend_tests .....   Passed    0.26 sec

100% tests passed, 0 tests failed out of 7

Label Time Summary:
integration_test    =   0.59 sec*proc (2 tests)
unit_test           =   6.96 sec*proc (5 tests)

Total Test time (real) =   7.56 sec
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
sebvince referenced this pull request in sebvince/rocm-libraries Mar 23, 2026
…ernel (ROCm#6)

* Add sample subtile impl

* Fix issues when disabling subtile impl

* GR Offset calculation (#1)

* Add sample subtile impl

* Move allocOffsetRegisters before setupNewTile

* Start adding GR offset calculation

* Rest of logic (no swizzling)

* refacto

* spgr offsets

* Add newserial code

* Add script to debug offsets

* Add unit test for GR offset calculation

* Grid display

* Fix both code and ref test function

* Add DPP quad perm to rocisa

* Apply swizzling (no rotation yet)

* Function swizzling + rotation + test

* Refactor test to have a single output array + add test for SGPRs

* Add debug mode to test + add dynamic wavegroup calculation based on MT

* Fix test runtime issue and check all vgpr offsets

* Add ref test code for 1x4 & 4x1

* Fix tests

* Fixed SGPR offset calculation for 2x2

* Fix more tests

* Add more tests

* Refactor tests

* simplify tests

* Remove unused script

* cleanup

* fix camelCase in ref test code

* cleanup

* Fix typo

---------

Co-authored-by: brianshi <brianshi@amd.com>

* Enable post-loop code generation, and add some subroutines

* LR offset calculation (#2)

* Add tests

* as is

* Add permlane16_swap instruction to rocisa

* Ongoing progress

* Draft for partition A0/A1

* Wave partitioning

* Draft ref code in tests

* Handle 1x4 wavesplit param

* 2x2 test passing

* Draft 1x4 LR wave partitioning

* Fix alginement issue

* Integration testing

* Update integration test

* Fix swizzling pattern on GRA. Only swizzling on even LDS rows

* Subtile based test

* testing A

* Test both A and B

* Remove graonly mode

* Fix 1x4 case

* Move global offset for B after rest of the logic

* cleanup

* cleanup

* Fix ref test code for 4x1

* Fix spgr alloc issue

* Remove tmp test file

* Remove debug prints

* Add test case

* Add GR load emit logic, and misc fixes (#3)

* gr emit fix

* Emit LR + init ACCVGPR (#4)

* Emit ds_reads

* Add waits for LR and GR

* Init Acc VGPR to Zero

* Add missing bit_length on VLShiftLeftB32

* Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness

* Fix gra test ref code for 1x4

* Remove some debug prints

* Add loop and ptr update code

* Update scale offset

* Add tests

* Address review

* Add scale roundtrip e2e test and constraint assertions

Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency
across 4 tile configs x 2 matrices. Add power-of-2 assertion for
scaleBlockSize and matching scaleBlockSize assertions for A/B in
shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes
instead of re-deriving MIWaveGroup from tile dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fixes

* Fix scale being skipped

* Add flag to print layout

* Fix missed merge conflicts

* Fix missed merge conflicts

* Refactor scale rountrip test with gpu helper fns

* Fix extra spaces

* Fix tests

---------

Co-authored-by: brianshi <brianshi@amd.com>
Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com>
Co-authored-by: b-shi <bbbrianme@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
nakajee pushed a commit to nakajee/rocm-libraries that referenced this pull request Mar 31, 2026
nakajee pushed a commit to nakajee/rocm-libraries that referenced this pull request Mar 31, 2026
…ernel (ROCm#6)

* Add sample subtile impl

* Fix issues when disabling subtile impl

* GR Offset calculation (ROCm#1)

* Add sample subtile impl

* Move allocOffsetRegisters before setupNewTile

* Start adding GR offset calculation

* Rest of logic (no swizzling)

* refacto

* spgr offsets

* Add newserial code

* Add script to debug offsets

* Add unit test for GR offset calculation

* Grid display

* Fix both code and ref test function

* Add DPP quad perm to rocisa

* Apply swizzling (no rotation yet)

* Function swizzling + rotation + test

* Refactor test to have a single output array + add test for SGPRs

* Add debug mode to test + add dynamic wavegroup calculation based on MT

* Fix test runtime issue and check all vgpr offsets

* Add ref test code for 1x4 & 4x1

* Fix tests

* Fixed SGPR offset calculation for 2x2

* Fix more tests

* Add more tests

* Refactor tests

* simplify tests

* Remove unused script

* cleanup

* fix camelCase in ref test code

* cleanup

* Fix typo

---------

Co-authored-by: brianshi <brianshi@amd.com>

* Enable post-loop code generation, and add some subroutines

* LR offset calculation (ROCm#2)

* Add tests

* as is

* Add permlane16_swap instruction to rocisa

* Ongoing progress

* Draft for partition A0/A1

* Wave partitioning

* Draft ref code in tests

* Handle 1x4 wavesplit param

* 2x2 test passing

* Draft 1x4 LR wave partitioning

* Fix alginement issue

* Integration testing

* Update integration test

* Fix swizzling pattern on GRA. Only swizzling on even LDS rows

* Subtile based test

* testing A

* Test both A and B

* Remove graonly mode

* Fix 1x4 case

* Move global offset for B after rest of the logic

* cleanup

* cleanup

* Fix ref test code for 4x1

* Fix spgr alloc issue

* Remove tmp test file

* Remove debug prints

* Add test case

* Add GR load emit logic, and misc fixes (ROCm#3)

* gr emit fix

* Emit LR + init ACCVGPR (ROCm#4)

* Emit ds_reads

* Add waits for LR and GR

* Init Acc VGPR to Zero

* Add missing bit_length on VLShiftLeftB32

* Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness

* Fix gra test ref code for 1x4

* Remove some debug prints

* Add loop and ptr update code

* Update scale offset

* Add tests

* Address review

* Add scale roundtrip e2e test and constraint assertions

Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency
across 4 tile configs x 2 matrices. Add power-of-2 assertion for
scaleBlockSize and matching scaleBlockSize assertions for A/B in
shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes
instead of re-deriving MIWaveGroup from tile dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fixes

* Fix scale being skipped

* Add flag to print layout

* Fix missed merge conflicts

* Fix missed merge conflicts

* Refactor scale rountrip test with gpu helper fns

* Fix extra spaces

* Fix tests

---------

Co-authored-by: brianshi <brianshi@amd.com>
Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com>
Co-authored-by: b-shi <bbbrianme@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
sebvince referenced this pull request in sebvince/rocm-libraries Apr 3, 2026
sebvince referenced this pull request in sebvince/rocm-libraries Apr 3, 2026
…ernel (ROCm#6)

* Add sample subtile impl

* Fix issues when disabling subtile impl

* GR Offset calculation (#1)

* Add sample subtile impl

* Move allocOffsetRegisters before setupNewTile

* Start adding GR offset calculation

* Rest of logic (no swizzling)

* refacto

* spgr offsets

* Add newserial code

* Add script to debug offsets

* Add unit test for GR offset calculation

* Grid display

* Fix both code and ref test function

* Add DPP quad perm to rocisa

* Apply swizzling (no rotation yet)

* Function swizzling + rotation + test

* Refactor test to have a single output array + add test for SGPRs

* Add debug mode to test + add dynamic wavegroup calculation based on MT

* Fix test runtime issue and check all vgpr offsets

* Add ref test code for 1x4 & 4x1

* Fix tests

* Fixed SGPR offset calculation for 2x2

* Fix more tests

* Add more tests

* Refactor tests

* simplify tests

* Remove unused script

* cleanup

* fix camelCase in ref test code

* cleanup

* Fix typo

---------

Co-authored-by: brianshi <brianshi@amd.com>

* Enable post-loop code generation, and add some subroutines

* LR offset calculation (#2)

* Add tests

* as is

* Add permlane16_swap instruction to rocisa

* Ongoing progress

* Draft for partition A0/A1

* Wave partitioning

* Draft ref code in tests

* Handle 1x4 wavesplit param

* 2x2 test passing

* Draft 1x4 LR wave partitioning

* Fix alginement issue

* Integration testing

* Update integration test

* Fix swizzling pattern on GRA. Only swizzling on even LDS rows

* Subtile based test

* testing A

* Test both A and B

* Remove graonly mode

* Fix 1x4 case

* Move global offset for B after rest of the logic

* cleanup

* cleanup

* Fix ref test code for 4x1

* Fix spgr alloc issue

* Remove tmp test file

* Remove debug prints

* Add test case

* Add GR load emit logic, and misc fixes (#3)

* gr emit fix

* Emit LR + init ACCVGPR (#4)

* Emit ds_reads

* Add waits for LR and GR

* Init Acc VGPR to Zero

* Add missing bit_length on VLShiftLeftB32

* Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness

* Fix gra test ref code for 1x4

* Remove some debug prints

* Add loop and ptr update code

* Update scale offset

* Add tests

* Address review

* Add scale roundtrip e2e test and constraint assertions

Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency
across 4 tile configs x 2 matrices. Add power-of-2 assertion for
scaleBlockSize and matching scaleBlockSize assertions for A/B in
shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes
instead of re-deriving MIWaveGroup from tile dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fixes

* Fix scale being skipped

* Add flag to print layout

* Fix missed merge conflicts

* Fix missed merge conflicts

* Refactor scale rountrip test with gpu helper fns

* Fix extra spaces

* Fix tests

---------

Co-authored-by: brianshi <brianshi@amd.com>
Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com>
Co-authored-by: b-shi <bbbrianme@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
BrianHarrisonAMD added a commit that referenced this pull request Apr 7, 2026
- Convert /// comments to /** @brief */ Doxygen format on all 12
  serialize/deserialize/to_*/from_* public API methods (#4)
- Add namespace closing comment to IncompatibleBackend.hpp (#11)
- Add BinarySerializeNullSize test for null graphByteSize pointer (#9)
- Add BinaryDeserializeCorruptedData test through C API (#12)
- Replace fragile pop_back() null terminator removal with clean
  resize(graphByteSize - 1) in JSON serialize (#3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sebvince referenced this pull request in sebvince/rocm-libraries Apr 9, 2026
sebvince referenced this pull request in sebvince/rocm-libraries Apr 9, 2026
…ernel (ROCm#6)

* Add sample subtile impl

* Fix issues when disabling subtile impl

* GR Offset calculation (#1)

* Add sample subtile impl

* Move allocOffsetRegisters before setupNewTile

* Start adding GR offset calculation

* Rest of logic (no swizzling)

* refacto

* spgr offsets

* Add newserial code

* Add script to debug offsets

* Add unit test for GR offset calculation

* Grid display

* Fix both code and ref test function

* Add DPP quad perm to rocisa

* Apply swizzling (no rotation yet)

* Function swizzling + rotation + test

* Refactor test to have a single output array + add test for SGPRs

* Add debug mode to test + add dynamic wavegroup calculation based on MT

* Fix test runtime issue and check all vgpr offsets

* Add ref test code for 1x4 & 4x1

* Fix tests

* Fixed SGPR offset calculation for 2x2

* Fix more tests

* Add more tests

* Refactor tests

* simplify tests

* Remove unused script

* cleanup

* fix camelCase in ref test code

* cleanup

* Fix typo

---------

Co-authored-by: brianshi <brianshi@amd.com>

* Enable post-loop code generation, and add some subroutines

* LR offset calculation (#2)

* Add tests

* as is

* Add permlane16_swap instruction to rocisa

* Ongoing progress

* Draft for partition A0/A1

* Wave partitioning

* Draft ref code in tests

* Handle 1x4 wavesplit param

* 2x2 test passing

* Draft 1x4 LR wave partitioning

* Fix alginement issue

* Integration testing

* Update integration test

* Fix swizzling pattern on GRA. Only swizzling on even LDS rows

* Subtile based test

* testing A

* Test both A and B

* Remove graonly mode

* Fix 1x4 case

* Move global offset for B after rest of the logic

* cleanup

* cleanup

* Fix ref test code for 4x1

* Fix spgr alloc issue

* Remove tmp test file

* Remove debug prints

* Add test case

* Add GR load emit logic, and misc fixes (#3)

* gr emit fix

* Emit LR + init ACCVGPR (#4)

* Emit ds_reads

* Add waits for LR and GR

* Init Acc VGPR to Zero

* Add missing bit_length on VLShiftLeftB32

* Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness

* Fix gra test ref code for 1x4

* Remove some debug prints

* Add loop and ptr update code

* Update scale offset

* Add tests

* Address review

* Add scale roundtrip e2e test and constraint assertions

Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency
across 4 tile configs x 2 matrices. Add power-of-2 assertion for
scaleBlockSize and matching scaleBlockSize assertions for A/B in
shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes
instead of re-deriving MIWaveGroup from tile dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fixes

* Fix scale being skipped

* Add flag to print layout

* Fix missed merge conflicts

* Fix missed merge conflicts

* Refactor scale rountrip test with gpu helper fns

* Fix extra spaces

* Fix tests

---------

Co-authored-by: brianshi <brianshi@amd.com>
Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com>
Co-authored-by: b-shi <bbbrianme@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
#8, stacked on #1, #3]

Template solveCPUFastInF32 on MathOpAccumT so the inner reduction
applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the
XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU
dispatch to branch on f32XdlMathOp.

When MathOpAccumT=float (default), casts are no-ops with identical
codegen to the previous implementation.

Changes:
- solveCPUFastInF32 gains template<MathOpAccumT=float>
- innerReduction uses float(MathOpAccumT(val)) cast chain
- isFastPathEligible no longer rejects XFloat32
- SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32
- 10 new fast-path tf32 tests (20 total tf32 tests now)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
…#3, #5, stacked on #1, #2, #5]

Generalize the fast path to support double-precision accumulation:

- Template ShadowBuffer<AccumT>: storage, pointer, and element access
  are all AccumT. Float/Double inputs zero-copy when AccumT matches;
  sub-float types go through float then widen.
- Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType>
  (renamed from loadToFloat/storeFromFloat).
- Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>:
  all tile registers, inner reduction, epilogue, alpha/beta extraction,
  bias reading, and activation args use AccumT.
- Add Double to isFastPathEligible's supported input/output types.
- SolveGemmCPU dispatch: route Double to solveCPUFast<double>.
- Add 10 f64 fast-path tests (transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
#8, stacked on #1, #3]

Template solveCPUFastInF32 on MathOpAccumT so the inner reduction
applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the
XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU
dispatch to branch on f32XdlMathOp.

When MathOpAccumT=float (default), casts are no-ops with identical
codegen to the previous implementation.

Changes:
- solveCPUFastInF32 gains template<MathOpAccumT=float>
- innerReduction uses float(MathOpAccumT(val)) cast chain
- isFastPathEligible no longer rejects XFloat32
- SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32
- 10 new fast-path tf32 tests (20 total tf32 tests now)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
…#3, #5, stacked on #1, #2, #5]

Generalize the fast path to support double-precision accumulation:

- Template ShadowBuffer<AccumT>: storage, pointer, and element access
  are all AccumT. Float/Double inputs zero-copy when AccumT matches;
  sub-float types go through float then widen.
- Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType>
  (renamed from loadToFloat/storeFromFloat).
- Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>:
  all tile registers, inner reduction, epilogue, alpha/beta extraction,
  bias reading, and activation args use AccumT.
- Add Double to isFastPathEligible's supported input/output types.
- SolveGemmCPU dispatch: route Double to solveCPUFast<double>.
- Add 10 f64 fast-path tests (transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DDEle added a commit that referenced this pull request Apr 24, 2026
## Motivation

Fork PRs fail CI when `RUN_AITER_TESTS` or `RUN_FA_TESTS` is enabled.
The docker scripts run `git clone -b "$CK_*_BRANCH"
https://github.com/ROCm/rocm-libraries.git`, but a fork's branch doesn't
exist upstream:

```
fatal: Remote branch <fork-branch> not found in upstream origin
```

Example: [PR #6529 build
#4](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6529/4/pipeline).

## Technical Details

**`Jenkinsfile`** — for PRs, use the upstream-visible PR ref instead of
the head branch name:

```groovy
CURRENT_BRANCH_NAME = env.CHANGE_ID
    ? "refs/pull/${env.CHANGE_ID}/head"
    : (env.CHANGE_BRANCH ? env.CHANGE_BRANCH : env.BRANCH_NAME)
```

**`Dockerfile.aiter` / `Dockerfile.fa`** — `git clone -b <ref>` only
accepts branches (`refs/heads/*`) and tags (`refs/tags/*`), so it can't
resolve `refs/pull/N/head`. Switch to `git fetch`, which accepts any
refspec (and still works for plain branch names):

```sh
mkdir rocm-libraries && cd rocm-libraries
git init -q
git remote add origin https://github.com/ROCm/rocm-libraries.git
git fetch --depth 1 --filter=blob:none origin "$CK_*_BRANCH"
git sparse-checkout init --cone
git sparse-checkout set projects/composablekernel
git checkout FETCH_HEAD
```

`git checkout FETCH_HEAD` lands in detached HEAD, which breaks the
existing `git branch -m "$CK_*_BRANCH"` (and that name isn't a valid
local branch anyway). Decouple the local branch name from the upstream
ref:

- Replace `git init` + `git branch -m` with `git init -b
"$LOCAL_BRANCH"` (requires git ≥ 2.28, satisfied by base images)
- `LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"` in the rocm-libraries
path; `LOCAL_BRANCH="$CK_*_BRANCH"` in the fallback
- Downstream `git clone -b ... ../ck` uses `$LOCAL_BRANCH`

## Test Plan

Manually trigger a build on this PR with `RUN_AITER_TESTS=true` and
`RUN_FA_TESTS=true`; both docker images should build end-to-end.

## Test Result
[jenkins / rocm-libraries-folder/Composable Kernel / PR-6701 /
#3](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6701/3/pipeline/)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
bnemanich added a commit that referenced this pull request May 3, 2026
# Add gfx950 MXFP4 Subtile-based kernel implementation
## Summary
This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950
mxfp4)
and adds the **Subtile-based kernel implementation
(`UseSubtileImpl=1`)**
for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code
generation path optimized for **MXFP4** and **BF16** GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit
tests
needed to make it production-usable.
## Motivation
PR #6499 brought MX data type support online for gfx950, but the
existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
**greatly improves MXFP4 GEMM performance when using
`HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the
hipBLASLt CHANGELOG).
## What's included
### 1. New Subtile-based kernel components (Tensile)
New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`:
* `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of
  the subtile codegen path; replaces large portions of the standard
  prefetch / unroll / store flow when `UseSubtileImpl=1`.
* `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that
  builds the subtile-grained instruction graph (GR loads, LR offsets,
  MFMA tiles, scale loads, stores) from kernel parameters.
* `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the
logical
  schedule to an emit order respecting wave / register / hazard
  constraints.
* `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission
  helpers shared by the subtile components.
### 2. Kernel writer / common changes
* **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration
points
  for the subtile path — prefetch, GR offset calculation, LR offset
  calculation, post-loop, MFMA macro accounting, optimized `storeD`,
  LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr
  allocation / overflow guards, computeLoadSrd fix.
* **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**:
  introduces the `UseSubtileImpl` parameter, MX-related reject
  conditions for non-Subtile paths on gfx950, and additional valid GEMM
  type combinations for MX inputs.
* **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**,
  **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and
  defaults.
* **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import
  union with the `BufferLoadB32` cache-coherence change from #6837).
* **`Components/GlobalWriteBatch.py`**: optimized global write batching
  for the subtile path (~670 LOC of changes).
* **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**,
  **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**,
  **`KernelWriterModules.py`**: minor adjustments needed by the subtile
  pipeline.
### 3. rocisa / host / client
* **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new
  emitter.
* **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**:
  small fixups for the subtile path and gfx950 build.
* **`client/include/DataInitialization.hpp`**,
**`client/src/DataInitialization.cpp`**,
**`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**,
  **`client/include/TypedId.hpp`**: MX scale init and reference paths
  used by the new tests.
* **`clients/common/include/testing_matmul.hpp`**,
  **`clients/common/include/norm.hpp`**,
  **`clients/common/include/hipblaslt_datatype2string.hpp`**,
  **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1)
  testing and MX init.
### 4. Origami / solution selection (gfx950 MXFP4)
New auto-tuned logic yamls under

`projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/`
covering the FP4 SS / HS / BS variants in three layouts:
* `Origami/` (default)
* `Origami/Origami_nta4/` (no-transpose-A FP4)
* `Origami/Origami_ntb4/` (no-transpose-B FP4)
(9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml`
files in total.)
### 5. New tests
**End-to-end gfx950 GEMM yamls** in
`Tensile/Tests/common/gemm/gfx950/`:
* `subtile_bf16.yaml`, `subtile_mxfp4.yaml`
* `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
* `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
**StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
`sk_mx32f8_quick.yaml`.
**New unit tests** (`Tensile/Tests/unit/`):
* `test_SubtileBasedLogicalScheduler.py` (~1735 LOC)
* `test_SubtileBasedSchedulerRef.py` (~596 LOC)
* `test_gr_lr_roundtrip.py` (~571 LOC)
* `test_storeD_roundtrip.py` (~2420 LOC)
* `test_graTileAssignment.py` (~354 LOC)
* `test_lraTileAssignment.py` (~360 LOC)
* `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC)
**New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`.
### 6. Misc / hardening
* Reject conditions: gfx950 MX + non-Subtile, DepthU constraints,
GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
  for subtile MXFP4, missing-mxblock check for non-MX types.
* Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42).
* `forceDenorm=False` in `generateMXInput` (#11).
* Several rebase fixes, copyright/year header updates, and
review-comment
  fixes to `KernelWriter` / `KernelWriterAssembly`.
### 7. CHANGELOG
Greatly improved MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

## How to use
Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new
`subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path
is
opt-in — non-MX and non-gfx950 kernels are unaffected.
## Backwards compatibility / risk
* All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing
  solutions on other architectures or non-MX paths are unchanged.
* `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO).
* New Origami yamls only add solutions; nothing existing is modified.
## Test plan
* New gtests + unit tests run automatically in CI (Tensilelite Python
  unit suite, `MXDataGen_test`, `MXScalePadding_test`).
* New end-to-end gfx950 GEMM and StreamK yamls are added to the common
  test buckets.
* Manual: run the gfx950 MXFP4 subtile suites
  (`pytest -k gfx950` after building Tensile, plus
  `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity).
## Notes for reviewers
* This branch was rebased onto current `develop` (post-#6499) by
skipping
  the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499
squash-merged) and replaying only the subtile-specific work as a single
  squashed commit. The actual code changes in this PR are limited to the
  files listed above (24 added, 56 modified; ~+170k / −2.6k including
  generated logic yamls).
* The largest reviewable diffs are:
*
`Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py`
(new files)
  * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py`
  * `Tensile/SolutionStructs/{Problem,Solution}.py`
  * `Tensile/Components/{GlobalWriteBatch,StreamK}.py`
  * `clients/common/include/testing_matmul.hpp`
  * `client/src/DataInitialization.cpp`

* Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi <brianshi@amd.com>

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- be69c1d: Enable post-loop code generation, and add some
subroutines ---
Author: b-shi <brianshi@amd.com>

--- 646d102: LR offset calculation (#2) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi <brianshi@amd.com>

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi <brianshi@amd.com>

--- b6127bc: Update GR/LR offset calculation to fully support 2x2,
1x4, 4x1 waveConfigs (#7) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi <brianshi@amd.com>

--- 6edf53d: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for
subtile-based kernel (#6) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi <brianshi@amd.com>

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi <brianshi@amd.com>

--- a45c20c: Fix MX scale tensor initialization: set
forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the
subtile-based kernel (#10) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16)
---
Author: b-shi <brianshi@amd.com>

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi <brianshi@amd.com>

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi <brianshi@amd.com>

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi <brianshi@amd.com>

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi <brianshi@amd.com>

* Enable FixSrd2 for A/B

* Address comments from PR

---------

--- e4c64a7: Add nt libs ---
Author: b-shi <brianshi@amd.com>

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for
unaligned problem sizes (#21) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Add scale padding

* Add tests

* Remove redundant pre-swizzle path

* Remove code from
conflict

* Fix reverted mxdatagen path for tensile tests

* Add diverse test cases for scale padding in MXScalePadding_test and
subtile.yaml
- Expanded test cases to include non-multiple-of-32, even
non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and
subtile_mxfp4.yaml (#22) ---
Author: James Newling <james.newling@gmail.com>

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

  # | Description        | Dest | MIs | PGR | DU      | SK  | Sizes
  --+--------------------+------+-----+-----+---------+-----+------
  0 | BF16 TN main       | b    |  19 |   0 | 64      | 0,3 |  11
  1 | BF16 TN large DU   | b    |   4 |   0 | 128,192 | 0,3 |   7
  2 | BSS (f32 output)   | s    |   6 |   0 | 64      | 0,3 |   9
  3 | BF16 bias          | b    |   2 |   0 | 64      | 0   |   1

FP4 coverage (subtile_mxfp4.yaml):

  # | Description        | Dest | MIs | PGR | DU  | SK  | Sizes | Status
--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
  4 | FP4 PGR=2          | b    |  13 |   2 | 256 | 0   |   5   | new
  5 | FP4 expanded MIWT  | b    |  24 |   0 | 256 | 0   |   5   | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures
(commented)

Run times on gfx950 (8x MI350X):

  File               | NEV=-1 | NEV=0
  -------------------+--------+------
  subtile_bf16.yaml  |    23s |   23s
  subtile_mxfp4.yaml |    37s |   40s

Where NEV is number of elements to validate. I (James) have checked
these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Revert to single partition

* Start using dependencies

* as is

* start using separate EmittedModules

* remove reduntant wait

* Add _extractPathsFromBeforeDeps

* Continue simplification

* Simplifying

* Add more rules

* cleanup

* Add fp4 test

* fix test

* Add tests

* Remove after field on emittedmodule

* Refactoring instructionSchedule

* Add comments

* cleanup modules vs ops

* Refactoring print functions

* Test cleanup

* Add more tests

* Replace subgroup by partition

* Remove unused unroll param

* Add high level notes

* Simplify NLL and NGLL GR removal

* Add some comments

* Force instruction insertion if no slots available

* Fix test after rebase

* Move scale before A/B and track inflight count

* Fine-grain vmcnt calculation

* Separate counts for scaleA and B

* Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar
instruction serialization

* Fix test

* Add vmcnt test

* Fix duplicated loads for 1x4 and 4x1

* Fix placement in reverse order

* Fix regression on PGR0

* add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi <brianshi@amd.com>

--- c5000d3: Fix typo ---
Author: b-shi <brianshi@amd.com>

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2
(#30) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check
for store path (#29) ---
Author: b-shi <brianshi@amd.com>

* [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard,
and refactoring

- Replace Size%MT edge check with subtile-aligned check: NonEdge paired
  store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take
NonEdge.
- Add per-wave OOB guard (subtileM32ValidBlocksSgpr /
subtileN16ValidBlocksSgpr)
  to skip stores outside valid M/N tile bounds in the NonEdge path.
- Refactor duplicated OOB guard into _emitSubtileOobGuard helper;
refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
- Fix orphan scalar store blockIdxM (was tt0, now
(tt0*MatrixInstM)//mBlockSize).
- Add quick-exit and edge/non-edge header comments to generated ASM.

* Add some bias tests, combine M/N guard to single routine

* Add OOB check for C loads, update storeD unit tests to check OOB,
simplify quick exit checks

* Address more PR comments: add M group skip, and skip to store end.
simplified loadC OOB mask

---------

--- 637881a: Fix unit tests & remove legacy code for subtile
interleaving (#33) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix gr_lr_roundtrip test

* Use non-interleaved version as ref code

* Fix scheduler test

* Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization
for batched GEMM (#25) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Fix bacth count issue

* Add batch count tests

* Fix bacth count issue

* Address PR review: clarify FP4-specific byte stride and add
non-aligned batched tests

- Updated comments on dataBatchBytes computation to clarify FP4 packing
  assumption (2 elements/byte) and flag that non-FP4 block-scaling types
  would require updating this conversion.
- Added batched test cases with non-multiple-of-32 M/N dimensions:
  FP4 DU=256: [48,48,2] and [33,65,2]
  FP4 DU=512: [63,63,2]
  BF16: [50,100,2]

---------

--- a43247b: Update some test yamls (#31) ---
Author: b-shi <brianshi@amd.com>

--- e2f69c8: Add f4bs origami library with activation function
support. Refactor sgpr allocation to reduce sgpr usage in post loop.
Store code-path reorganization (#32) ---
Author: b-shi <brianshi@amd.com>

* Free swap/localwritebase sgprs before post-loop

* Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32
(base, nta4, ntb4 variants).

* Remove uneeded alignment and comment

* Add more epilogue tests

* Remove older origami library for f4bs

* Reorder post-loop code blocks to after persistant loop Misc fixes

* Fix build issues, relax longjump sgpr requirements

* Fix GSU0 branch logic

---------

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and
FP4→F32 GEMM (#35) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

* Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

- Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4,
Origami_ntb4
- Update F4BS yaml files: AssertSummationElementMultiple 32→256 for
K%256 enforcement
- Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in
Problem.py

* Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):
- No-bias block: same wavetile and problem size coverage as F4SS
- Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

* Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

---------

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very
large MT (#36) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Initial support for DU>256

* Renaming

* add option to do DU=512 in the tests

* blocked K-major for scale

* Change scaleSet swap logic

* Update print functions

* Put scales after values for avoid race conditions

* Fix tests

* more test

* tweak printschedule display

* Add PGR2 in the yaml tests

* Add new scaleGROp

* comment out failing tests

* Revert "comment out failing tests"

This reverts commit 1f5802c.

* Draft new logical scheduler

* Refactoring

* Add more test on step1

* Add more tests on step1

* add bf16 320x320 test

* reduce step1 code

* Simplify step1 logic

* validate some step1 test

* Fix partition 2x2 test

* more step1 test

* 320x320 BF16 test

* Add test DU512 + partition2x2

* Simplify step1 code

* Add step2 tests

* Fix multi-partition step2

* Add step2 du512, 2x2 partition test

* Use common algo for all numPartitions

* Draft for step3 tests

* remove useless tests

* New GR algo (draft)

* [Step3] Add more test

* Iteration on GR

* Display ordered GR list with granularities

* More test

* Add some comments

* Disable by default debug logs

* Getting rid of step naming

* Start remove AnnotatedOp (still there in group pass)

* Split dependency Ops

* Add todo on place_GRs pass

* Valid test_annotate_deps_1x1_partition_DU256

* Test output looking better (still WIP)

* single dep for LR tooo

* Add remove_cross_deps pass

* Fix bugs in dependency pass

* insert_gr_lr_inc pass

* Add group_lr_gr pass

* Add emit pass

* Quick port of instruction Emit code

* Move emit function to separate file

* Refactoring instructionEmitter

* Port vgprTile tracking

* Reworking second pass (WIP)

* Display unrolling requirement

* Unrolling check on 2nd pass

* Generic validation for assign_vgpr pass

* Fix unroll

* Add inst schedule in standalone mode

* Use lrGran for vgprTile size calculation

* Fix bug in emit pass (missing depencency)

* PreMFMA path + non-duplication scale load

* missing globalReadLDSBufferSwap for GR_INC scales

* add wairlr_sync on all LR->GR dep

* add waitgr_sync op

* remove_unnecessary_gr_deps

* Change LR dispatch algo a bit to avoid too many waitgr_sync

* Avoid duplicated loads in emitter

* Fix bug on gr_emit code

* GrInc pass. fix duplicated insertion for B

* Fix missing LR_inc for SA/SB

* preloop, NLL, NGLL

* Simplify preloop

* minor changes

* Move unroll logic to scheduler

* minor changes

* Fix unroll id bug on NLL / NGLL

* Disable post GRINC for now

* Remove commented code

* Handle 1x4, 4x1 gr read gran

* Fix vmcnt computation

* Use correct grCount mapping

* Revert in emit logic on buffer_load for PGR0 needs

* Add bf16 version in standalone test

* Fix LR_Inc insertion on DU>64

* Add subIterK/Partition comment to codegen

* Fix issue in GrInc placement

* Remove last_mt

* Fix LR MT index bug with muli-partition

* Disable early LDS size check when subtileImpl is on

* Add pass to remove redundant LR deps + fixed issue on dependency
annotation pass

* Remove more LR redundant deps

* Only insert wait_lr_sync on deps

* Simple algo to select partition config

* Remove HC value for partitions...

* Take into account all inflight GR (all tensors)

* Fix tests and regressions on gr counts

* Fix grCount merge calculation

* Better display of dependencies

* Add remove_wait_lr_sync after grouping

* Add temporary non reg file

* Change merge logic on GR grouping pass

* Fix non necessary wait_lr_sync

* Downgrade some waitlr_sync to sync + added 384x256 no reg test

* non reg test 320x320

* Add larger MT

* non reg test for fp4 256x256

* Moving out instructionScheduler

* Remove old scheduler

* Renaming scheduler

* Re-work test

* Add larger MT test cases

* Rename non-ref test

* Re-add standalone mode

* Refactor DepOp

* Remove dead code

* Remove MFMATileSize class

* Remove from_til_info

* Avoid redundant tensor list creation

* Remove hardcode granularities in vgrpTile allocation pass. Simplify
code.

* Re-enable  # PGR=2 WG 4x1/1x4, K > DU tests

* Remove unused GRScaleOp

* DepRef renaming

* Get rid of MT string representation

* Remove TODO

* EmmitedModule simplication

* Use explicit pass dependencies

* Renaming LogicalScheduler

* Remove old test_InterleavingScheduler.py file

* Commenting failing test for now

* Remove debug logs

* Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix and simplify logic for remove_unnecessary_lr_deps

* Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale
layout (#42) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 9c74998: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 842b149: Addressed review comments for KernelWriter and
KernelWriterAssembly ---
Author: Koji Nakajima <knakajim@amd.com>

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- c075bbf: Fix preSolution CPU re-sync regressing
subtile_mxfp4.yaml ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich <brad.nemanich@amd.com>

--- bc2f6dd: Small update for gfx950 mx tests + more - enable
UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950
mxfp8 - use MXScaleFormat=1 as default - set
AssertSummationElementMultiple=256 for subtile mxfp4 - fix
isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi <brianshi@amd.com>

--- 727f8db: tensilelite: add solution reject conditions for
UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi <brianshi@amd.com>

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus,
update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- bbbc553: Update change log ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting -
skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima <knakajim@amd.com>

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi <brianshi@amd.com>

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima <knakajim@amd.com>

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima <knakajim@amd.com>

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi <brianshi@amd.com>

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima <knakajim@amd.com>

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
---
Author: Koji Nakajima <knakajim@amd.com>

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
(nta4,ntb4) ---
Author: Koji Nakajima <knakajim@amd.com>

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>
Alex-Vasile added a commit that referenced this pull request May 7, 2026
…rows in 8t9 audit

DZL_SCOPE_REASSESSMENT.md §1.1 SCC: adds source-level confirmation
that CommonInstruction (parent of every SCC-touching SOPx class) has
only dst/dst1/srcs fields, with getDstParams()/getSrcParams() bodies
shown verbatim from instruction.hpp:382. Rules out the q9j-style
"binding gap" hypothesis at the source level, not just by Python
dir() probing.

ROCISA_WORKAROUNDS_AUDIT.md: adds a "Status update (2026-05-07)"
preamble that flags which rows have been resolved or reframed by
subsequent investigations:
- Row #3 (_OPERAND_RULES, ~600 LoC estimate) → reassessed by q9j
  reassessment to ~12 lines C++ binding + ~260 LoC validator
  migration (the C++ API already exists).
- Rows #7, #8, #19 (VCC machinery) → REMOVED by uraq; no rocisa-side
  replacement planned. VCC dataflow tracking is permanently dropped
  from the validator's scope.
- MFMA acc special-casing → RECLASSIFIED to q9j Category A (acc and
  acc2 are already in the C++ getDstParams/getSrcParams partition).
- Category C narrowed from SCC/VCC/m0/acc to SCC/m0 only.

The audit's table and prose are preserved as the original artifact;
the preamble lets a reader find the current canonical reframing for
each affected row without rewriting the historical record.
bghimireamd added a commit that referenced this pull request May 8, 2026
…ty and CLI

Data Integrity checks #1-#3 tagged as *proposed* — none are implemented
today. CLI flags marked as not yet existing. Prevents readers from
confusing proposed behavior with current state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NolanHannaAMD added a commit that referenced this pull request May 8, 2026
…7008)

## Motivation

Several memory leaks were detected in MIOpen gtests using ASan. Some of
the tests were blacklisted and others were not. This change looks to fix
all of the low hanging fruit, which are the majority of the leaks found.
This includes all of the critical leaks (>100MB) that were reported.
Some other leaks were identified as needing a larger refactor to
resolve.

After fixing the w_supertensor.cpp leak, the supertensor tests hit
virtual memory area limit errors. These changes are to enable running
them with ASan, but only with a subset of the tests in order to not hit
limits. The test coverage being lost is fairly negligible and the full
tests are still run without ASan enabled.

## Technical Details

Here is a summary of the files looked at and the changes made:

| File | Status | Notes |
|------|--------|-------|
| `test/gtest/gpu_mha_forward.cpp` | Fixed | Added
`miopenDestroySolution()` loop after using solutions. Fixes MHA forward
solution descriptor leaks (report #11, #13). |
| `test/gtest/gpu_mha_backward.cpp` | Fixed | Added
`miopenDestroySolution()` loop after using solutions. Fixes MHA backward
solution descriptor leaks (report #9, #12). |
| `test/gtest/mha_find20.cpp` | Fixed | Added `miopenDestroySolution()`
loop in both `MhaForward` and `MhaBackward` tests. Fixes MHA Find2.0
solution leak (report #16). |
| `test/gtest/gtest_desc_guard.hpp` | Fixed | New shared header
introduced by the refactor. Provides a single `DescGuard<DescType,
CreateFn, DestroyFn>` template (with `TensorDescGuard`, `ConvDescGuard`,
`DropoutDescGuard`, `RNNDescGuard` aliases), a `HandleGuard` RAII
wrapper for `miopenHandle_t`, and the
`DestroyInternalRnnDropoutDesc(rnnDesc)` helper used by every
RNN/LSTM/GRU test to free the internal `DropoutDescriptor` that
`miopenCreateRNNDescriptor` allocates and `miopenSetRNNDescriptor*` then
leaks. Replaces the per-file ad-hoc guard structs from the initial
implementation. |
| `test/gtest/w_supertensor.cpp` | Fixed | Switched raw descriptors to
the shared `RNNDescGuard` / `TensorDescGuard` from
`gtest_desc_guard.hpp`. Added a class-local `DestroyDropoutDesc()`
(called from `TearDown` and before `miopenSetRNNDescriptor`) to prevent
the `miopenSetRNNDescriptor` overwrite leak. Reduced test parameter
space under ASan to avoid OOM. Removed unused `seqLen` parameter and
dead `param_dev_out`/`bias_dev_out` allocations. |
| `test/gtest/lstm.hpp` | Fixed | Switched `rnnDesc` → `RNNDescGuard`,
`DropoutDesc` → `DropoutDescGuard`, and `mio_handle` → `HandleGuard`
(which now owns the `miopenDestroy` call), all from the shared
`gtest_desc_guard.hpp`. Hoisted `dropout_state_buf` so it can be
`hipFree`d at the end of the dropout path. Added
`DestroyInternalRnnDropoutDesc(rnnDesc)` before
`miopenSetRNNDescriptor*` and (in the non-dropout path only) at end of
`Run`, which frees the internal `DropoutDescriptor` that the Set call
would otherwise leak. Fixes ~615 MB dropout leaks and ~16.5 KB
non-dropout descriptor leaks (report #3, #4, #6, #17-#22). |
| `test/gtest/gru_test.cpp` | Fixed | In the GRU test class: switched
`rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`,
`mio_handle` → `HandleGuard`, and added
`DestroyInternalRnnDropoutDesc(rnnDesc)` before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of `Run`
plus `hipFree(dropout_state_buf)` for the dropout path. In the in-file
`GRUFwdCPUVerify` / `GRUBwdDataCPUVerify` helpers: converted the raw
`dropout_inputTensor` / `dropout_outputTensor` declarations to
`TensorDescGuard` (mirroring the `cpu_rnn.hpp` change for the LSTM/RNN
helpers). |
| `test/gtest/softmax_find20.cpp` | Fixed | Changed `Finalize()` to take
the `std::vector<miopenSolution_t>&` and destroy each solution via
`miopenDestroySolution()` before destroying the problem. Updated all 6
`TEST(...)` callers to pass the solutions vector. Fixes Find2.0 softmax
solution/kernel leaks (report #25-#27). |
| `test/gtest/rnn_seq_api.hpp` | Fixed | Hoisted `dropout_state_buf` so
the dropout path can `hipFree` it at the end. Added
`DestroyInternalRnnDropoutDesc(&rnnDesc)` before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run
to free the internal `DropoutDescriptor` allocations leaked by
`RNNDescriptor` copy-assignment. Same pattern as the LSTM/GRU fixes
(report #29-#30). |
| `test/cpu_rnn.hpp` | Fixed | Converted 6 raw `miopenTensorDescriptor_t
dropout_input/outputTensor` declarations across the LSTM/RNN CPU
verification helpers (`LSTMFwdCPUVerify`, `LSTMBwdDataCPUVerify`,
`RNNFwdTrainCPUVerify`, `RNNBwdDataCPUVerify`, `GRUFwdCPUVerify`,
`GRUBwdDataCPUVerify`) to the shared `TensorDescGuard` from
`gtest_desc_guard.hpp`. Removed the redundant
`miopenCreateTensorDescriptor` calls and updated 12 `miopen::deref(...)`
sites to `.get()`. (Note: the GRU helpers in this header are stale
duplicates; the live ones are inside `test/gtest/gru_test.cpp` and were
updated there too.) Fixes the LSTM/GRU CPU-verify tensor descriptor
leaks (report #1, #2, #14, #15, #32, #33). |
| `test/gtest/rnn_vanilla_common.hpp` | Fixed | Added
`DestroyInternalRnnDropoutDesc(rnnDesc)` calls before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run.
The `RNNDescGuard` / `DropoutDescGuard` usage was already in place from
an earlier commit and now resolves to the shared definitions in
`gtest_desc_guard.hpp`. Same pattern as LSTM/GRU/rnn_seq_api fixes
(report #14, #15, #32, #33). |
| `test/gtest/graphapi_gtest_common.hpp` | Skipped | File no longer
exists in the codebase. The GraphApi test infrastructure has been
removed; only the leak report (a stale snapshot) still references it. No
fix possible against the current source tree. |
| `test/gtest/graphapi_execution_plan.cpp` | Skipped | File no longer
exists in the codebase (GraphApi removed). The leak it represented was
largely an external hipblaslt bug anyway; the remaining test-side
portion is not fixable against the current source. |
| `test/gtest/na_train.cpp` / `na_inference.cpp` / `na_*_find2.cpp` |
Skipped | Leaks come from the internal MIOpen solver/kernel cache living
on the global singleton handle, which is never destroyed. Not easily
fixable without redesigning global handle lifecycle. Should be
suppressed in the ASan suppression file. |
| hipblaslt / rocblaslt (external) | Skipped |
`SolutionCache::addKernel` and `preloadCustomKernels` leak via
`_rocblaslt_handle` constructor. Called from
`miopen::Handle::CreateHipblasLtHandle`. This is an upstream bug in
hipblaslt/rocblaslt, not fixable in MIOpen. Affects suites that
initialize a handle (report Category 2, Category 7). |
| CLR / HIP runtime (external) | Skipped | `amd::Context` and
`amd::roc::Device` global initialization leaks from
`rocclr/platform/context.cpp`. HIP runtime internals, not fixable in
MIOpen. |
| `src/hipoc/hipoc_program.cpp` | Skipped | `HIPOCProgramImpl` objects
leak during kernel compilation/caching (line ~178). This is an internal
MIOpen kernel cache lifecycle issue that requires deeper architectural
changes to fix. Contributes small amounts to MHA and Softmax Find2.0
leaks. |
| `test/gtest/conv_api.cpp` | Fixed | Already clean against current
source — `miopenDestroyConvolutionDescriptor(conv_desc)` call exists
(line 24) inside the test loop. ASAN run reports no leaks. The leak
report was based on a stale snapshot. The hipblaslt handle init portion
is tracked under the external-skipped row. |
| `test/gtest/log_test.cpp` (CPU_LOG_TEST_FUSION / CPU_LOG_TEST_NEG) |
Fixed | Already clean against current source — `Tensor`, `Conv`,
`CreateCBAFusionPlan`, `CreateBNormFusionPlan` all have proper
destructors that call the corresponding `miopenDestroy*` APIs in
`log.cpp`. ASAN run on `CPU_LOG_TEST_*` (11 tests across log_test.cpp +
log_test_neg.cpp) reports no leaks. The hipblaslt handle init portion is
tracked separately under the external-skipped row. |
| `test/gtest/fusion_test.cpp` (CPU_FusionCreateOpConvForward) | Fixed |
File renamed from `fusion.cpp` to `fusion_test.cpp`. Already clean
against current source — uses `TensorDescGuard`/`ConvDescGuard` for
tensor/conv descriptors and calls
`miopenDestroyFusionPlan(fusionPlanDesc)` on the fusion plan (line 195).
ASAN run on `CPU_FusionCreateOpConvForward_FP32.*` reports no leaks. |
| `test/gtest/deterministic_conv_api.cpp` | Fixed | Already clean
against current source — uses `ConvDescGuard` (line 66) for the conv
descriptor. ASAN run on `*CPU_DeterministicConvApi*` reports no leaks. |
| `test/gtest/fusion_aux.cpp` (GPU_FusionAux) | Fixed | Already clean
against current source — uses `ConvDescGuard` plus stack-allocated
internal C++ objects (`miopen::TensorDescriptor`,
`miopen::FusionPlanDescriptor`) which have proper destructors. The
`convoOp` handle is owned by the fusion plan. ASAN run on
`*GPU_FusionAux*` reports no leaks. |
| `test/gtest/backend_api.cpp` (CPU_BackendApi) | Skipped | File no
longer exists in the codebase. The backend API test infrastructure (part
of the removed GraphApi suite) was removed; no fix possible against the
current source tree. |


### High-level notes

New shared infrastructure (test/gtest/gtest_desc_guard.hpp)
- DescGuard<DescType, CreateFn, DestroyFn> — a single RAII template
parameterized on the descriptor type and its create/destroy entry
points. Aliases provide TensorDescGuard, ConvDescGuard,
DropoutDescGuard, and RNNDescGuard, replacing the four near-identical
guard structs
  that were copy-pasted across test files in the initial implementation.
- HandleGuard — separate RAII wrapper for miopenHandle_t (couldn't reuse
the template because miopenCreateWithStream takes an extra hipStream_t
argument). Supports lazy create(stream) so callers that only need a
handle in the dropout branch can default-construct one and
  populate it conditionally.
- DestroyInternalRnnDropoutDesc(rnnDesc) — frees the internal
DropoutDescriptor that miopenCreateRNNDescriptor allocates and that
miopenSetRNNDescriptor* then orphans. Replaces the equivalent inline
blocks that LSTM/GRU/RNN tests were each carrying. The header documents
the
two call-sites: before each Set* (always safe) and at end-of-run only on
the non-dropout path (the dropout path aliases the user-owned
descriptor, so freeing would double-free).

Recurring patterns enabled by the refactor
- The "leak from Set* overwriting the default-constructed internal
dropout descriptor" fix collapsed from per-file code to a one-line
helper call, applied uniformly across lstm.hpp, gru_test.cpp,
rnn_seq_api.hpp, and rnn_vanilla_common.hpp.
- mio_handle ownership in LSTM/GRU is now expressed via HandleGuard
rather than a manual miopenDestroy at the end of the dropout branch —
eliminates a class of forgotten-cleanup bugs.
- dropout_state_buf is consistently hoisted out of the dropout if block
so an end-of-run hipFree can release it; deletion of the buffer pairs
visibly with its allocation.

Notable non-RNN change
- softmax_find20.cpp was the only Find2.0 leak fix in this commit:
Finalize() now takes the solutions vector and calls
miopenDestroySolution() for each before destroying the problem. Same
shape applied to all 6 tests in the file.


## Test Plan

Run the tests beforehand to observe the ASan leak errors and then again
afterward to verify the fixes have resolved the problem.

## Test Result

List from ROCM-21512:

| # | Test Name | Status | Leak Status |

|---:|------------------------------------------------------|------------------------------------------------|-----------------|
| 1 | Smoke/GPU_RNNVanillaDropout_FP32 | PASSED (4 tests) | No leaks |
| 2 | Smoke/GPU_RNNVanillaDropout_FP16 | PASSED (4 tests) | No leaks |
| 3 | Full/GPU_LSTM_dropout_FP32 | PASSED (4 tests) | No leaks |
| 4 | Full/GPU_LSTM_dropout_FP16 | PASSED (4 tests) | No leaks |
| 5 | CPU_GraphApiExecutionPlanBuilder_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 6 | Full/GPU_LSTM_dropout_FP64 | REMOVED (PR #5750, 2026-03-26) | n/a
(deleted) |
| 7 | Unit/CPU_GraphApiPointwise_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 8 | Full/GPU_LstmMSRnn_FP32 | PASSED (1152 tests) | No leaks |
| 9 | Smoke/GPU_Bwd_Mha_FP32 | PASSED (12 tests) | No leaks |
| 10 | Full/GPU_LstmMSRnn_FP16 | PASSED (864 tests) | No leaks |
| 11 | Smoke/GPU_Fwd_Mha_FP32 | PASSED (15 tests) | No leaks |
| 12 | Full/GPU_Bwd_Mha_FP32 | PASSED (6 tests) | No leaks |
| 13 | Full/GPU_Fwd_Mha_FP32 | PASSED (7 tests) | No leaks |
| 14 | Full/GPU_RNNVanilla_FP32 | PASSED (96 tests) | No leaks |
| 15 | Full/GPU_RNNVanilla_FP16 | PASSED (96 tests) | No leaks |
| 16 | GPU_TestMhaFind20_FP32 | PASSED (2 tests) | No leaks |
| 17 | Full/GPU_LSTM_FP32 | PASSED (32 tests) | No leaks |
| 18 | Full/GPU_LSTM_FP16 | PASSED (32 tests) | No leaks |
| 19 | Full/GPU_LSTM_extra_FP32 | PASSED (30 tests) | No leaks |
| 20 | Full/GPU_LSTM_extra_FP16 | PASSED (30 tests) | No leaks |
| 21 | Full/GPU_DeepBench_LSTM_FP16 | PASSED (22 tests) | No leaks |
| 22 | Full/GPU_DeepBench_LSTM_FP32 | PASSED (22 tests) | No leaks |
| 23 | CPU_LOG_TEST_FUSION_NONE | PASSED (2 tests) | No leaks |
| 24 | CPU_LOG_TEST_NEG_NONE | PASSED (4 tests) | No leaks |
| 25 | GPU_SoftmaxFind20_BFP16 | PASSED (2 tests) | No leaks |
| 26 | GPU_SoftmaxFind20_FP16 | PASSED (2 tests) | No leaks |
| 27 | GPU_SoftmaxFind20_FP32 | PASSED (2 tests) | No leaks |
| 28 | CPU_ConvApi_NONE | PASSED (1 test) | No leaks |
| 29 | Full/GPU_RNNSeqApi_FP16 | PASSED (16 tests) | No leaks |
| 30 | Full/GPU_RNNSeqApi_FP32 | PASSED (16 tests) | No leaks |
| 31 | UnitVAN/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 32 | Smoke/GPU_RNNVanilla_FP16 | PASSED (4 tests) | No leaks |
| 33 | Smoke/GPU_RNNVanilla_FP32 | PASSED (4 tests) | No leaks |
| 34 | CPU_FusionCreateOpConvForward_FP32 | PASSED (1 test) | No leaks |
| 35 | CPU_GraphApiOperationReduction_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 36 | Unit2IV1/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 37 | Unit2IV1/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 38 | Unit2IV2/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 39 | Unit2IV2/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 40 | UnitVAB/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 41 | Smoke/GPU_RNNVanillaDropout_FP16 (duplicate of #2) | (see #2) |
(see #2) |
| 42 | CPU_GraphApiOperationGraphDescriptor_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 43 | UnitVA/CPU_GraphApiVariantPack_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 44 | UnitVAU/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 45 | CPU_GraphApiOperationReshape_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 46 | Smoke/CPU_DeterministicConvApi_NONE | PASSED (1 test) | No leaks
|
| 47 | Smoke/GPU_FusionAux_FP32 | PASSED (1 test) | No leaks |
| 48 | CPU_GraphApiEngineHeur_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 49 | Unit/CPU_GraphApiReduction_NONE | REMOVED (PR #5603, 2026-03-26)
| n/a (deleted) |
| 50 | CPU_GraphApiEngineCfg_NONE | REMOVED (PR #5603, 2026-03-26) | n/a
(deleted) |
| 51 | Unit/CPU_GraphApiMatMul_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 52 | CPU_BackendApi_NONE | REMOVED (PR #5603, 2026-03-26) | n/a
(deleted) |
| 53 | UnitIV/CPU_GraphApiOperationPointwiseOneInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 54 | Unit3IV/CPU_GraphApiOperationPointwiseThreeInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 55 | UnitVA/CPU_GraphApiOperationMatmul_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 56 | UnitVA/CPU_GraphApiOperationRng_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |


| Outcome                            | Count |
|------------------------------------|------:|
| Passed, no leaks                   |    39 |
| Passed, leaks detected             |     0 |
| Failed                             |     0 |
| Crashed / timed out                |     0 |
| Removed — GraphAPI purge (#5603)   |    15 |
| Removed — FP64 LSTM purge (#5750)  |     1 |
| Duplicate (not re-run)             |     1 |
| **Total rows**                     |  **56** |

## Risk Assessment
Low

---------

Co-authored-by: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>
aledudek pushed a commit that referenced this pull request May 20, 2026
## Motivation

Fork PRs fail CI when `RUN_AITER_TESTS` or `RUN_FA_TESTS` is enabled.
The docker scripts run `git clone -b "$CK_*_BRANCH"
https://github.com/ROCm/rocm-libraries.git`, but a fork's branch doesn't
exist upstream:

```
fatal: Remote branch <fork-branch> not found in upstream origin
```

Example: [PR #6529 build
#4](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6529/4/pipeline).

## Technical Details

**`Jenkinsfile`** — for PRs, use the upstream-visible PR ref instead of
the head branch name:

```groovy
CURRENT_BRANCH_NAME = env.CHANGE_ID
    ? "refs/pull/${env.CHANGE_ID}/head"
    : (env.CHANGE_BRANCH ? env.CHANGE_BRANCH : env.BRANCH_NAME)
```

**`Dockerfile.aiter` / `Dockerfile.fa`** — `git clone -b <ref>` only
accepts branches (`refs/heads/*`) and tags (`refs/tags/*`), so it can't
resolve `refs/pull/N/head`. Switch to `git fetch`, which accepts any
refspec (and still works for plain branch names):

```sh
mkdir rocm-libraries && cd rocm-libraries
git init -q
git remote add origin https://github.com/ROCm/rocm-libraries.git
git fetch --depth 1 --filter=blob:none origin "$CK_*_BRANCH"
git sparse-checkout init --cone
git sparse-checkout set projects/composablekernel
git checkout FETCH_HEAD
```

`git checkout FETCH_HEAD` lands in detached HEAD, which breaks the
existing `git branch -m "$CK_*_BRANCH"` (and that name isn't a valid
local branch anyway). Decouple the local branch name from the upstream
ref:

- Replace `git init` + `git branch -m` with `git init -b
"$LOCAL_BRANCH"` (requires git ≥ 2.28, satisfied by base images)
- `LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"` in the rocm-libraries
path; `LOCAL_BRANCH="$CK_*_BRANCH"` in the fallback
- Downstream `git clone -b ... ../ck` uses `$LOCAL_BRANCH`

## Test Plan

Manually trigger a build on this PR with `RUN_AITER_TESTS=true` and
`RUN_FA_TESTS=true`; both docker images should build end-to-end.

## Test Result
[jenkins / rocm-libraries-folder/Composable Kernel / PR-6701 /
#3](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6701/3/pipeline/)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
aledudek pushed a commit that referenced this pull request May 20, 2026
# Add gfx950 MXFP4 Subtile-based kernel implementation
## Summary
This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950
mxfp4)
and adds the **Subtile-based kernel implementation
(`UseSubtileImpl=1`)**
for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code
generation path optimized for **MXFP4** and **BF16** GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit
tests
needed to make it production-usable.
## Motivation
PR #6499 brought MX data type support online for gfx950, but the
existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
**greatly improves MXFP4 GEMM performance when using
`HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the
hipBLASLt CHANGELOG).
## What's included
### 1. New Subtile-based kernel components (Tensile)
New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`:
* `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of
  the subtile codegen path; replaces large portions of the standard
  prefetch / unroll / store flow when `UseSubtileImpl=1`.
* `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that
  builds the subtile-grained instruction graph (GR loads, LR offsets,
  MFMA tiles, scale loads, stores) from kernel parameters.
* `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the
logical
  schedule to an emit order respecting wave / register / hazard
  constraints.
* `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission
  helpers shared by the subtile components.
### 2. Kernel writer / common changes
* **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration
points
  for the subtile path — prefetch, GR offset calculation, LR offset
  calculation, post-loop, MFMA macro accounting, optimized `storeD`,
  LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr
  allocation / overflow guards, computeLoadSrd fix.
* **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**:
  introduces the `UseSubtileImpl` parameter, MX-related reject
  conditions for non-Subtile paths on gfx950, and additional valid GEMM
  type combinations for MX inputs.
* **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**,
  **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and
  defaults.
* **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import
  union with the `BufferLoadB32` cache-coherence change from #6837).
* **`Components/GlobalWriteBatch.py`**: optimized global write batching
  for the subtile path (~670 LOC of changes).
* **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**,
  **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**,
  **`KernelWriterModules.py`**: minor adjustments needed by the subtile
  pipeline.
### 3. rocisa / host / client
* **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new
  emitter.
* **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**:
  small fixups for the subtile path and gfx950 build.
* **`client/include/DataInitialization.hpp`**,
**`client/src/DataInitialization.cpp`**,
**`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**,
  **`client/include/TypedId.hpp`**: MX scale init and reference paths
  used by the new tests.
* **`clients/common/include/testing_matmul.hpp`**,
  **`clients/common/include/norm.hpp`**,
  **`clients/common/include/hipblaslt_datatype2string.hpp`**,
  **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1)
  testing and MX init.
### 4. Origami / solution selection (gfx950 MXFP4)
New auto-tuned logic yamls under

`projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/`
covering the FP4 SS / HS / BS variants in three layouts:
* `Origami/` (default)
* `Origami/Origami_nta4/` (no-transpose-A FP4)
* `Origami/Origami_ntb4/` (no-transpose-B FP4)
(9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml`
files in total.)
### 5. New tests
**End-to-end gfx950 GEMM yamls** in
`Tensile/Tests/common/gemm/gfx950/`:
* `subtile_bf16.yaml`, `subtile_mxfp4.yaml`
* `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
* `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
**StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
`sk_mx32f8_quick.yaml`.
**New unit tests** (`Tensile/Tests/unit/`):
* `test_SubtileBasedLogicalScheduler.py` (~1735 LOC)
* `test_SubtileBasedSchedulerRef.py` (~596 LOC)
* `test_gr_lr_roundtrip.py` (~571 LOC)
* `test_storeD_roundtrip.py` (~2420 LOC)
* `test_graTileAssignment.py` (~354 LOC)
* `test_lraTileAssignment.py` (~360 LOC)
* `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC)
**New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`.
### 6. Misc / hardening
* Reject conditions: gfx950 MX + non-Subtile, DepthU constraints,
GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
  for subtile MXFP4, missing-mxblock check for non-MX types.
* Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42).
* `forceDenorm=False` in `generateMXInput` (#11).
* Several rebase fixes, copyright/year header updates, and
review-comment
  fixes to `KernelWriter` / `KernelWriterAssembly`.
### 7. CHANGELOG
Greatly improved MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

## How to use
Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new
`subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path
is
opt-in — non-MX and non-gfx950 kernels are unaffected.
## Backwards compatibility / risk
* All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing
  solutions on other architectures or non-MX paths are unchanged.
* `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO).
* New Origami yamls only add solutions; nothing existing is modified.
## Test plan
* New gtests + unit tests run automatically in CI (Tensilelite Python
  unit suite, `MXDataGen_test`, `MXScalePadding_test`).
* New end-to-end gfx950 GEMM and StreamK yamls are added to the common
  test buckets.
* Manual: run the gfx950 MXFP4 subtile suites
  (`pytest -k gfx950` after building Tensile, plus
  `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity).
## Notes for reviewers
* This branch was rebased onto current `develop` (post-#6499) by
skipping
  the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499
squash-merged) and replaying only the subtile-specific work as a single
  squashed commit. The actual code changes in this PR are limited to the
  files listed above (24 added, 56 modified; ~+170k / −2.6k including
  generated logic yamls).
* The largest reviewable diffs are:
*
`Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py`
(new files)
  * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py`
  * `Tensile/SolutionStructs/{Problem,Solution}.py`
  * `Tensile/Components/{GlobalWriteBatch,StreamK}.py`
  * `clients/common/include/testing_matmul.hpp`
  * `client/src/DataInitialization.cpp`

* Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi <brianshi@amd.com>

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- be69c1d: Enable post-loop code generation, and add some
subroutines ---
Author: b-shi <brianshi@amd.com>

--- 646d102: LR offset calculation (#2) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi <brianshi@amd.com>

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi <brianshi@amd.com>

--- b6127bc: Update GR/LR offset calculation to fully support 2x2,
1x4, 4x1 waveConfigs (#7) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi <brianshi@amd.com>

--- 6edf53d: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for
subtile-based kernel (#6) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi <brianshi@amd.com>

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi <brianshi@amd.com>

--- a45c20c: Fix MX scale tensor initialization: set
forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the
subtile-based kernel (#10) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16)
---
Author: b-shi <brianshi@amd.com>

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi <brianshi@amd.com>

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi <brianshi@amd.com>

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi <brianshi@amd.com>

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi <brianshi@amd.com>

* Enable FixSrd2 for A/B

* Address comments from PR

---------

--- e4c64a7: Add nt libs ---
Author: b-shi <brianshi@amd.com>

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for
unaligned problem sizes (#21) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Add scale padding

* Add tests

* Remove redundant pre-swizzle path

* Remove code from
conflict

* Fix reverted mxdatagen path for tensile tests

* Add diverse test cases for scale padding in MXScalePadding_test and
subtile.yaml
- Expanded test cases to include non-multiple-of-32, even
non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and
subtile_mxfp4.yaml (#22) ---
Author: James Newling <james.newling@gmail.com>

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

  # | Description        | Dest | MIs | PGR | DU      | SK  | Sizes
  --+--------------------+------+-----+-----+---------+-----+------
  0 | BF16 TN main       | b    |  19 |   0 | 64      | 0,3 |  11
  1 | BF16 TN large DU   | b    |   4 |   0 | 128,192 | 0,3 |   7
  2 | BSS (f32 output)   | s    |   6 |   0 | 64      | 0,3 |   9
  3 | BF16 bias          | b    |   2 |   0 | 64      | 0   |   1

FP4 coverage (subtile_mxfp4.yaml):

  # | Description        | Dest | MIs | PGR | DU  | SK  | Sizes | Status
--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
  4 | FP4 PGR=2          | b    |  13 |   2 | 256 | 0   |   5   | new
  5 | FP4 expanded MIWT  | b    |  24 |   0 | 256 | 0   |   5   | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures
(commented)

Run times on gfx950 (8x MI350X):

  File               | NEV=-1 | NEV=0
  -------------------+--------+------
  subtile_bf16.yaml  |    23s |   23s
  subtile_mxfp4.yaml |    37s |   40s

Where NEV is number of elements to validate. I (James) have checked
these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Revert to single partition

* Start using dependencies

* as is

* start using separate EmittedModules

* remove reduntant wait

* Add _extractPathsFromBeforeDeps

* Continue simplification

* Simplifying

* Add more rules

* cleanup

* Add fp4 test

* fix test

* Add tests

* Remove after field on emittedmodule

* Refactoring instructionSchedule

* Add comments

* cleanup modules vs ops

* Refactoring print functions

* Test cleanup

* Add more tests

* Replace subgroup by partition

* Remove unused unroll param

* Add high level notes

* Simplify NLL and NGLL GR removal

* Add some comments

* Force instruction insertion if no slots available

* Fix test after rebase

* Move scale before A/B and track inflight count

* Fine-grain vmcnt calculation

* Separate counts for scaleA and B

* Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar
instruction serialization

* Fix test

* Add vmcnt test

* Fix duplicated loads for 1x4 and 4x1

* Fix placement in reverse order

* Fix regression on PGR0

* add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi <brianshi@amd.com>

--- c5000d3: Fix typo ---
Author: b-shi <brianshi@amd.com>

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2
(#30) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check
for store path (#29) ---
Author: b-shi <brianshi@amd.com>

* [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard,
and refactoring

- Replace Size%MT edge check with subtile-aligned check: NonEdge paired
  store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take
NonEdge.
- Add per-wave OOB guard (subtileM32ValidBlocksSgpr /
subtileN16ValidBlocksSgpr)
  to skip stores outside valid M/N tile bounds in the NonEdge path.
- Refactor duplicated OOB guard into _emitSubtileOobGuard helper;
refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
- Fix orphan scalar store blockIdxM (was tt0, now
(tt0*MatrixInstM)//mBlockSize).
- Add quick-exit and edge/non-edge header comments to generated ASM.

* Add some bias tests, combine M/N guard to single routine

* Add OOB check for C loads, update storeD unit tests to check OOB,
simplify quick exit checks

* Address more PR comments: add M group skip, and skip to store end.
simplified loadC OOB mask

---------

--- 637881a: Fix unit tests & remove legacy code for subtile
interleaving (#33) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix gr_lr_roundtrip test

* Use non-interleaved version as ref code

* Fix scheduler test

* Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization
for batched GEMM (#25) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Fix bacth count issue

* Add batch count tests

* Fix bacth count issue

* Address PR review: clarify FP4-specific byte stride and add
non-aligned batched tests

- Updated comments on dataBatchBytes computation to clarify FP4 packing
  assumption (2 elements/byte) and flag that non-FP4 block-scaling types
  would require updating this conversion.
- Added batched test cases with non-multiple-of-32 M/N dimensions:
  FP4 DU=256: [48,48,2] and [33,65,2]
  FP4 DU=512: [63,63,2]
  BF16: [50,100,2]

---------

--- a43247b: Update some test yamls (#31) ---
Author: b-shi <brianshi@amd.com>

--- e2f69c8: Add f4bs origami library with activation function
support. Refactor sgpr allocation to reduce sgpr usage in post loop.
Store code-path reorganization (#32) ---
Author: b-shi <brianshi@amd.com>

* Free swap/localwritebase sgprs before post-loop

* Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32
(base, nta4, ntb4 variants).

* Remove uneeded alignment and comment

* Add more epilogue tests

* Remove older origami library for f4bs

* Reorder post-loop code blocks to after persistant loop Misc fixes

* Fix build issues, relax longjump sgpr requirements

* Fix GSU0 branch logic

---------

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and
FP4→F32 GEMM (#35) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

* Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

- Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4,
Origami_ntb4
- Update F4BS yaml files: AssertSummationElementMultiple 32→256 for
K%256 enforcement
- Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in
Problem.py

* Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):
- No-bias block: same wavetile and problem size coverage as F4SS
- Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

* Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

---------

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very
large MT (#36) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Initial support for DU>256

* Renaming

* add option to do DU=512 in the tests

* blocked K-major for scale

* Change scaleSet swap logic

* Update print functions

* Put scales after values for avoid race conditions

* Fix tests

* more test

* tweak printschedule display

* Add PGR2 in the yaml tests

* Add new scaleGROp

* comment out failing tests

* Revert "comment out failing tests"

This reverts commit 1f5802c.

* Draft new logical scheduler

* Refactoring

* Add more test on step1

* Add more tests on step1

* add bf16 320x320 test

* reduce step1 code

* Simplify step1 logic

* validate some step1 test

* Fix partition 2x2 test

* more step1 test

* 320x320 BF16 test

* Add test DU512 + partition2x2

* Simplify step1 code

* Add step2 tests

* Fix multi-partition step2

* Add step2 du512, 2x2 partition test

* Use common algo for all numPartitions

* Draft for step3 tests

* remove useless tests

* New GR algo (draft)

* [Step3] Add more test

* Iteration on GR

* Display ordered GR list with granularities

* More test

* Add some comments

* Disable by default debug logs

* Getting rid of step naming

* Start remove AnnotatedOp (still there in group pass)

* Split dependency Ops

* Add todo on place_GRs pass

* Valid test_annotate_deps_1x1_partition_DU256

* Test output looking better (still WIP)

* single dep for LR tooo

* Add remove_cross_deps pass

* Fix bugs in dependency pass

* insert_gr_lr_inc pass

* Add group_lr_gr pass

* Add emit pass

* Quick port of instruction Emit code

* Move emit function to separate file

* Refactoring instructionEmitter

* Port vgprTile tracking

* Reworking second pass (WIP)

* Display unrolling requirement

* Unrolling check on 2nd pass

* Generic validation for assign_vgpr pass

* Fix unroll

* Add inst schedule in standalone mode

* Use lrGran for vgprTile size calculation

* Fix bug in emit pass (missing depencency)

* PreMFMA path + non-duplication scale load

* missing globalReadLDSBufferSwap for GR_INC scales

* add wairlr_sync on all LR->GR dep

* add waitgr_sync op

* remove_unnecessary_gr_deps

* Change LR dispatch algo a bit to avoid too many waitgr_sync

* Avoid duplicated loads in emitter

* Fix bug on gr_emit code

* GrInc pass. fix duplicated insertion for B

* Fix missing LR_inc for SA/SB

* preloop, NLL, NGLL

* Simplify preloop

* minor changes

* Move unroll logic to scheduler

* minor changes

* Fix unroll id bug on NLL / NGLL

* Disable post GRINC for now

* Remove commented code

* Handle 1x4, 4x1 gr read gran

* Fix vmcnt computation

* Use correct grCount mapping

* Revert in emit logic on buffer_load for PGR0 needs

* Add bf16 version in standalone test

* Fix LR_Inc insertion on DU>64

* Add subIterK/Partition comment to codegen

* Fix issue in GrInc placement

* Remove last_mt

* Fix LR MT index bug with muli-partition

* Disable early LDS size check when subtileImpl is on

* Add pass to remove redundant LR deps + fixed issue on dependency
annotation pass

* Remove more LR redundant deps

* Only insert wait_lr_sync on deps

* Simple algo to select partition config

* Remove HC value for partitions...

* Take into account all inflight GR (all tensors)

* Fix tests and regressions on gr counts

* Fix grCount merge calculation

* Better display of dependencies

* Add remove_wait_lr_sync after grouping

* Add temporary non reg file

* Change merge logic on GR grouping pass

* Fix non necessary wait_lr_sync

* Downgrade some waitlr_sync to sync + added 384x256 no reg test

* non reg test 320x320

* Add larger MT

* non reg test for fp4 256x256

* Moving out instructionScheduler

* Remove old scheduler

* Renaming scheduler

* Re-work test

* Add larger MT test cases

* Rename non-ref test

* Re-add standalone mode

* Refactor DepOp

* Remove dead code

* Remove MFMATileSize class

* Remove from_til_info

* Avoid redundant tensor list creation

* Remove hardcode granularities in vgrpTile allocation pass. Simplify
code.

* Re-enable  # PGR=2 WG 4x1/1x4, K > DU tests

* Remove unused GRScaleOp

* DepRef renaming

* Get rid of MT string representation

* Remove TODO

* EmmitedModule simplication

* Use explicit pass dependencies

* Renaming LogicalScheduler

* Remove old test_InterleavingScheduler.py file

* Commenting failing test for now

* Remove debug logs

* Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix and simplify logic for remove_unnecessary_lr_deps

* Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale
layout (#42) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 9c74998: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 842b149: Addressed review comments for KernelWriter and
KernelWriterAssembly ---
Author: Koji Nakajima <knakajim@amd.com>

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- c075bbf: Fix preSolution CPU re-sync regressing
subtile_mxfp4.yaml ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich <brad.nemanich@amd.com>

--- bc2f6dd: Small update for gfx950 mx tests + more - enable
UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950
mxfp8 - use MXScaleFormat=1 as default - set
AssertSummationElementMultiple=256 for subtile mxfp4 - fix
isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi <brianshi@amd.com>

--- 727f8db: tensilelite: add solution reject conditions for
UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi <brianshi@amd.com>

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus,
update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- bbbc553: Update change log ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting -
skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima <knakajim@amd.com>

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi <brianshi@amd.com>

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima <knakajim@amd.com>

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima <knakajim@amd.com>

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi <brianshi@amd.com>

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima <knakajim@amd.com>

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
---
Author: Koji Nakajima <knakajim@amd.com>

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
(nta4,ntb4) ---
Author: Koji Nakajima <knakajim@amd.com>

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>
aledudek pushed a commit that referenced this pull request May 20, 2026
…7008)

## Motivation

Several memory leaks were detected in MIOpen gtests using ASan. Some of
the tests were blacklisted and others were not. This change looks to fix
all of the low hanging fruit, which are the majority of the leaks found.
This includes all of the critical leaks (>100MB) that were reported.
Some other leaks were identified as needing a larger refactor to
resolve.

After fixing the w_supertensor.cpp leak, the supertensor tests hit
virtual memory area limit errors. These changes are to enable running
them with ASan, but only with a subset of the tests in order to not hit
limits. The test coverage being lost is fairly negligible and the full
tests are still run without ASan enabled.

## Technical Details

Here is a summary of the files looked at and the changes made:

| File | Status | Notes |
|------|--------|-------|
| `test/gtest/gpu_mha_forward.cpp` | Fixed | Added
`miopenDestroySolution()` loop after using solutions. Fixes MHA forward
solution descriptor leaks (report #11, #13). |
| `test/gtest/gpu_mha_backward.cpp` | Fixed | Added
`miopenDestroySolution()` loop after using solutions. Fixes MHA backward
solution descriptor leaks (report #9, #12). |
| `test/gtest/mha_find20.cpp` | Fixed | Added `miopenDestroySolution()`
loop in both `MhaForward` and `MhaBackward` tests. Fixes MHA Find2.0
solution leak (report #16). |
| `test/gtest/gtest_desc_guard.hpp` | Fixed | New shared header
introduced by the refactor. Provides a single `DescGuard<DescType,
CreateFn, DestroyFn>` template (with `TensorDescGuard`, `ConvDescGuard`,
`DropoutDescGuard`, `RNNDescGuard` aliases), a `HandleGuard` RAII
wrapper for `miopenHandle_t`, and the
`DestroyInternalRnnDropoutDesc(rnnDesc)` helper used by every
RNN/LSTM/GRU test to free the internal `DropoutDescriptor` that
`miopenCreateRNNDescriptor` allocates and `miopenSetRNNDescriptor*` then
leaks. Replaces the per-file ad-hoc guard structs from the initial
implementation. |
| `test/gtest/w_supertensor.cpp` | Fixed | Switched raw descriptors to
the shared `RNNDescGuard` / `TensorDescGuard` from
`gtest_desc_guard.hpp`. Added a class-local `DestroyDropoutDesc()`
(called from `TearDown` and before `miopenSetRNNDescriptor`) to prevent
the `miopenSetRNNDescriptor` overwrite leak. Reduced test parameter
space under ASan to avoid OOM. Removed unused `seqLen` parameter and
dead `param_dev_out`/`bias_dev_out` allocations. |
| `test/gtest/lstm.hpp` | Fixed | Switched `rnnDesc` → `RNNDescGuard`,
`DropoutDesc` → `DropoutDescGuard`, and `mio_handle` → `HandleGuard`
(which now owns the `miopenDestroy` call), all from the shared
`gtest_desc_guard.hpp`. Hoisted `dropout_state_buf` so it can be
`hipFree`d at the end of the dropout path. Added
`DestroyInternalRnnDropoutDesc(rnnDesc)` before
`miopenSetRNNDescriptor*` and (in the non-dropout path only) at end of
`Run`, which frees the internal `DropoutDescriptor` that the Set call
would otherwise leak. Fixes ~615 MB dropout leaks and ~16.5 KB
non-dropout descriptor leaks (report #3, #4, #6, #17-#22). |
| `test/gtest/gru_test.cpp` | Fixed | In the GRU test class: switched
`rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`,
`mio_handle` → `HandleGuard`, and added
`DestroyInternalRnnDropoutDesc(rnnDesc)` before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of `Run`
plus `hipFree(dropout_state_buf)` for the dropout path. In the in-file
`GRUFwdCPUVerify` / `GRUBwdDataCPUVerify` helpers: converted the raw
`dropout_inputTensor` / `dropout_outputTensor` declarations to
`TensorDescGuard` (mirroring the `cpu_rnn.hpp` change for the LSTM/RNN
helpers). |
| `test/gtest/softmax_find20.cpp` | Fixed | Changed `Finalize()` to take
the `std::vector<miopenSolution_t>&` and destroy each solution via
`miopenDestroySolution()` before destroying the problem. Updated all 6
`TEST(...)` callers to pass the solutions vector. Fixes Find2.0 softmax
solution/kernel leaks (report #25-#27). |
| `test/gtest/rnn_seq_api.hpp` | Fixed | Hoisted `dropout_state_buf` so
the dropout path can `hipFree` it at the end. Added
`DestroyInternalRnnDropoutDesc(&rnnDesc)` before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run
to free the internal `DropoutDescriptor` allocations leaked by
`RNNDescriptor` copy-assignment. Same pattern as the LSTM/GRU fixes
(report #29-#30). |
| `test/cpu_rnn.hpp` | Fixed | Converted 6 raw `miopenTensorDescriptor_t
dropout_input/outputTensor` declarations across the LSTM/RNN CPU
verification helpers (`LSTMFwdCPUVerify`, `LSTMBwdDataCPUVerify`,
`RNNFwdTrainCPUVerify`, `RNNBwdDataCPUVerify`, `GRUFwdCPUVerify`,
`GRUBwdDataCPUVerify`) to the shared `TensorDescGuard` from
`gtest_desc_guard.hpp`. Removed the redundant
`miopenCreateTensorDescriptor` calls and updated 12 `miopen::deref(...)`
sites to `.get()`. (Note: the GRU helpers in this header are stale
duplicates; the live ones are inside `test/gtest/gru_test.cpp` and were
updated there too.) Fixes the LSTM/GRU CPU-verify tensor descriptor
leaks (report #1, #2, #14, #15, #32, #33). |
| `test/gtest/rnn_vanilla_common.hpp` | Fixed | Added
`DestroyInternalRnnDropoutDesc(rnnDesc)` calls before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run.
The `RNNDescGuard` / `DropoutDescGuard` usage was already in place from
an earlier commit and now resolves to the shared definitions in
`gtest_desc_guard.hpp`. Same pattern as LSTM/GRU/rnn_seq_api fixes
(report #14, #15, #32, #33). |
| `test/gtest/graphapi_gtest_common.hpp` | Skipped | File no longer
exists in the codebase. The GraphApi test infrastructure has been
removed; only the leak report (a stale snapshot) still references it. No
fix possible against the current source tree. |
| `test/gtest/graphapi_execution_plan.cpp` | Skipped | File no longer
exists in the codebase (GraphApi removed). The leak it represented was
largely an external hipblaslt bug anyway; the remaining test-side
portion is not fixable against the current source. |
| `test/gtest/na_train.cpp` / `na_inference.cpp` / `na_*_find2.cpp` |
Skipped | Leaks come from the internal MIOpen solver/kernel cache living
on the global singleton handle, which is never destroyed. Not easily
fixable without redesigning global handle lifecycle. Should be
suppressed in the ASan suppression file. |
| hipblaslt / rocblaslt (external) | Skipped |
`SolutionCache::addKernel` and `preloadCustomKernels` leak via
`_rocblaslt_handle` constructor. Called from
`miopen::Handle::CreateHipblasLtHandle`. This is an upstream bug in
hipblaslt/rocblaslt, not fixable in MIOpen. Affects suites that
initialize a handle (report Category 2, Category 7). |
| CLR / HIP runtime (external) | Skipped | `amd::Context` and
`amd::roc::Device` global initialization leaks from
`rocclr/platform/context.cpp`. HIP runtime internals, not fixable in
MIOpen. |
| `src/hipoc/hipoc_program.cpp` | Skipped | `HIPOCProgramImpl` objects
leak during kernel compilation/caching (line ~178). This is an internal
MIOpen kernel cache lifecycle issue that requires deeper architectural
changes to fix. Contributes small amounts to MHA and Softmax Find2.0
leaks. |
| `test/gtest/conv_api.cpp` | Fixed | Already clean against current
source — `miopenDestroyConvolutionDescriptor(conv_desc)` call exists
(line 24) inside the test loop. ASAN run reports no leaks. The leak
report was based on a stale snapshot. The hipblaslt handle init portion
is tracked under the external-skipped row. |
| `test/gtest/log_test.cpp` (CPU_LOG_TEST_FUSION / CPU_LOG_TEST_NEG) |
Fixed | Already clean against current source — `Tensor`, `Conv`,
`CreateCBAFusionPlan`, `CreateBNormFusionPlan` all have proper
destructors that call the corresponding `miopenDestroy*` APIs in
`log.cpp`. ASAN run on `CPU_LOG_TEST_*` (11 tests across log_test.cpp +
log_test_neg.cpp) reports no leaks. The hipblaslt handle init portion is
tracked separately under the external-skipped row. |
| `test/gtest/fusion_test.cpp` (CPU_FusionCreateOpConvForward) | Fixed |
File renamed from `fusion.cpp` to `fusion_test.cpp`. Already clean
against current source — uses `TensorDescGuard`/`ConvDescGuard` for
tensor/conv descriptors and calls
`miopenDestroyFusionPlan(fusionPlanDesc)` on the fusion plan (line 195).
ASAN run on `CPU_FusionCreateOpConvForward_FP32.*` reports no leaks. |
| `test/gtest/deterministic_conv_api.cpp` | Fixed | Already clean
against current source — uses `ConvDescGuard` (line 66) for the conv
descriptor. ASAN run on `*CPU_DeterministicConvApi*` reports no leaks. |
| `test/gtest/fusion_aux.cpp` (GPU_FusionAux) | Fixed | Already clean
against current source — uses `ConvDescGuard` plus stack-allocated
internal C++ objects (`miopen::TensorDescriptor`,
`miopen::FusionPlanDescriptor`) which have proper destructors. The
`convoOp` handle is owned by the fusion plan. ASAN run on
`*GPU_FusionAux*` reports no leaks. |
| `test/gtest/backend_api.cpp` (CPU_BackendApi) | Skipped | File no
longer exists in the codebase. The backend API test infrastructure (part
of the removed GraphApi suite) was removed; no fix possible against the
current source tree. |


### High-level notes

New shared infrastructure (test/gtest/gtest_desc_guard.hpp)
- DescGuard<DescType, CreateFn, DestroyFn> — a single RAII template
parameterized on the descriptor type and its create/destroy entry
points. Aliases provide TensorDescGuard, ConvDescGuard,
DropoutDescGuard, and RNNDescGuard, replacing the four near-identical
guard structs
  that were copy-pasted across test files in the initial implementation.
- HandleGuard — separate RAII wrapper for miopenHandle_t (couldn't reuse
the template because miopenCreateWithStream takes an extra hipStream_t
argument). Supports lazy create(stream) so callers that only need a
handle in the dropout branch can default-construct one and
  populate it conditionally.
- DestroyInternalRnnDropoutDesc(rnnDesc) — frees the internal
DropoutDescriptor that miopenCreateRNNDescriptor allocates and that
miopenSetRNNDescriptor* then orphans. Replaces the equivalent inline
blocks that LSTM/GRU/RNN tests were each carrying. The header documents
the
two call-sites: before each Set* (always safe) and at end-of-run only on
the non-dropout path (the dropout path aliases the user-owned
descriptor, so freeing would double-free).

Recurring patterns enabled by the refactor
- The "leak from Set* overwriting the default-constructed internal
dropout descriptor" fix collapsed from per-file code to a one-line
helper call, applied uniformly across lstm.hpp, gru_test.cpp,
rnn_seq_api.hpp, and rnn_vanilla_common.hpp.
- mio_handle ownership in LSTM/GRU is now expressed via HandleGuard
rather than a manual miopenDestroy at the end of the dropout branch —
eliminates a class of forgotten-cleanup bugs.
- dropout_state_buf is consistently hoisted out of the dropout if block
so an end-of-run hipFree can release it; deletion of the buffer pairs
visibly with its allocation.

Notable non-RNN change
- softmax_find20.cpp was the only Find2.0 leak fix in this commit:
Finalize() now takes the solutions vector and calls
miopenDestroySolution() for each before destroying the problem. Same
shape applied to all 6 tests in the file.


## Test Plan

Run the tests beforehand to observe the ASan leak errors and then again
afterward to verify the fixes have resolved the problem.

## Test Result

List from ROCM-21512:

| # | Test Name | Status | Leak Status |

|---:|------------------------------------------------------|------------------------------------------------|-----------------|
| 1 | Smoke/GPU_RNNVanillaDropout_FP32 | PASSED (4 tests) | No leaks |
| 2 | Smoke/GPU_RNNVanillaDropout_FP16 | PASSED (4 tests) | No leaks |
| 3 | Full/GPU_LSTM_dropout_FP32 | PASSED (4 tests) | No leaks |
| 4 | Full/GPU_LSTM_dropout_FP16 | PASSED (4 tests) | No leaks |
| 5 | CPU_GraphApiExecutionPlanBuilder_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 6 | Full/GPU_LSTM_dropout_FP64 | REMOVED (PR #5750, 2026-03-26) | n/a
(deleted) |
| 7 | Unit/CPU_GraphApiPointwise_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 8 | Full/GPU_LstmMSRnn_FP32 | PASSED (1152 tests) | No leaks |
| 9 | Smoke/GPU_Bwd_Mha_FP32 | PASSED (12 tests) | No leaks |
| 10 | Full/GPU_LstmMSRnn_FP16 | PASSED (864 tests) | No leaks |
| 11 | Smoke/GPU_Fwd_Mha_FP32 | PASSED (15 tests) | No leaks |
| 12 | Full/GPU_Bwd_Mha_FP32 | PASSED (6 tests) | No leaks |
| 13 | Full/GPU_Fwd_Mha_FP32 | PASSED (7 tests) | No leaks |
| 14 | Full/GPU_RNNVanilla_FP32 | PASSED (96 tests) | No leaks |
| 15 | Full/GPU_RNNVanilla_FP16 | PASSED (96 tests) | No leaks |
| 16 | GPU_TestMhaFind20_FP32 | PASSED (2 tests) | No leaks |
| 17 | Full/GPU_LSTM_FP32 | PASSED (32 tests) | No leaks |
| 18 | Full/GPU_LSTM_FP16 | PASSED (32 tests) | No leaks |
| 19 | Full/GPU_LSTM_extra_FP32 | PASSED (30 tests) | No leaks |
| 20 | Full/GPU_LSTM_extra_FP16 | PASSED (30 tests) | No leaks |
| 21 | Full/GPU_DeepBench_LSTM_FP16 | PASSED (22 tests) | No leaks |
| 22 | Full/GPU_DeepBench_LSTM_FP32 | PASSED (22 tests) | No leaks |
| 23 | CPU_LOG_TEST_FUSION_NONE | PASSED (2 tests) | No leaks |
| 24 | CPU_LOG_TEST_NEG_NONE | PASSED (4 tests) | No leaks |
| 25 | GPU_SoftmaxFind20_BFP16 | PASSED (2 tests) | No leaks |
| 26 | GPU_SoftmaxFind20_FP16 | PASSED (2 tests) | No leaks |
| 27 | GPU_SoftmaxFind20_FP32 | PASSED (2 tests) | No leaks |
| 28 | CPU_ConvApi_NONE | PASSED (1 test) | No leaks |
| 29 | Full/GPU_RNNSeqApi_FP16 | PASSED (16 tests) | No leaks |
| 30 | Full/GPU_RNNSeqApi_FP32 | PASSED (16 tests) | No leaks |
| 31 | UnitVAN/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 32 | Smoke/GPU_RNNVanilla_FP16 | PASSED (4 tests) | No leaks |
| 33 | Smoke/GPU_RNNVanilla_FP32 | PASSED (4 tests) | No leaks |
| 34 | CPU_FusionCreateOpConvForward_FP32 | PASSED (1 test) | No leaks |
| 35 | CPU_GraphApiOperationReduction_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 36 | Unit2IV1/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 37 | Unit2IV1/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 38 | Unit2IV2/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 39 | Unit2IV2/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 40 | UnitVAB/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 41 | Smoke/GPU_RNNVanillaDropout_FP16 (duplicate of #2) | (see #2) |
(see #2) |
| 42 | CPU_GraphApiOperationGraphDescriptor_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 43 | UnitVA/CPU_GraphApiVariantPack_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 44 | UnitVAU/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 45 | CPU_GraphApiOperationReshape_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 46 | Smoke/CPU_DeterministicConvApi_NONE | PASSED (1 test) | No leaks
|
| 47 | Smoke/GPU_FusionAux_FP32 | PASSED (1 test) | No leaks |
| 48 | CPU_GraphApiEngineHeur_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 49 | Unit/CPU_GraphApiReduction_NONE | REMOVED (PR #5603, 2026-03-26)
| n/a (deleted) |
| 50 | CPU_GraphApiEngineCfg_NONE | REMOVED (PR #5603, 2026-03-26) | n/a
(deleted) |
| 51 | Unit/CPU_GraphApiMatMul_NONE | REMOVED (PR #5603, 2026-03-26) |
n/a (deleted) |
| 52 | CPU_BackendApi_NONE | REMOVED (PR #5603, 2026-03-26) | n/a
(deleted) |
| 53 | UnitIV/CPU_GraphApiOperationPointwiseOneInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 54 | Unit3IV/CPU_GraphApiOperationPointwiseThreeInput | REMOVED (PR
#5603, 2026-03-26) | n/a (deleted) |
| 55 | UnitVA/CPU_GraphApiOperationMatmul_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |
| 56 | UnitVA/CPU_GraphApiOperationRng_NONE | REMOVED (PR #5603,
2026-03-26) | n/a (deleted) |


| Outcome                            | Count |
|------------------------------------|------:|
| Passed, no leaks                   |    39 |
| Passed, leaks detected             |     0 |
| Failed                             |     0 |
| Crashed / timed out                |     0 |
| Removed — GraphAPI purge (#5603)   |    15 |
| Removed — FP64 LSTM purge (#5750)  |     1 |
| Duplicate (not re-run)             |     1 |
| **Total rows**                     |  **56** |

## Risk Assessment
Low

---------

Co-authored-by: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

migration Tasks or issues tied to migration to this monorepo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant