Skip to content

[hipblaslt] Fix GR distribution and spacing in multi-partition configurations (subtiling)#2

Open
sebvince wants to merge 60 commits into
pgr0_1from
bf16_multi_partition_fixes
Open

[hipblaslt] Fix GR distribution and spacing in multi-partition configurations (subtiling)#2
sebvince wants to merge 60 commits into
pgr0_1from
bf16_multi_partition_fixes

Conversation

@sebvince
Copy link
Copy Markdown
Owner

@sebvince sebvince commented May 6, 2026

Summary

  • Fix GR distribution bug: Corrected the place_GRs algorithm to evenly space global-read atoms across slots using a stride-based approach (i * numSlots // nAtoms), replacing the previous sequential fill that could bunch loads into early slots
  • Clamp atoms to upper bounds: Each atom now carries its upper-bound slot, ensuring it never gets placed past the last valid slot for its tensor/MT pair

Performance

Fixes a performance regression on certain macro-tile sizes with high partition counts in BF16 GEMM (typically with odd or prime number of tiles). Some data (gfx950) :

MT0 MT1 wg0 wg1 before (Pflops) after (Pflops) speedup
352 288 2 2 1.144 1.623 +41.8%
352 256 2 2 1.065 1.590 +49.2%
288 352 2 2 1.130 1.621 +43.5%
256 352 2 2 1.031 1.592 +54.4%
320 304 4 1 1.113 1.595 +43.3%
320 272 4 1 1.076 1.594 +48.1%
256 368 4 1 1.031 1.473 +42.9%

Tests

  • Update following test as GR distribution has changed:

    • test_1x1_k1_DU512
    • test_2x2_k1_DU256
    • test_2x2_k1_DU512
    • test_128x128_fp4_partition_1x1 & test_128x128_fp4_pgr1
  • Add 1x19 BF16 partition test

@sebvince sebvince force-pushed the bf16_multi_partition_fixes branch from 6489969 to 8190761 Compare May 7, 2026 16:43
sebvince pushed a commit that referenced this pull request May 11, 2026
…OCm#7008)

## Motivation

Several memory leaks were detected in MIOpen gtests using ASan. Some of
the tests were blacklisted and others were not. This change looks to fix
all of the low hanging fruit, which are the majority of the leaks found.
This includes all of the critical leaks (>100MB) that were reported.
Some other leaks were identified as needing a larger refactor to
resolve.

After fixing the w_supertensor.cpp leak, the supertensor tests hit
virtual memory area limit errors. These changes are to enable running
them with ASan, but only with a subset of the tests in order to not hit
limits. The test coverage being lost is fairly negligible and the full
tests are still run without ASan enabled.

## Technical Details

Here is a summary of the files looked at and the changes made:

| File | Status | Notes |
|------|--------|-------|
| `test/gtest/gpu_mha_forward.cpp` | Fixed | Added
`miopenDestroySolution()` loop after using solutions. Fixes MHA forward
solution descriptor leaks (report ROCm#11, ROCm#13). |
| `test/gtest/gpu_mha_backward.cpp` | Fixed | Added
`miopenDestroySolution()` loop after using solutions. Fixes MHA backward
solution descriptor leaks (report ROCm#9, ROCm#12). |
| `test/gtest/mha_find20.cpp` | Fixed | Added `miopenDestroySolution()`
loop in both `MhaForward` and `MhaBackward` tests. Fixes MHA Find2.0
solution leak (report ROCm#16). |
| `test/gtest/gtest_desc_guard.hpp` | Fixed | New shared header
introduced by the refactor. Provides a single `DescGuard<DescType,
CreateFn, DestroyFn>` template (with `TensorDescGuard`, `ConvDescGuard`,
`DropoutDescGuard`, `RNNDescGuard` aliases), a `HandleGuard` RAII
wrapper for `miopenHandle_t`, and the
`DestroyInternalRnnDropoutDesc(rnnDesc)` helper used by every
RNN/LSTM/GRU test to free the internal `DropoutDescriptor` that
`miopenCreateRNNDescriptor` allocates and `miopenSetRNNDescriptor*` then
leaks. Replaces the per-file ad-hoc guard structs from the initial
implementation. |
| `test/gtest/w_supertensor.cpp` | Fixed | Switched raw descriptors to
the shared `RNNDescGuard` / `TensorDescGuard` from
`gtest_desc_guard.hpp`. Added a class-local `DestroyDropoutDesc()`
(called from `TearDown` and before `miopenSetRNNDescriptor`) to prevent
the `miopenSetRNNDescriptor` overwrite leak. Reduced test parameter
space under ASan to avoid OOM. Removed unused `seqLen` parameter and
dead `param_dev_out`/`bias_dev_out` allocations. |
| `test/gtest/lstm.hpp` | Fixed | Switched `rnnDesc` → `RNNDescGuard`,
`DropoutDesc` → `DropoutDescGuard`, and `mio_handle` → `HandleGuard`
(which now owns the `miopenDestroy` call), all from the shared
`gtest_desc_guard.hpp`. Hoisted `dropout_state_buf` so it can be
`hipFree`d at the end of the dropout path. Added
`DestroyInternalRnnDropoutDesc(rnnDesc)` before
`miopenSetRNNDescriptor*` and (in the non-dropout path only) at end of
`Run`, which frees the internal `DropoutDescriptor` that the Set call
would otherwise leak. Fixes ~615 MB dropout leaks and ~16.5 KB
non-dropout descriptor leaks (report #3, #4, ROCm#6, ROCm#17-ROCm#22). |
| `test/gtest/gru_test.cpp` | Fixed | In the GRU test class: switched
`rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`,
`mio_handle` → `HandleGuard`, and added
`DestroyInternalRnnDropoutDesc(rnnDesc)` before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of `Run`
plus `hipFree(dropout_state_buf)` for the dropout path. In the in-file
`GRUFwdCPUVerify` / `GRUBwdDataCPUVerify` helpers: converted the raw
`dropout_inputTensor` / `dropout_outputTensor` declarations to
`TensorDescGuard` (mirroring the `cpu_rnn.hpp` change for the LSTM/RNN
helpers). |
| `test/gtest/softmax_find20.cpp` | Fixed | Changed `Finalize()` to take
the `std::vector<miopenSolution_t>&` and destroy each solution via
`miopenDestroySolution()` before destroying the problem. Updated all 6
`TEST(...)` callers to pass the solutions vector. Fixes Find2.0 softmax
solution/kernel leaks (report ROCm#25-ROCm#27). |
| `test/gtest/rnn_seq_api.hpp` | Fixed | Hoisted `dropout_state_buf` so
the dropout path can `hipFree` it at the end. Added
`DestroyInternalRnnDropoutDesc(&rnnDesc)` before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run
to free the internal `DropoutDescriptor` allocations leaked by
`RNNDescriptor` copy-assignment. Same pattern as the LSTM/GRU fixes
(report ROCm#29-ROCm#30). |
| `test/cpu_rnn.hpp` | Fixed | Converted 6 raw `miopenTensorDescriptor_t
dropout_input/outputTensor` declarations across the LSTM/RNN CPU
verification helpers (`LSTMFwdCPUVerify`, `LSTMBwdDataCPUVerify`,
`RNNFwdTrainCPUVerify`, `RNNBwdDataCPUVerify`, `GRUFwdCPUVerify`,
`GRUBwdDataCPUVerify`) to the shared `TensorDescGuard` from
`gtest_desc_guard.hpp`. Removed the redundant
`miopenCreateTensorDescriptor` calls and updated 12 `miopen::deref(...)`
sites to `.get()`. (Note: the GRU helpers in this header are stale
duplicates; the live ones are inside `test/gtest/gru_test.cpp` and were
updated there too.) Fixes the LSTM/GRU CPU-verify tensor descriptor
leaks (report #1, #2, ROCm#14, ROCm#15, ROCm#32, ROCm#33). |
| `test/gtest/rnn_vanilla_common.hpp` | Fixed | Added
`DestroyInternalRnnDropoutDesc(rnnDesc)` calls before
`miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run.
The `RNNDescGuard` / `DropoutDescGuard` usage was already in place from
an earlier commit and now resolves to the shared definitions in
`gtest_desc_guard.hpp`. Same pattern as LSTM/GRU/rnn_seq_api fixes
(report ROCm#14, ROCm#15, ROCm#32, ROCm#33). |
| `test/gtest/graphapi_gtest_common.hpp` | Skipped | File no longer
exists in the codebase. The GraphApi test infrastructure has been
removed; only the leak report (a stale snapshot) still references it. No
fix possible against the current source tree. |
| `test/gtest/graphapi_execution_plan.cpp` | Skipped | File no longer
exists in the codebase (GraphApi removed). The leak it represented was
largely an external hipblaslt bug anyway; the remaining test-side
portion is not fixable against the current source. |
| `test/gtest/na_train.cpp` / `na_inference.cpp` / `na_*_find2.cpp` |
Skipped | Leaks come from the internal MIOpen solver/kernel cache living
on the global singleton handle, which is never destroyed. Not easily
fixable without redesigning global handle lifecycle. Should be
suppressed in the ASan suppression file. |
| hipblaslt / rocblaslt (external) | Skipped |
`SolutionCache::addKernel` and `preloadCustomKernels` leak via
`_rocblaslt_handle` constructor. Called from
`miopen::Handle::CreateHipblasLtHandle`. This is an upstream bug in
hipblaslt/rocblaslt, not fixable in MIOpen. Affects suites that
initialize a handle (report Category 2, Category 7). |
| CLR / HIP runtime (external) | Skipped | `amd::Context` and
`amd::roc::Device` global initialization leaks from
`rocclr/platform/context.cpp`. HIP runtime internals, not fixable in
MIOpen. |
| `src/hipoc/hipoc_program.cpp` | Skipped | `HIPOCProgramImpl` objects
leak during kernel compilation/caching (line ~178). This is an internal
MIOpen kernel cache lifecycle issue that requires deeper architectural
changes to fix. Contributes small amounts to MHA and Softmax Find2.0
leaks. |
| `test/gtest/conv_api.cpp` | Fixed | Already clean against current
source — `miopenDestroyConvolutionDescriptor(conv_desc)` call exists
(line 24) inside the test loop. ASAN run reports no leaks. The leak
report was based on a stale snapshot. The hipblaslt handle init portion
is tracked under the external-skipped row. |
| `test/gtest/log_test.cpp` (CPU_LOG_TEST_FUSION / CPU_LOG_TEST_NEG) |
Fixed | Already clean against current source — `Tensor`, `Conv`,
`CreateCBAFusionPlan`, `CreateBNormFusionPlan` all have proper
destructors that call the corresponding `miopenDestroy*` APIs in
`log.cpp`. ASAN run on `CPU_LOG_TEST_*` (11 tests across log_test.cpp +
log_test_neg.cpp) reports no leaks. The hipblaslt handle init portion is
tracked separately under the external-skipped row. |
| `test/gtest/fusion_test.cpp` (CPU_FusionCreateOpConvForward) | Fixed |
File renamed from `fusion.cpp` to `fusion_test.cpp`. Already clean
against current source — uses `TensorDescGuard`/`ConvDescGuard` for
tensor/conv descriptors and calls
`miopenDestroyFusionPlan(fusionPlanDesc)` on the fusion plan (line 195).
ASAN run on `CPU_FusionCreateOpConvForward_FP32.*` reports no leaks. |
| `test/gtest/deterministic_conv_api.cpp` | Fixed | Already clean
against current source — uses `ConvDescGuard` (line 66) for the conv
descriptor. ASAN run on `*CPU_DeterministicConvApi*` reports no leaks. |
| `test/gtest/fusion_aux.cpp` (GPU_FusionAux) | Fixed | Already clean
against current source — uses `ConvDescGuard` plus stack-allocated
internal C++ objects (`miopen::TensorDescriptor`,
`miopen::FusionPlanDescriptor`) which have proper destructors. The
`convoOp` handle is owned by the fusion plan. ASAN run on
`*GPU_FusionAux*` reports no leaks. |
| `test/gtest/backend_api.cpp` (CPU_BackendApi) | Skipped | File no
longer exists in the codebase. The backend API test infrastructure (part
of the removed GraphApi suite) was removed; no fix possible against the
current source tree. |


### High-level notes

New shared infrastructure (test/gtest/gtest_desc_guard.hpp)
- DescGuard<DescType, CreateFn, DestroyFn> — a single RAII template
parameterized on the descriptor type and its create/destroy entry
points. Aliases provide TensorDescGuard, ConvDescGuard,
DropoutDescGuard, and RNNDescGuard, replacing the four near-identical
guard structs
  that were copy-pasted across test files in the initial implementation.
- HandleGuard — separate RAII wrapper for miopenHandle_t (couldn't reuse
the template because miopenCreateWithStream takes an extra hipStream_t
argument). Supports lazy create(stream) so callers that only need a
handle in the dropout branch can default-construct one and
  populate it conditionally.
- DestroyInternalRnnDropoutDesc(rnnDesc) — frees the internal
DropoutDescriptor that miopenCreateRNNDescriptor allocates and that
miopenSetRNNDescriptor* then orphans. Replaces the equivalent inline
blocks that LSTM/GRU/RNN tests were each carrying. The header documents
the
two call-sites: before each Set* (always safe) and at end-of-run only on
the non-dropout path (the dropout path aliases the user-owned
descriptor, so freeing would double-free).

Recurring patterns enabled by the refactor
- The "leak from Set* overwriting the default-constructed internal
dropout descriptor" fix collapsed from per-file code to a one-line
helper call, applied uniformly across lstm.hpp, gru_test.cpp,
rnn_seq_api.hpp, and rnn_vanilla_common.hpp.
- mio_handle ownership in LSTM/GRU is now expressed via HandleGuard
rather than a manual miopenDestroy at the end of the dropout branch —
eliminates a class of forgotten-cleanup bugs.
- dropout_state_buf is consistently hoisted out of the dropout if block
so an end-of-run hipFree can release it; deletion of the buffer pairs
visibly with its allocation.

Notable non-RNN change
- softmax_find20.cpp was the only Find2.0 leak fix in this commit:
Finalize() now takes the solutions vector and calls
miopenDestroySolution() for each before destroying the problem. Same
shape applied to all 6 tests in the file.


## Test Plan

Run the tests beforehand to observe the ASan leak errors and then again
afterward to verify the fixes have resolved the problem.

## Test Result

List from ROCM-21512:

| # | Test Name | Status | Leak Status |

|---:|------------------------------------------------------|------------------------------------------------|-----------------|
| 1 | Smoke/GPU_RNNVanillaDropout_FP32 | PASSED (4 tests) | No leaks |
| 2 | Smoke/GPU_RNNVanillaDropout_FP16 | PASSED (4 tests) | No leaks |
| 3 | Full/GPU_LSTM_dropout_FP32 | PASSED (4 tests) | No leaks |
| 4 | Full/GPU_LSTM_dropout_FP16 | PASSED (4 tests) | No leaks |
| 5 | CPU_GraphApiExecutionPlanBuilder_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |
| 6 | Full/GPU_LSTM_dropout_FP64 | REMOVED (PR ROCm#5750, 2026-03-26) | n/a
(deleted) |
| 7 | Unit/CPU_GraphApiPointwise_NONE | REMOVED (PR ROCm#5603, 2026-03-26) |
n/a (deleted) |
| 8 | Full/GPU_LstmMSRnn_FP32 | PASSED (1152 tests) | No leaks |
| 9 | Smoke/GPU_Bwd_Mha_FP32 | PASSED (12 tests) | No leaks |
| 10 | Full/GPU_LstmMSRnn_FP16 | PASSED (864 tests) | No leaks |
| 11 | Smoke/GPU_Fwd_Mha_FP32 | PASSED (15 tests) | No leaks |
| 12 | Full/GPU_Bwd_Mha_FP32 | PASSED (6 tests) | No leaks |
| 13 | Full/GPU_Fwd_Mha_FP32 | PASSED (7 tests) | No leaks |
| 14 | Full/GPU_RNNVanilla_FP32 | PASSED (96 tests) | No leaks |
| 15 | Full/GPU_RNNVanilla_FP16 | PASSED (96 tests) | No leaks |
| 16 | GPU_TestMhaFind20_FP32 | PASSED (2 tests) | No leaks |
| 17 | Full/GPU_LSTM_FP32 | PASSED (32 tests) | No leaks |
| 18 | Full/GPU_LSTM_FP16 | PASSED (32 tests) | No leaks |
| 19 | Full/GPU_LSTM_extra_FP32 | PASSED (30 tests) | No leaks |
| 20 | Full/GPU_LSTM_extra_FP16 | PASSED (30 tests) | No leaks |
| 21 | Full/GPU_DeepBench_LSTM_FP16 | PASSED (22 tests) | No leaks |
| 22 | Full/GPU_DeepBench_LSTM_FP32 | PASSED (22 tests) | No leaks |
| 23 | CPU_LOG_TEST_FUSION_NONE | PASSED (2 tests) | No leaks |
| 24 | CPU_LOG_TEST_NEG_NONE | PASSED (4 tests) | No leaks |
| 25 | GPU_SoftmaxFind20_BFP16 | PASSED (2 tests) | No leaks |
| 26 | GPU_SoftmaxFind20_FP16 | PASSED (2 tests) | No leaks |
| 27 | GPU_SoftmaxFind20_FP32 | PASSED (2 tests) | No leaks |
| 28 | CPU_ConvApi_NONE | PASSED (1 test) | No leaks |
| 29 | Full/GPU_RNNSeqApi_FP16 | PASSED (16 tests) | No leaks |
| 30 | Full/GPU_RNNSeqApi_FP32 | PASSED (16 tests) | No leaks |
| 31 | UnitVAN/CPU_GraphApiRng_NONE | REMOVED (PR ROCm#5603, 2026-03-26) |
n/a (deleted) |
| 32 | Smoke/GPU_RNNVanilla_FP16 | PASSED (4 tests) | No leaks |
| 33 | Smoke/GPU_RNNVanilla_FP32 | PASSED (4 tests) | No leaks |
| 34 | CPU_FusionCreateOpConvForward_FP32 | PASSED (1 test) | No leaks |
| 35 | CPU_GraphApiOperationReduction_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |
| 36 | Unit2IV1/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR
ROCm#5603, 2026-03-26) | n/a (deleted) |
| 37 | Unit2IV1/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR
ROCm#5603, 2026-03-26) | n/a (deleted) |
| 38 | Unit2IV2/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR
ROCm#5603, 2026-03-26) | n/a (deleted) |
| 39 | Unit2IV2/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR
ROCm#5603, 2026-03-26) | n/a (deleted) |
| 40 | UnitVAB/CPU_GraphApiRng_NONE | REMOVED (PR ROCm#5603, 2026-03-26) |
n/a (deleted) |
| 41 | Smoke/GPU_RNNVanillaDropout_FP16 (duplicate of #2) | (see #2) |
(see #2) |
| 42 | CPU_GraphApiOperationGraphDescriptor_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |
| 43 | UnitVA/CPU_GraphApiVariantPack_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |
| 44 | UnitVAU/CPU_GraphApiRng_NONE | REMOVED (PR ROCm#5603, 2026-03-26) |
n/a (deleted) |
| 45 | CPU_GraphApiOperationReshape_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |
| 46 | Smoke/CPU_DeterministicConvApi_NONE | PASSED (1 test) | No leaks
|
| 47 | Smoke/GPU_FusionAux_FP32 | PASSED (1 test) | No leaks |
| 48 | CPU_GraphApiEngineHeur_NONE | REMOVED (PR ROCm#5603, 2026-03-26) |
n/a (deleted) |
| 49 | Unit/CPU_GraphApiReduction_NONE | REMOVED (PR ROCm#5603, 2026-03-26)
| n/a (deleted) |
| 50 | CPU_GraphApiEngineCfg_NONE | REMOVED (PR ROCm#5603, 2026-03-26) | n/a
(deleted) |
| 51 | Unit/CPU_GraphApiMatMul_NONE | REMOVED (PR ROCm#5603, 2026-03-26) |
n/a (deleted) |
| 52 | CPU_BackendApi_NONE | REMOVED (PR ROCm#5603, 2026-03-26) | n/a
(deleted) |
| 53 | UnitIV/CPU_GraphApiOperationPointwiseOneInput | REMOVED (PR
ROCm#5603, 2026-03-26) | n/a (deleted) |
| 54 | Unit3IV/CPU_GraphApiOperationPointwiseThreeInput | REMOVED (PR
ROCm#5603, 2026-03-26) | n/a (deleted) |
| 55 | UnitVA/CPU_GraphApiOperationMatmul_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |
| 56 | UnitVA/CPU_GraphApiOperationRng_NONE | REMOVED (PR ROCm#5603,
2026-03-26) | n/a (deleted) |


| Outcome                            | Count |
|------------------------------------|------:|
| Passed, no leaks                   |    39 |
| Passed, leaks detected             |     0 |
| Failed                             |     0 |
| Crashed / timed out                |     0 |
| Removed — GraphAPI purge (ROCm#5603)   |    15 |
| Removed — FP64 LSTM purge (ROCm#5750)  |     1 |
| Duplicate (not re-run)             |     1 |
| **Total rows**                     |  **56** |

## Risk Assessment
Low

---------

Co-authored-by: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>
@sebvince sebvince force-pushed the bf16_multi_partition_fixes branch from 8190761 to 2ed636d Compare May 11, 2026 08:09
@sebvince sebvince marked this pull request as ready for review May 12, 2026 15:15
@sebvince sebvince force-pushed the bf16_multi_partition_fixes branch from 2ed636d to 5954924 Compare May 13, 2026 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant