Initial workflows #1

Merged
jayhawk-commits merged 2 commits into develop from initialWorkflows
Apr 23, 2025

@jayhawk-commits
Collaborator

No description provided.

@jayhawk-commits jayhawk-commits requested a review from Copilot April 23, 2025 18:53
@jayhawk-commits jayhawk-commits self-assigned this Apr 23, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces initial workflows and supporting scripts to manage the ROCm monorepo, including setups for subtree synchronization, CODEOWNERS merging, and GitHub label management.

  • Updated README.md with a quick start guide for developers
  • Added multiple GitHub Actions workflows for monorepo setup, label collection/application, and subtree synchronization
  • Introduced Python scripts to merge CODEOWNERS and manage labels across repositories

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| README.md | Updated guide with details on sparse checkout and repo history |
| .github/workflows/update-subtrees.yml | New workflow to synchronize subtrees for updating repositories |
| .github/workflows/merge-codeowners.yml | Workflow to merge CODEOWNERS files from sub-repos |
| .github/workflows/initial-setup.yml | Workflow to initialize the monorepo by adding repositories via subtree |
| .github/workflows/collect-labels.yml | Workflow for collecting labels from source repos |
| .github/workflows/apply-labels.yml | Workflow for applying and updating labels in the monorepo |
| .github/scripts/merge-codeowners.py | Python script to merge CODEOWNERS entries across subdirectories |
| .github/scripts/collect-labels.py | Python script to fetch and aggregate labels from GitHub repos |
| .github/scripts/apply-labels.py | Python script to create or update labels based on the generated YAML |

Files not reviewed (1)
  • .github/repos-config.json: Language not supported

Comment thread .github/workflows/update-subtrees.yml Outdated
@jayhawk-commits jayhawk-commits merged commit 6c1a640 into develop Apr 23, 2025
@jayhawk-commits jayhawk-commits deleted the initialWorkflows branch April 23, 2025 18:54
assistant-librarian Bot pushed a commit that referenced this pull request May 13, 2025
* finished sort keys no size tests

* completed sort pair no size functions

* completed sort pair with size functions

* updated changelog
@jayhawk-commits jayhawk-commits added the github actions and migration labels (migration: Tasks or issues tied to migration to this monorepo) May 17, 2025
assistant-librarian Bot pushed a commit that referenced this pull request May 28, 2025
* Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx (#611)

Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@43.0.1...44.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* added test_discrete_distribution file

* finished discrete_alias tests

* temp commit

* merged in cmake fix

* started updating functions

* added cdf tests (not done need to modify)

* reformatted tests

* finalized discrete_distribution tests

* updated changelog

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
idass1990 pushed a commit that referenced this pull request Jun 13, 2025
* Add cmake files



[ROCm/hipBLAS-common commit: 9666c57]
rahulc-gh pushed a commit that referenced this pull request Jun 25, 2025
ammallya pushed a commit that referenced this pull request Jun 25, 2025
[ROCm/mxDataGenerator commit: e74b2c1]

[ROCm/mxDataGenerator commit: 8efb356]
ammallya pushed a commit that referenced this pull request Jul 24, 2025
[ROCm/mxDataGenerator commit: e74b2c1]
eidenyoshida pushed a commit that referenced this pull request Jul 25, 2025
ammallya pushed a commit that referenced this pull request Sep 24, 2025
…coverage (#1)

* test if I can push directly to develop anymore

* add boost dependency in the backend dll

* setup lfs, bring over attribute types from miopen for now

* add a make target called check which will discover all tests via cmake and then run them using ctest

* enable code coverage via cmake and a make target of code_coverage

* make ignore regex use deps for path
ammallya pushed a commit that referenced this pull request Sep 24, 2025

[ROCm/hipDNN commit: d2f26e6]
JoeLiuAMD added a commit that referenced this pull request Oct 8, 2025
…nel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

> **📝 Note**: This PR has a follow-up PR, #1993.

## Motivation

MIOpen's generic search algorithm suffers from a **race condition** that
causes optimal kernels to be randomly rejected, leading to 3-4x
performance degradation in some cases.

### Problem Description

When the same convolution workload is run multiple times, as in the
sample below:
```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**
- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per
operation (**3.35x slower**)


[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)

[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines
559-564):

```cpp
// Original buggy code
if(n_current == 0)  // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first
configuration tested:
- **First kernel** (n_current == 0): Gets warm-up → Fair performance
measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start
penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the
optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal
kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which
configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:

Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
Kernel #1 (64,64,64,32,8,8...):   1 sample  → 1.219 ms         → rejected (cold-start!)
Kernel #2 (64,64,16,32,8,8...):   1 sample  → 3.0267 ms        → rejected (cold-start!)
Kernel #3 (64,16,64,32,8,8...):   1 sample  → 0.482 ms         → rejected (cold-start!)

Final execution: 0.332 ms (using Kernel #0)

Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
       With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):
- `DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
- Sample 1: **3.027 ms** ← Cold start! (30x slower than true
performance)
- Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359,
0.347, 0.352 ms
- **True mean**: 0.354 ms (excluding cold-start outlier)
- **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
- **Wrong outcome**: Best kernel discarded due to unfair cold-start
penalty
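
The early-stop decision quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `rejected_by_early_stop` is hypothetical, and it assumes the rule is exactly `sample > best_so_far * 1.1`, as the log line suggests, with 0.377 ms taken as the best time seen so far.

```python
from typing import List

# Values copied from the log excerpt above; the rule itself is an assumption
# inferred from "Rejected by early-stop (3.027 > 0.377 x 1.1)".
BEST_SO_FAR_MS = 0.377
THRESHOLD_FACTOR = 1.1

COLD_SAMPLE_MS = 3.027
WARM_SAMPLES_MS: List[float] = [
    0.369, 0.349, 0.366, 0.352, 0.353,
    0.365, 0.352, 0.359, 0.347, 0.352,
]


def rejected_by_early_stop(sample_ms: float,
                           best_ms: float = BEST_SO_FAR_MS,
                           factor: float = THRESHOLD_FACTOR) -> bool:
    """Hypothetical helper: True if a single sample exceeds the cutoff."""
    return sample_ms > best_ms * factor


if __name__ == "__main__":
    print("cutoff (ms):", BEST_SO_FAR_MS * THRESHOLD_FACTOR)
    print("cold sample rejected:", rejected_by_early_stop(COLD_SAMPLE_MS))
    print("any warm sample rejected:",
          any(rejected_by_early_stop(s) for s in WARM_SAMPLES_MS))
```

With these numbers the cutoff is about 0.415 ms, so only the cold first sample trips the check; every warm sample listed above would have passed it.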

---

## Technical Details

### Changes

This PR contains **only the bug fix** - removing the unfair warm-up
condition:

```diff
-                // Warm-up run for first time invoker is used
-                if(n_current == 0)
-                {
-                    invoker(profile_h, invoke_ctx);
-                    profile_h.ResetKernelTime();
-                }
+                // Warm-up run for every configuration to eliminate cold-start bias
+                invoker(profile_h, invoke_ctx);
+                profile_h.ResetKernelTime();
```

**File modified:**
`projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564)

**Change summary:**
- 3 insertions(+), 6 deletions(-)
- Removes `if(n_current == 0)` condition
- Ensures every configuration receives one warm-up run before
measurement
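
To make the effect of the change concrete, here is a small simulation of the search loop, written in Python rather than the C++ of `generic_search.hpp`. It is a sketch under stated assumptions, not MIOpen's implementation: `make_kernel`, `search`, the timings in `CONFIGS`, and the 1.1 early-stop factor are all hypothetical stand-ins chosen to mimic the behavior described above.

```python
from typing import Callable, List, Tuple


def make_kernel(cold_ms: float, warm_ms: float) -> Callable[[], float]:
    """Hypothetical kernel stand-in: the first call is slow, later calls are fast."""
    calls = 0

    def run() -> float:
        nonlocal calls
        calls += 1
        return cold_ms if calls == 1 else warm_ms

    return run


def search(configs: List[Tuple[str, float, float]], warm_up_all: bool,
           samples: int = 3, factor: float = 1.1) -> str:
    """Return the name of the config with the lowest mean time."""
    best_name, best_ms = "", float("inf")
    for n_current, (name, cold_ms, warm_ms) in enumerate(configs):
        kernel = make_kernel(cold_ms, warm_ms)
        # Old behavior: warm up only when n_current == 0.
        # New behavior: warm up every configuration.
        if warm_up_all or n_current == 0:
            kernel()  # warm-up run, timing discarded
        first = kernel()
        if first > best_ms * factor:
            continue  # early stop: first sample is too slow
        times = [first] + [kernel() for _ in range(samples - 1)]
        mean = sum(times) / len(times)
        if mean < best_ms:
            best_name, best_ms = name, mean
    return best_name


# Made-up timings (name, cold ms, warm ms); "B" is the truly faster kernel.
CONFIGS = [("A", 2.0, 0.34), ("B", 3.0, 0.10)]

if __name__ == "__main__":
    print("warm-up first only:", search(CONFIGS, warm_up_all=False))
    print("warm-up every config:", search(CONFIGS, warm_up_all=True))
```

In this toy setup the first-only warm-up keeps "A" because the cold first sample of "B" trips the early stop, while warming up every configuration lets "B" be measured fairly and selected.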

### Why This is Low Risk

1. **Minimal code change**: Only 4 lines changed
2. **No algorithm change**: Same sampling strategy, same early-stop
logic
3. **Only ensures fairness**: All configs now receive identical warm-up
treatment
4. **No performance regression**: Adds one extra kernel call per config
(~0.3ms overhead per config)
5. **Negligible overhead**: For a typical 4-config search, adds 1.2ms
total (kernel compilation takes 10-30 seconds, so overhead is roughly
0.004-0.012%)

---

## Test Plan

### Test Environment
- **Hardware**: MI355X (gfx950)
- **ROCm Version**: 7.0.2 (HIP 7.0.51831)
- **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel
configurations)

### Test Command
```bash
export MIOPEN_LOG_LEVEL=5
export MIOPEN_FIND_MODE=1
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

### Test Results

#### Before Fix (with bug):
- **Success rate**: 6/10 runs selected optimal kernel (40% error rate)
- **Failure pattern**: Optimal kernel rejected when its cold-start time
triggered early-stop
- **Performance impact**: Up to 4x slower when wrong kernel selected
(0.332ms vs 0.099ms)

#### After Fix:
- **Success rate**: 10/10 runs selected optimal kernel (0% error rate)
- **Consistency**: All configurations receive fair warm-up
- **Performance**: Optimal kernel always selected, no degradation
- **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation
time)

---

## Performance Impact

### Search Time Overhead
- **Additional cost**: 1 extra warm-up run for each configuration after
the first (the first configuration already received one before this change)

### Accuracy Improvement
- **Before**: 60% success rate (6/10 runs correct)
- **After**: 100% success rate (10/10 runs correct)
- **Performance gain**: Eliminates 4x slowdown from selecting wrong
kernel

---

## Backward Compatibility

✅ **Fully compatible** 
- No API changes
- No behavior changes except for fixing the bug
- All existing tests pass
- No impact on already-cached kernels (find database not affected)

---

## Why Split into Two PRs?

Following reviewer feedback, this work has been split into two separate
PRs:

### **PR1 (This PR) - Bug Fix: Warm-up Bias**
- **Risk**: Low (4 lines changed)
- **Impact**: Fixes root cause of unfair kernel comparison
- **Decision**: Ready for immediate merge
- **Rationale**: Without fair warm-up, no amount of threshold tuning can
fix the problem. Cold-start penalties (30-100x slower) make any single
threshold value inadequate.

### **PR2 (Separate PR) - Optimization: Early-Stop Strategy**
- **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993
- **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging
- **Risk**: Medium (affects benchmark timing)
- **Decision**: Needs more validation and benchmarking
- **Rationale**: These optimizations improve accuracy further (40% → 0%
error rate) but add ~2 kernel executions per config. The performance
impact needs separate evaluation.

### Why This Approach?
- **PR1 can merge immediately**: Fixes the critical bug with minimal
risk
- **PR2 can be validated thoroughly**: Performance trade-offs can be
evaluated independently
- **Easier to isolate regressions**: If issues arise, we know which
change caused them
- **Progressive improvement**: Get the bug fix deployed while
optimizations are being validated


---

## Submission Checklist

- [x] I have read and agreed with the [contributing
guidelines](CONTRIBUTING.md)
- [x] The changes are minimal and focused on the bug fix
- [x] All existing tests pass
- [x] The fix has been verified on target hardware (MI355X/gfx950)
- [x] The fix eliminates the non-deterministic kernel selection issue
- [x] No performance regression introduced
- [x] Documentation (commit message) clearly explains the problem and
solution
- [x] Backward compatibility confirmed
- [x] Test data and logs provided for verification
ibrahimw1 added a commit that referenced this pull request Nov 21, 2025
@NB4444 NB4444 mentioned this pull request Nov 27, 2025
1 task
nakajee pushed a commit to nakajee/rocm-libraries that referenced this pull request Mar 31, 2026
* Add sample subtile impl

* Move allocOffsetRegisters before setupNewTile

* Start adding GR offset calculation

* Rest of logic (no swizzling)

* refacto

* spgr offsets

* Add newserial code

* Add script to debug offsets

* Add unit test for GR offset calculation

* Grid display

* Fix both code and ref test function

* Add DPP quad perm to rocisa

* Apply swizzling (no rotation yet)

* Function swizzling + rotation + test

* Refactor test to have a single output array + add test for SGPRs

* Add debug mode to test + add dynamic wavegroup calculation based on MT

* Fix test runtime issue and check all vgpr offsets

* Add ref test code for 1x4 & 4x1

* Fix tests

* Fixed SGPR offset calculation for 2x2

* Fix more tests

* Add more tests

* Refactor tests

* simplify tests

* Remove unused script

* cleanup

* fix camelCase in ref test code

* cleanup

* Fix typo

---------

Co-authored-by: brianshi <brianshi@amd.com>
nakajee pushed a commit to nakajee/rocm-libraries that referenced this pull request Mar 31, 2026
…ernel (ROCm#6)

* Add sample subtile impl

* Fix issues when disabling subtile impl

* GR Offset calculation (ROCm#1)

* Add sample subtile impl

* Move allocOffsetRegisters before setupNewTile

* Start adding GR offset calculation

* Rest of logic (no swizzling)

* refacto

* spgr offsets

* Add newserial code

* Add script to debug offsets

* Add unit test for GR offset calculation

* Grid display

* Fix both code and ref test function

* Add DPP quad perm to rocisa

* Apply swizzling (no rotation yet)

* Function swizzling + rotation + test

* Refactor test to have a single output array + add test for SGPRs

* Add debug mode to test + add dynamic wavegroup calculation based on MT

* Fix test runtime issue and check all vgpr offsets

* Add ref test code for 1x4 & 4x1

* Fix tests

* Fixed SGPR offset calculation for 2x2

* Fix more tests

* Add more tests

* Refactor tests

* simplify tests

* Remove unused script

* cleanup

* fix camelCase in ref test code

* cleanup

* Fix typo

---------

Co-authored-by: brianshi <brianshi@amd.com>

* Enable post-loop code generation, and add some subroutines

* LR offset calculation (ROCm#2)

* Add tests

* as is

* Add permlane16_swap instruction to rocisa

* Ongoing progress

* Draft for partition A0/A1

* Wave partitioning

* Draft ref code in tests

* Handle 1x4 wavesplit param

* 2x2 test passing

* Draft 1x4 LR wave partitioning

* Fix alignment issue

* Integration testing

* Update integration test

* Fix swizzling pattern on GRA. Only swizzling on even LDS rows

* Subtile based test

* testing A

* Test both A and B

* Remove graonly mode

* Fix 1x4 case

* Move global offset for B after rest of the logic

* cleanup

* cleanup

* Fix ref test code for 4x1

* Fix spgr alloc issue

* Remove tmp test file

* Remove debug prints

* Add test case

* Add GR load emit logic, and misc fixes (ROCm#3)

* gr emit fix

* Emit LR + init ACCVGPR (ROCm#4)

* Emit ds_reads

* Add waits for LR and GR

* Init Acc VGPR to Zero

* Add missing bit_length on VLShiftLeftB32

* Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness

* Fix gra test ref code for 1x4

* Remove some debug prints

* Add loop and ptr update code

* Update scale offset

* Add tests

* Address review

* Add scale roundtrip e2e test and constraint assertions

Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency
across 4 tile configs x 2 matrices. Add power-of-2 assertion for
scaleBlockSize and matching scaleBlockSize assertions for A/B in
shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes
instead of re-deriving MIWaveGroup from tile dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update fixes

* Fix scale being skipped

* Add flag to print layout

* Fix missed merge conflicts

* Fix missed merge conflicts

* Refactor scale roundtrip test with gpu helper fns

* Fix extra spaces

* Fix tests

---------

Co-authored-by: brianshi <brianshi@amd.com>
Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com>
Co-authored-by: b-shi <bbbrianme@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
sebvince referenced this pull request in sebvince/rocm-libraries Apr 3, 2026
sebvince referenced this pull request in sebvince/rocm-libraries Apr 3, 2026
newling pushed a commit to newling/rocm-libraries that referenced this pull request Apr 9, 2026
newling pushed a commit to newling/rocm-libraries that referenced this pull request Apr 9, 2026
vidyasagar-amd pushed a commit that referenced this pull request Apr 9, 2026
)

## Motivation
Fix a bug in the smart-build --ctest-only filter that was incorrectly
excluding tests with numbers less than 100.

## Technical Details
The issue was caused by CTest formatting test numbers with variable
spacing based on the number of digits:
  - "Test   `#1`: name" (3 spaces for tests 1-9)
  - "Test  `#79`: name" (2 spaces for tests 10-99)
  - "Test `#100`: name" (1 space for tests 100+)

The previous code used `line.strip().startswith("Test #")` which only
matched tests with a single space (i.e., test numbers >= 100).

This caused tests like ck_tile_unit_sequence (Test #79) to be excluded
from smart-build test selection, resulting in CTest failures when the
binary wasn't built.

Solution: Replace string matching with a regex pattern that handles
all spacing variations: r'^\s*Test\s+#\d+:\s*(.+)$'
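
The difference between the two matching strategies can be seen in a short, self-contained sketch. The regex is the one quoted above; everything else (the helper names `names_by_prefix` and `names_by_regex`, the sample lines, and the test names other than `ck_tile_unit_sequence`) is hypothetical and only meant to resemble CTest's listing format.

```python
import re
from typing import List

# Pattern quoted in the commit message above.
TEST_LINE = re.compile(r'^\s*Test\s+#\d+:\s*(.+)$')

# Hypothetical sample resembling a CTest test listing.
SAMPLE_OUTPUT = [
    "Test project /tmp/build",
    "  Test   #1: alpha_test",
    "  Test  #79: ck_tile_unit_sequence",
    "  Test #100: gamma_test",
    "",
    "Total Tests: 3",
]


def names_by_prefix(lines: List[str]) -> List[str]:
    """Previous approach: only matches when exactly one space precedes '#'."""
    names = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("Test #"):
            names.append(stripped.split(":", 1)[1].strip())
    return names


def names_by_regex(lines: List[str]) -> List[str]:
    """Regex approach: tolerates any amount of whitespace before '#'."""
    names = []
    for line in lines:
        match = TEST_LINE.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


if __name__ == "__main__":
    print("prefix:", names_by_prefix(SAMPLE_OUTPUT))
    print("regex: ", names_by_regex(SAMPLE_OUTPUT))
```

On this sample the prefix check finds only the three-digit test, while the regex finds all three and still ignores the header and summary lines.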

## Test Plan
Tested with test numbers from 1 to 12345.

## Test Result
  - Before: 48 tests selected (only tests #100+)
  - After: 146 tests selected (all CTest-registered tests)



## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
…sibling]

Pure refactor: generalize columnMajorGemm from float-only to
template<AccumT, MathOpAccumT> with defaults <float, float>.

When MathOpAccumT != AccumT, each A/B operand is cast through
MathOpAccumT before multiply (e.g. XFloat32 truncates mantissa).
With default args, casts are no-ops and codegen is identical.

No behavior change — existing tests validate.
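
As a rough illustration of the idea (in Python, not the C++ templates this commit touches), a column-major GEMM can take the operand cast as a parameter. The function name `column_major_gemm` and the `math_op_cast` argument are hypothetical stand-ins for the `MathOpAccumT` conversion; Python floats are doubles, so this only shows where the cast is applied, not the real precision behavior.

```python
import math
from typing import Callable, List


def column_major_gemm(a: List[float], b: List[float], m: int, n: int, k: int,
                      math_op_cast: Callable[[float], float] = lambda x: x
                      ) -> List[float]:
    """C = A * B with all matrices stored column-major.

    A is m x k, B is k x n, C is m x n. Each operand is passed through
    math_op_cast before the multiply; the default cast is a no-op.
    """
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += math_op_cast(a[i + p * m]) * math_op_cast(b[p + j * k])
            c[i + j * m] = acc
    return c


if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column-major.
    a = [1.0, 3.0, 2.0, 4.0]
    b = [5.0, 7.0, 6.0, 8.0]
    print(column_major_gemm(a, b, 2, 2, 2))
    # A lossy cast changes the product: trunc(1.9) * trunc(2.9) = 1 * 2.
    print(column_major_gemm([1.9], [2.9], 1, 1, 1,
                            math_op_cast=lambda x: float(math.trunc(x))))
```

Keeping the cast as a defaulted parameter mirrors the "defaults are no-ops" property described above: callers that do not pass a cast get the same arithmetic as before.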

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
Add --type tf32 CLI option: uses float storage with XFloat32 math-op
truncation (10-bit mantissa). The slow reference path already handles
f32XdlMathOp == XFloat32 via ReferenceSolution<TypedGemm_S_S_S, float,
XFloat32>.

Changes:
- Add isTF32 parameter to runGemm, sets f32XdlMathOp on contraction
- Golden reference uses columnMajorGemm<float, XFloat32> for TF32
- TF32 validation tolerance set to 1.0f (13 mantissa bits lost)
- Console output shows MathOp=XFloat32 for TF32 runs
- 10 new slow-path tf32 tests (transpose combos, beta, bias, features,
  scaleAB)

Fast path still rejects XFloat32 — fast-path tf32 tests come next.
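
For readers unfamiliar with the format, the "10-bit mantissa, 13 bits lost" description can be sketched by masking the low 13 mantissa bits of an IEEE 754 float32. This is an assumption-laden illustration: the helper name `truncate_to_tf32` is hypothetical, and it truncates toward zero by bit masking, which may differ from how the library's XFloat32 conversion rounds.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Hypothetical helper: keep the sign, 8 exponent bits, and top 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 of the 23 float32 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0, 1.0 + 2.0 ** -10, 1.0 + 2.0 ** -11, 3.14159):
        print(value, "->", truncate_to_tf32(value))
```

A step of 2^-10 above 1.0 survives the mask, while a step of 2^-11 is dropped, which is the kind of precision loss a looser validation tolerance has to absorb.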

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
#8, stacked on #1, #3]

Template solveCPUFastInF32 on MathOpAccumT so the inner reduction
applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the
XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU
dispatch to branch on f32XdlMathOp.

When MathOpAccumT=float (default), casts are no-ops with identical
codegen to the previous implementation.

Changes:
- solveCPUFastInF32 gains template<MathOpAccumT=float>
- innerReduction uses float(MathOpAccumT(val)) cast chain
- isFastPathEligible no longer rejects XFloat32
- SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32
- 10 new fast-path tf32 tests (20 total tf32 tests now)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
…, #12, stacked on #1, #2]

Add double-precision GEMM support to the CPU GEMM driver:

- Add TypeTraits<double> mapping to DataType::Double
- Add --type f64 branch calling runGemm<double, double>
- Type all auxiliary buffers (bias, scaleAlphaVec, scaleAB) as
  AccumulateT so the slow path's GetValue(alphaType, ...) reads
  the correct byte width when alphaType=Double
- Set alphaType/betaType to accumDtypeEnum so TypeId dispatch
  matches TypedGemm_D_D_D for the slow reference solver
- Register bias/scale tensors with accumDtypeEnum
- Add 10 f64 slow-path tests: 4 transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 10, 2026
…#3, #5, stacked on #1, #2, #5]

Generalize the fast path to support double-precision accumulation:

- Template ShadowBuffer<AccumT>: storage, pointer, and element access
  are all AccumT. Float/Double inputs zero-copy when AccumT matches;
  sub-float types go through float then widen.
- Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType>
  (renamed from loadToFloat/storeFromFloat).
- Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>:
  all tile registers, inner reduction, epilogue, alpha/beta extraction,
  bias reading, and activation args use AccumT.
- Add Double to isFastPathEligible's supported input/output types.
- SolveGemmCPU dispatch: route Double to solveCPUFast<double>.
- Add 10 f64 fast-path tests (transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
#8, stacked on #1, #3]

Template solveCPUFastInF32 on MathOpAccumT so the inner reduction
applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the
XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU
dispatch to branch on f32XdlMathOp.

When MathOpAccumT=float (default), casts are no-ops with identical
codegen to the previous implementation.

Changes:
- solveCPUFastInF32 gains template<MathOpAccumT=float>
- innerReduction uses float(MathOpAccumT(val)) cast chain
- isFastPathEligible no longer rejects XFloat32
- SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32
- 10 new fast-path tf32 tests (20 total tf32 tests now)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
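
The dispatch described above, a runtime branch on `f32XdlMathOp` selecting a compile-time instantiation, might look roughly like this. Every name in the block (`MathOpSketch`, `XFloat32Tag`, `solveCPUFastInF32Sketch`, `dispatchSketch`) is a hypothetical placeholder, and the solver body is replaced by a string so the selected instantiation is visible.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical runtime tag mirroring the f32XdlMathOp setting.
enum class MathOpSketch { Float, XFloat32 };

// Placeholder for the real XFloat32 type; only its identity matters here.
struct XFloat32Tag {};

// Stand-in for the templated fast solver: reports which instantiation ran.
template <typename MathOpAccumT = float>
std::string solveCPUFastInF32Sketch()
{
    return std::is_same<MathOpAccumT, XFloat32Tag>::value ? "fast<XFloat32>"
                                                          : "fast<float>";
}

// Branch once on the runtime tag, then call a compile-time instantiation.
std::string dispatchSketch(MathOpSketch op)
{
    switch (op)
    {
    case MathOpSketch::XFloat32: return solveCPUFastInF32Sketch<XFloat32Tag>();
    case MathOpSketch::Float:    return solveCPUFastInF32Sketch<>();
    }
    assert(false && "unhandled math op");
    return std::string();
}
```

Branching once outside the hot loop keeps the inner reduction free of runtime checks, which is consistent with the message's claim that the default instantiation generates the same code as before.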
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
…, #12, stacked on #1, #2]

Add double-precision GEMM support to the CPU GEMM driver:

- Add TypeTraits<double> mapping to DataType::Double
- Add --type f64 branch calling runGemm<double, double>
- Type all auxiliary buffers (bias, scaleAlphaVec, scaleAB) as
  AccumulateT so the slow path's GetValue(alphaType, ...) reads
  the correct byte width when alphaType=Double
- Set alphaType/betaType to accumDtypeEnum so TypeId dispatch
  matches TypedGemm_D_D_D for the slow reference solver
- Register bias/scale tensors with accumDtypeEnum
- Add 10 f64 slow-path tests: 4 transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
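
Why the auxiliary buffers and `alphaType` must agree can be illustrated with a small sketch. `DataTypeSketch`, `TypeTraitsSketch`, and `getValueSketch` are hypothetical names modelled on the `TypeTraits<double>` mapping and `GetValue(alphaType, ...)` call mentioned in the message; the real signatures are not shown in this thread.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical runtime type tag.
enum class DataTypeSketch { Float, Double };

// Compile-time mapping from a C++ type to its runtime tag.
template <typename T>
struct TypeTraitsSketch;

template <>
struct TypeTraitsSketch<float>
{
    static constexpr DataTypeSketch id = DataTypeSketch::Float;
};

template <>
struct TypeTraitsSketch<double>
{
    static constexpr DataTypeSketch id = DataTypeSketch::Double;
};

// Reads a scalar from an untyped buffer. The tag decides how many bytes are
// read, so it has to match the type the buffer was actually written with.
double getValueSketch(DataTypeSketch type, const void* p)
{
    switch (type)
    {
    case DataTypeSketch::Float:
    {
        float v;
        std::memcpy(&v, p, sizeof(v));
        return v;
    }
    case DataTypeSketch::Double:
    {
        double v;
        std::memcpy(&v, p, sizeof(v));
        return v;
    }
    }
    assert(false && "unknown data type");
    return 0.0;
}
```

If a buffer holding a `double` were read with the `Float` tag, only four of its eight bytes would be interpreted, which is the byte-width mismatch the commit avoids by typing the buffers as `AccumulateT`.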
Alex-Vasile added a commit that referenced this pull request Apr 21, 2026
…#3, #5, stacked on #1, #2, #5]

Generalize the fast path to support double-precision accumulation:

- Template ShadowBuffer<AccumT>: storage, pointer, and element access
  are all AccumT. Float/Double inputs zero-copy when AccumT matches;
  sub-float types go through float then widen.
- Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType>
  (renamed from loadToFloat/storeFromFloat).
- Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>:
  all tile registers, inner reduction, epilogue, alpha/beta extraction,
  bias reading, and activation args use AccumT.
- Add Double to isFastPathEligible's supported input/output types.
- SolveGemmCPU dispatch: route Double to solveCPUFast<double>.
- Add 10 f64 fast-path tests (transpose combos, Beta, Bias,
  AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 29, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alex-Vasile added a commit that referenced this pull request Apr 30, 2026
Records the five approaches considered for making register comparison
robust to symbolic versus numeric naming:
  1. Name-resolution table (brittle, complex)
  2. Symbolic-only normalization (does not solve the actual problem)
  3. Numeric-only resolution (assembly-time dependency)
  4. Equivalence-class comparison (loses precision)
  5. Render-string identity (matches GPU view, robust)

Also records the rationale for picking approach 5.

Documents the known limitation: the same logical register still compares
as different when the two captures use different identifiers for it. This
does not arise in practice because both captures consume the same writer
state; if it is ever needed, future work would add approach 1
(name-resolution table) on top.
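
A rough sketch of what comparing registers by render-string identity could look like, together with the limitation noted above. The `RegisterSketch` struct, the `v[...]` and `vN` render formats, and the register name used in the usage note are all assumptions for illustration; the commit does not show the actual representation.

```cpp
#include <cassert>
#include <string>

// Hypothetical register operand: either a symbolic name or a numeric index.
struct RegisterSketch
{
    bool        symbolic;
    std::string name;  // used when symbolic
    int         index; // used when !symbolic
};

// Assumed render format, for illustration only.
std::string render(const RegisterSketch& r)
{
    return r.symbolic ? "v[" + r.name + "]" : "v" + std::to_string(r.index);
}

// Two operands are treated as the same register iff they render identically.
bool sameByRenderString(const RegisterSketch& a, const RegisterSketch& b)
{
    return render(a) == render(b);
}
```

Two symbolic operands named `vgprExample` compare equal regardless of any stored index, while a symbolic `vgprExample` and a numeric `v7` compare as different even if they refer to the same physical register, which is the limitation the message records.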
Labels

github actions migration Tasks or issues tied to migration to this monorepo

2 participants