Initial workflows#1
Merged
Pull Request Overview
This PR introduces initial workflows and supporting scripts to manage the ROCm monorepo, including setups for subtree synchronization, CODEOWNERS merging, and GitHub label management.
- Updated README.md with a quick start guide for developers
- Added multiple GitHub Actions workflows for monorepo setup, label collection/application, and subtree synchronization
- Introduced Python scripts to merge CODEOWNERS and manage labels across repositories
Reviewed Changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| README.md | Updated guide with details on sparse checkout and repo history |
| .github/workflows/update-subtrees.yml | New workflow to synchronize subtrees for updating repositories |
| .github/workflows/merge-codeowners.yml | Workflow to merge CODEOWNERS files from sub-repos |
| .github/workflows/initial-setup.yml | Workflow to initialize the monorepo by adding repositories via subtree |
| .github/workflows/collect-labels.yml | Workflow for collecting labels from source repos |
| .github/workflows/apply-labels.yml | Workflow for applying and updating labels in the monorepo |
| .github/scripts/merge-codeowners.py | Python script to merge CODEOWNERS entries across subdirectories |
| .github/scripts/collect-labels.py | Python script to fetch and aggregate labels from GitHub repos |
| .github/scripts/apply-labels.py | Python script to create or update labels based on the generated YAML |
Files not reviewed (1)
- .github/repos-config.json: Language not supported
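As a rough illustration of what "merging CODEOWNERS entries across subdirectories" can involve, here is a minimal sketch. It is not the code in `.github/scripts/merge-codeowners.py`; the function name `merge_codeowners`, the input shape (a dict from subtree path to CODEOWNERS text), and the rule of simply prefixing each pattern with its subtree path are all assumptions made for the example. Real CODEOWNERS patterns have subtler matching rules (an unanchored `*.py` matches at any depth, for instance), which this sketch does not try to preserve.

```python
def merge_codeowners(sub_files):
    """Combine per-subtree CODEOWNERS texts into one monorepo-level text.

    sub_files maps a subtree path (hypothetical layout, e.g. "projects/miopen")
    to the raw CODEOWNERS content found in that subtree.
    """
    merged = []
    for prefix in sorted(sub_files):
        for raw in sub_files[prefix].splitlines():
            line = raw.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            pattern, *owners = line.split()
            # Re-root the pattern under the subtree directory.
            rooted = "/" + prefix.strip("/") + "/" + pattern.lstrip("/")
            merged.append(rooted + " " + " ".join(owners))
    return "\n".join(merged) + "\n"


if __name__ == "__main__":
    example = {"projects/miopen": "# owners\n* @team-a\n/docs/ @docs-team\n"}
    print(merge_codeowners(example), end="")
```

Sorting the subtree paths keeps the merged file stable between runs, which makes diffs in an automated workflow easier to review.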
assistant-librarian Bot
pushed a commit
that referenced
this pull request
May 13, 2025
* finished sort keys no size tests
* completed sort pair no size functions
* completed sort pair with size functions
* updated changelog
assistant-librarian Bot
pushed a commit
that referenced
this pull request
May 28, 2025
* Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx (#611)

  Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1.
  - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
  - [Commits](pyca/cryptography@43.0.1...44.0.1)

  ---
  updated-dependencies:
  - dependency-name: cryptography
    dependency-type: indirect
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* added test_discrete_distribution file
* finished discrete_alias tests
* temp commit
* merged in cmake fix
* started updating functions
* added cdf tests (not done need to modify)
* reformated tests
* finalized discrete_distribution tests
* updated changelog

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
idass1990
pushed a commit
that referenced
this pull request
Jun 13, 2025
* Add cmake files [ROCm/hipBLAS-common commit: 9666c57]
ammallya
pushed a commit
that referenced
this pull request
Jun 25, 2025
This was referenced Jun 30, 2025
ammallya
pushed a commit
that referenced
this pull request
Jul 24, 2025
[ROCm/mxDataGenerator commit: e74b2c1]
eidenyoshida
pushed a commit
that referenced
this pull request
Jul 25, 2025
ammallya
pushed a commit
that referenced
this pull request
Sep 24, 2025
…coverage (#1)
* test if I can push directly to develop anymore
* add boost dependency in the backend dll
* setup lfs, bring over attribute types from miopen for now
* add a make target called check which will discover all tests via cmake and then run them using ctest
* enable code coverage via cmake and a make target of code_coverage
* make ignore regex use deps for path
ammallya
pushed a commit
that referenced
this pull request
Sep 24, 2025
…coverage (#1)
* test if I can push directly to develop anymore
* add boost dependency in the backend dll
* setup lfs, bring over attribute types from miopen for now
* add a make target called check which will discover all tests via cmake and then run them using ctest
* enable code coverage via cmake and a make target of code_coverage
* make ignore regex use deps for path

[ROCm/hipDNN commit: d2f26e6]
JoeLiuAMD
added a commit
that referenced
this pull request
Oct 8, 2025
…nel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

> **📝 Note**: This PR has a follow-up PR #1993

## Motivation

MIOpen's generic search algorithm suffers from a **race condition** that causes optimal kernels to be randomly rejected, leading to 3-4x performance degradation in some cases.

### Problem Description

When running the same convolution workload multiple times, as in the sample below:

```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**
- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per operation (**3.35x slower**)

[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)
[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines 559-564):

```cpp
// Original buggy code
if(n_current == 0)  // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first configuration tested:
- **First kernel** (n_current == 0): Gets warm-up → Fair performance measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:
Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
Kernel #1 (64,64,64,32,8,8...): 1 sample → 1.219 ms → rejected (cold-start!)
Kernel #2 (64,64,16,32,8,8...): 1 sample → 3.0267 ms → rejected (cold-start!)
Kernel #3 (64,16,64,32,8,8...): 1 sample → 0.482 ms → rejected (cold-start!)

Final execution: 0.332 ms (using Kernel #0)

Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):
- `DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
- Sample 1: **3.027 ms** ← Cold start! (30x slower than true performance)
- Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359, 0.347, 0.352 ms
- **True mean**: 0.354 ms (excluding cold-start outlier)
- **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
- **Wrong outcome**: Best kernel discarded due to unfair cold-start penalty

---

## Technical Details

### Changes

This PR contains **only the bug fix** - removing the unfair warm-up condition:

```diff
-    // Warm-up run for first time invoker is used
-    if(n_current == 0)
-    {
-        invoker(profile_h, invoke_ctx);
-        profile_h.ResetKernelTime();
-    }
+    // Warm-up run for every configuration to eliminate cold-start bias
+    invoker(profile_h, invoke_ctx);
+    profile_h.ResetKernelTime();
```

**File modified:** `projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564)

**Change summary:**
- 3 insertions(+), 6 deletions(-)
- Removes `if(n_current == 0)` condition
- Ensures every configuration receives one warm-up run before measurement

### Why This is Low Risk

1. **Minimal code change**: Only 4 lines changed
2. **No algorithm change**: Same sampling strategy, same early-stop logic
3. **Only ensures fairness**: All configs now receive identical warm-up treatment
4. **No performance regression**: Adds one extra kernel call per config (~0.3ms overhead per config)
5. **Negligible overhead**: For typical 4-config search, adds 1.2ms total (kernel compilation takes 10-30 seconds, so overhead is <0.01%)

---

## Test Plan

### Test Environment

- **Hardware**: MI355X (gfx950)
- **ROCm Version**: 7.0.2 (HIP 7.0.51831)
- **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel configurations)

### Test Command

```bash
export MIOPEN_LOG_LEVEL=5
export MIOPEN_FIND_MODE=1
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

### Test Results

#### Before Fix (with bug):

- **Success rate**: 6/10 runs selected optimal kernel (40% error rate)
- **Failure pattern**: Optimal kernel rejected when its cold-start time triggered early-stop
- **Performance impact**: Up to 4x slower when wrong kernel selected (0.332ms vs 0.099ms)

#### After Fix:

- **Success rate**: 10/10 runs selected optimal kernel (0% error rate)
- **Consistency**: All configurations receive fair warm-up
- **Performance**: Optimal kernel always selected, no degradation
- **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation time)

---

## Performance Impact

### Search Time Overhead

- **Additional cost**: 1 warm-up run per configuration (only for configs beyond the first)

### Accuracy Improvement

- **Before**: 60% success rate (6/10 runs correct)
- **After**: 100% success rate (10/10 runs correct)
- **Performance gain**: Eliminates 4x slowdown from selecting wrong kernel

---

## Backward Compatibility

✅ **Fully compatible**
- No API changes
- No behavior changes except for fixing the bug
- All existing tests pass
- No impact on already-cached kernels (find database not affected)

---

## Why Split into Two PRs?

Following reviewer feedback, this work has been split into two separate PRs:

### **PR1 (This PR) - Bug Fix: Warm-up Bias**

- **Risk**: Low (4 lines changed)
- **Impact**: Fixes root cause of unfair kernel comparison
- **Decision**: Ready for immediate merge
- **Rationale**: Without fair warm-up, no amount of threshold tuning can fix the problem. Cold-start penalties (30-100x slower) make any single threshold value inadequate.

### **PR2 (Separate PR) - Optimization: Early-Stop Strategy**

- **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993
- **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging
- **Risk**: Medium (affects benchmark timing)
- **Decision**: Needs more validation and benchmarking
- **Rationale**: These optimizations improve accuracy further (40% → 0% error rate) but add ~2 kernel executions per config. The performance impact needs separate evaluation.

### Why This Approach?

- **PR1 can merge immediately**: Fixes the critical bug with minimal risk
- **PR2 can be validated thoroughly**: Performance trade-offs can be evaluated independently
- **Easier to isolate regressions**: If issues arise, we know which change caused them
- **Progressive improvement**: Get the bug fix deployed while optimizations are being validated

---

## Submission Checklist

- [x] I have read and agreed with the [contributing guidelines](CONTRIBUTING.md)
- [x] The changes are minimal and focused on the bug fix
- [x] All existing tests pass
- [x] The fix has been verified on target hardware (MI355X/gfx950)
- [x] The fix eliminates the non-deterministic kernel selection issue
- [x] No performance regression introduced
- [x] Documentation (commit message) clearly explains the problem and solution
- [x] Backward compatibility confirmed
- [x] Test data and logs provided for verification
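The interaction between a first-config-only warm-up and an early-stop check can be reproduced in a toy model. The sketch below is not MIOpen code: `FakeKernel`, `search`, the timing values, and the way the penalty is applied are all hypothetical, chosen only to mirror the description above. The 1.1 early-stop factor is borrowed from the "3.027 > 0.377 × 1.1" line in the quoted log.

```python
from dataclasses import dataclass


@dataclass
class FakeKernel:
    """Toy stand-in for a kernel configuration (all numbers are made up)."""
    name: str
    steady_ms: float        # steady-state time per invocation
    cold_penalty_ms: float  # extra time charged to the very first invocation
    calls: int = 0

    def run(self) -> float:
        self.calls += 1
        penalty = self.cold_penalty_ms if self.calls == 1 else 0.0
        return self.steady_ms + penalty


def search(kernels, warm_up_all, samples=10, early_stop_factor=1.1):
    """Pick the kernel with the lowest mean time, with a first-sample early stop."""
    best_name, best_ms = None, float("inf")
    for index, kernel in enumerate(kernels):
        # Old behaviour: warm up only the first config. New behaviour: all.
        if warm_up_all or index == 0:
            kernel.run()  # warm-up run, timing discarded
        first = kernel.run()
        if first > best_ms * early_stop_factor:
            continue  # rejected after a single sample
        timings = [first] + [kernel.run() for _ in range(samples - 1)]
        mean = sum(timings) / len(timings)
        if mean < best_ms:
            best_name, best_ms = kernel.name, mean
    return best_name, best_ms


def make_kernels():
    # config_b is faster in steady state but has a large cold-start penalty.
    return [FakeKernel("config_a", 0.34, 1.0), FakeKernel("config_b", 0.10, 2.9)]


if __name__ == "__main__":
    print("warm-up first only:", search(make_kernels(), warm_up_all=False))
    print("warm-up all       :", search(make_kernels(), warm_up_all=True))
```

In this model the faster configuration is rejected on its first (cold) sample when only the first configuration is warmed up, and it is selected once every configuration gets a warm-up run.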
ibrahimw1
added a commit
that referenced
this pull request
Nov 21, 2025
nakajee
pushed a commit
to nakajee/rocm-libraries
that referenced
this pull request
Mar 31, 2026
* Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com>
nakajee
pushed a commit
to nakajee/rocm-libraries
that referenced
this pull request
Mar 31, 2026
…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (ROCm#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (ROCm#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. 
Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (ROCm#3) * gr emit fix * Emit LR + init ACCVGPR (ROCm#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
sebvince
referenced
this pull request
in sebvince/rocm-libraries
Apr 3, 2026
* Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com>
sebvince
referenced
this pull request
in sebvince/rocm-libraries
Apr 3, 2026
…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. 
Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (#3) * gr emit fix * Emit LR + init ACCVGPR (#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Apr 6, 2026
newling
pushed a commit
to newling/rocm-libraries
that referenced
this pull request
Apr 9, 2026
* Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com>
newling
pushed a commit
to newling/rocm-libraries
that referenced
this pull request
Apr 9, 2026
…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (ROCm#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (ROCm#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. 
Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (ROCm#3) * gr emit fix * Emit LR + init ACCVGPR (ROCm#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
vidyasagar-amd
pushed a commit
that referenced
this pull request
Apr 9, 2026
)

## Motivation

Fix a bug in the smart-build --ctest-only filter that was incorrectly excluding tests with numbers less than 100.

## Technical Details

The issue was caused by CTest formatting test numbers with variable spacing based on the number of digits:
- `Test   #1: name` (3 spaces for tests 1-9)
- `Test  #79: name` (2 spaces for tests 10-99)
- `Test #100: name` (1 space for tests 100+)

The previous code used `line.strip().startswith("Test #")` which only matched tests with a single space (i.e., test numbers >= 100). This caused tests like ck_tile_unit_sequence (Test #79) to be excluded from smart-build test selection, resulting in CTest failures when the binary wasn't built.

Solution: Replace string matching with a regex pattern that handles all spacing variations: `r'^\s*Test\s+#\d+:\s*(.+)$'`

## Test Plan

Tested with test numbers from 1 to 12345.

## Test Result

- Before: 48 tests selected (only tests #100+)
- After: 146 tests selected (all CTest-registered tests)

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
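The difference between the two matching strategies is easy to see on a few sample lines. The sketch below uses the regex and the `startswith("Test #")` check quoted in the commit message; the function names, the name extraction in the old-style variant, and the sample listing are assumptions made for illustration, with spacing that follows the padding described above.

```python
import re

# Pattern quoted in the commit message: tolerates any run of spaces before '#'.
CTEST_LINE = re.compile(r'^\s*Test\s+#\d+:\s*(.+)$')


def names_with_regex(listing):
    """Extract test names using the regex-based match."""
    names = []
    for line in listing.splitlines():
        match = CTEST_LINE.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


def names_with_startswith(listing):
    """Extract test names using the earlier single-space prefix check."""
    names = []
    for line in listing.splitlines():
        if line.strip().startswith("Test #"):
            names.append(line.split(":", 1)[1].strip())
    return names


# Hypothetical listing; test numbers are right-aligned, so spacing varies.
SAMPLE = "\n".join([
    "  Test   #1: first_test",
    "  Test  #79: ck_tile_unit_sequence",
    "  Test #100: hundredth_test",
    "",
    "Total Tests: 3",
])

if __name__ == "__main__":
    print(names_with_regex(SAMPLE))
    print(names_with_startswith(SAMPLE))
```

On this sample the prefix check keeps only the three-digit test, while the regex keeps all three and still ignores the summary line.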
Alex-Vasile
added a commit
that referenced
this pull request
Apr 10, 2026
…sibling] Pure refactor: generalize columnMajorGemm from float-only to template<AccumT, MathOpAccumT> with defaults <float, float>. When MathOpAccumT != AccumT, each A/B operand is cast through MathOpAccumT before multiply (e.g. XFloat32 truncates mantissa). With default args, casts are no-ops and codegen is identical. No behavior change — existing tests validate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 10, 2026
Add --type tf32 CLI option: uses float storage with XFloat32 math-op truncation (10-bit mantissa). The slow reference path already handles f32XdlMathOp == XFloat32 via ReferenceSolution<TypedGemm_S_S_S, float, XFloat32>. Changes: - Add isTF32 parameter to runGemm, sets f32XdlMathOp on contraction - Golden reference uses columnMajorGemm<float, XFloat32> for TF32 - TF32 validation tolerance set to 1.0f (13 mantissa bits lost) - Console output shows MathOp=XFloat32 for TF32 runs - 10 new slow-path tf32 tests (transpose combos, beta, bias, features, scaleAB) Fast path still rejects XFloat32 — fast-path tf32 tests come next. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
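To make the "cast each operand through the math-op type before multiplying" idea concrete, here is a small pure-Python sketch. It is not the C++ `columnMajorGemm` from the commit: `truncate_to_tf32` and `column_major_gemm` are hypothetical names, and modelling XFloat32 as clearing the low 13 of float32's 23 mantissa bits is an assumption based on the "10-bit mantissa" and "13 mantissa bits lost" wording above. Accumulation here happens in Python floats rather than in a templated accumulator type.

```python
import struct


def truncate_to_tf32(value):
    """Round-trip through float32, then zero the low 13 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def column_major_gemm(a, b, m, n, k, math_op=lambda x: x):
    """C = A * B for column-major A (m x k) and B (k x n), stored as flat lists.

    math_op is applied to every A and B operand before the multiply; the
    default is the identity, so the default behaviour is a plain GEMM.
    """
    c = [0.0] * (m * n)
    for col in range(n):
        for row in range(m):
            acc = 0.0
            for p in range(k):
                acc += math_op(a[row + p * m]) * math_op(b[p + col * k])
            c[row + col * m] = acc
    return c


if __name__ == "__main__":
    a = [1.0 + 2.0 ** -11, 2.0, 3.0, 4.0]  # 2x2, column-major
    identity = [1.0, 0.0, 0.0, 1.0]
    print(column_major_gemm(a, identity, 2, 2, 2))
    print(column_major_gemm(a, identity, 2, 2, 2, math_op=truncate_to_tf32))
```

With the identity math-op the operands pass through unchanged, which corresponds to the "casts are no-ops" case in the commit message; with the truncating math-op the `2 ** -11` term in the first element is dropped.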
Alex-Vasile
added a commit
that referenced
this pull request
Apr 10, 2026
#8, stacked on #1, #3] Template solveCPUFastInF32 on MathOpAccumT so the inner reduction applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU dispatch to branch on f32XdlMathOp. When MathOpAccumT=float (default), casts are no-ops with identical codegen to the previous implementation. Changes: - solveCPUFastInF32 gains template<MathOpAccumT=float> - innerReduction uses float(MathOpAccumT(val)) cast chain - isFastPathEligible no longer rejects XFloat32 - SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32 - 10 new fast-path tf32 tests (20 total tf32 tests now) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 10, 2026
…, #12, stacked on #1, #2] Add double-precision GEMM support to the CPU GEMM driver: - Add TypeTraits<double> mapping to DataType::Double - Add --type f64 branch calling runGemm<double, double> - Type all auxiliary buffers (bias, scaleAlphaVec, scaleAB) as AccumulateT so the slow path's GetValue(alphaType, ...) reads the correct byte width when alphaType=Double - Set alphaType/betaType to accumDtypeEnum so TypeId dispatch matches TypedGemm_D_D_D for the slow reference solver - Register bias/scale tensors with accumDtypeEnum - Add 10 f64 slow-path tests: 4 transpose combos, Beta, Bias, AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 10, 2026
…#3, #5, stacked on #1, #2, #5] Generalize the fast path to support double-precision accumulation: - Template ShadowBuffer<AccumT>: storage, pointer, and element access are all AccumT. Float/Double inputs zero-copy when AccumT matches; sub-float types go through float then widen. - Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType> (renamed from loadToFloat/storeFromFloat). - Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>: all tile registers, inner reduction, epilogue, alpha/beta extraction, bias reading, and activation args use AccumT. - Add Double to isFastPathEligible's supported input/output types. - SolveGemmCPU dispatch: route Double to solveCPUFast<double>. - Add 10 f64 fast-path tests (transpose combos, Beta, Bias, AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 21, 2026
…sibling] Pure refactor: generalize columnMajorGemm from float-only to template<AccumT, MathOpAccumT> with defaults <float, float>. When MathOpAccumT != AccumT, each A/B operand is cast through MathOpAccumT before multiply (e.g. XFloat32 truncates mantissa). With default args, casts are no-ops and codegen is identical. No behavior change — existing tests validate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 21, 2026
Add --type tf32 CLI option: uses float storage with XFloat32 math-op truncation (10-bit mantissa). The slow reference path already handles f32XdlMathOp == XFloat32 via ReferenceSolution<TypedGemm_S_S_S, float, XFloat32>. Changes: - Add isTF32 parameter to runGemm, sets f32XdlMathOp on contraction - Golden reference uses columnMajorGemm<float, XFloat32> for TF32 - TF32 validation tolerance set to 1.0f (13 mantissa bits lost) - Console output shows MathOp=XFloat32 for TF32 runs - 10 new slow-path tf32 tests (transpose combos, beta, bias, features, scaleAB) Fast path still rejects XFloat32 — fast-path tf32 tests come next. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 21, 2026
#8, stacked on #1, #3] Template solveCPUFastInF32 on MathOpAccumT so the inner reduction applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU dispatch to branch on f32XdlMathOp. When MathOpAccumT=float (default), casts are no-ops with identical codegen to the previous implementation. Changes: - solveCPUFastInF32 gains template<MathOpAccumT=float> - innerReduction uses float(MathOpAccumT(val)) cast chain - isFastPathEligible no longer rejects XFloat32 - SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32 - 10 new fast-path tf32 tests (20 total tf32 tests now) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 21, 2026
…, #12, stacked on #1, #2] Add double-precision GEMM support to the CPU GEMM driver: - Add TypeTraits<double> mapping to DataType::Double - Add --type f64 branch calling runGemm<double, double> - Type all auxiliary buffers (bias, scaleAlphaVec, scaleAB) as AccumulateT so the slow path's GetValue(alphaType, ...) reads the correct byte width when alphaType=Double - Set alphaType/betaType to accumDtypeEnum so TypeId dispatch matches TypedGemm_D_D_D for the slow reference solver - Register bias/scale tensors with accumDtypeEnum - Add 10 f64 slow-path tests: 4 transpose combos, Beta, Bias, AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 21, 2026
…#3, #5, stacked on #1, #2, #5] Generalize the fast path to support double-precision accumulation: - Template ShadowBuffer<AccumT>: storage, pointer, and element access are all AccumT. Float/Double inputs zero-copy when AccumT matches; sub-float types go through float then widen. - Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType> (renamed from loadToFloat/storeFromFloat). - Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>: all tile registers, inner reduction, epilogue, alpha/beta extraction, bias reading, and activation args use AccumT. - Add Double to isFastPathEligible's supported input/output types. - SolveGemmCPU dispatch: route Double to solveCPUFast<double>. - Add 10 f64 fast-path tests (transpose combos, Beta, Bias, AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 29, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alex-Vasile
added a commit
that referenced
this pull request
Apr 30, 2026
Records the 5 approaches considered for symbolic-vs-numeric register robustness:
1. Name-resolution table (brittle, complex)
2. Symbolic-only normalization (doesn't solve actual problem)
3. Numeric-only resolution (assembly-time dependency)
4. Equivalence-class comparison (loses precision)
5. Render-string identity (matches GPU view, robust)

And the rationale for picking #5. Documents the known limitation: same logical reg with different identifiers across captures still differs. Doesn't arise in practice because both captures consume the same writer state; future work if needed would add approach #1 (name-resolution table) on top.
No description provided.