Initial workflows by jayhawk-commits · Pull Request #1 · ROCm/rocm-libraries

jayhawk-commits · 2025-04-23T18:53:09Z

No description provided.

Copilot

Pull Request Overview

This PR introduces initial workflows and supporting scripts to manage the ROCm monorepo, including setups for subtree synchronization, CODEOWNERS merging, and GitHub label management.

- Updated README.md with a quick start guide for developers
- Added multiple GitHub Actions workflows for monorepo setup, label collection/application, and subtree synchronization
- Introduced Python scripts to merge CODEOWNERS and manage labels across repositories

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| README.md | Updated guide with details on sparse checkout and repo history |
| .github/workflows/update-subtrees.yml | New workflow to synchronize subtrees for updating repositories |
| .github/workflows/merge-codeowners.yml | Workflow to merge CODEOWNERS files from sub-repos |
| .github/workflows/initial-setup.yml | Workflow to initialize the monorepo by adding repositories via subtree |
| .github/workflows/collect-labels.yml | Workflow for collecting labels from source repos |
| .github/workflows/apply-labels.yml | Workflow for applying and updating labels in the monorepo |
| .github/scripts/merge-codeowners.py | Python script to merge CODEOWNERS entries across subdirectories |
| .github/scripts/collect-labels.py | Python script to fetch and aggregate labels from GitHub repos |
| .github/scripts/apply-labels.py | Python script to create or update labels based on the generated YAML |

Files not reviewed (1)

.github/repos-config.json: Language not supported

* finished sort keys no size tests * completed sort pair no size functions * completed sort pair with size functions * updated changelog

* Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx (#611) Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](pyca/cryptography@43.0.1...44.0.1) --- updated-dependencies: - dependency-name: cryptography dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * added test_discrete_distribution file * finished discrete_alias tests * temp commit * merged in cmake fix * started updating functions * added cdf tests (not done need to modify) * reformated tests * finalized discrete_distribution tests * updated changelog --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add cmake files [ROCm/hipBLAS-common commit: 9666c57]

[ROCm/mxDataGenerator commit: e74b2c1] [ROCm/mxDataGenerator commit: 8efb356]

…coverage (#1) * test if I can push directly to develop anymore * add boost dependency in the backend dll * setup lfs, bring over attribute types from miopen for now * add a make target called check which will discover all tests via cmake and then run them using ctest * enable code coverage via cmake and a make target of code_coverage * make ignore regex use deps for path [ROCm/hipDNN commit: d2f26e6]

…nel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

> **📝 Note**: This PR has follow up PR #1993

## Motivation

MIOpen's generic search algorithm suffers from a **race condition** that causes optimal kernels to be randomly rejected, leading to 3-4x performance degradation in some cases.

### Problem Description

When running the same convolution workload multiple times as sample below:

```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
    -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
    -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**

- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per operation (**3.35x slower**)

[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)
[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines 559-564):

```cpp
// Original buggy code
if(n_current == 0) // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first configuration tested:

- **First kernel** (n_current == 0): Gets warm-up → Fair performance measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:
Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
Kernel #1 (64,64,64,32,8,8...): 1 sample → 1.219 ms → rejected (cold-start!)
Kernel #2 (64,64,16,32,8,8...): 1 sample → 3.0267 ms → rejected (cold-start!)
Kernel #3 (64,16,64,32,8,8...): 1 sample → 0.482 ms → rejected (cold-start!)
Final execution: 0.332 ms (using Kernel #0)
Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):

- `DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
  - Sample 1: **3.027 ms** ← Cold start! (30x slower than true performance)
  - Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359, 0.347, 0.352 ms
  - **True mean**: 0.354 ms (excluding cold-start outlier)
  - **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
  - **Wrong outcome**: Best kernel discarded due to unfair cold-start penalty

---

## Technical Details

### Changes

This PR contains **only the bug fix** - removing the unfair warm-up condition:

```diff
- // Warm-up run for first time invoker is used
- if(n_current == 0)
- {
-     invoker(profile_h, invoke_ctx);
-     profile_h.ResetKernelTime();
- }
+ // Warm-up run for every configuration to eliminate cold-start bias
+ invoker(profile_h, invoke_ctx);
+ profile_h.ResetKernelTime();
```

**File modified:** `projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564)

**Change summary:**

- 3 insertions(+), 6 deletions(-)
- Removes `if(n_current == 0)` condition
- Ensures every configuration receives one warm-up run before measurement

### Why This is Low Risk

1. **Minimal code change**: Only 4 lines changed
2. **No algorithm change**: Same sampling strategy, same early-stop logic
3. **Only ensures fairness**: All configs now receive identical warm-up treatment
4. **No performance regression**: Adds one extra kernel call per config (~0.3ms overhead per config)
5. **Negligible overhead**: For typical 4-config search, adds 1.2ms total (kernel compilation takes 10-30 seconds, so overhead is <0.01%)

---

## Test Plan

### Test Environment

- **Hardware**: MI355X (gfx950)
- **ROCm Version**: 7.0.2 (HIP 7.0.51831)
- **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel configurations)

### Test Command

```bash
export MIOPEN_LOG_LEVEL=5
export MIOPEN_FIND_MODE=1
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
    -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
    --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
    -m conv -g 1 -t 1 -F 2
```

### Test Results

#### Before Fix (with bug):

- **Success rate**: 6/10 runs selected optimal kernel (40% error rate)
- **Failure pattern**: Optimal kernel rejected when its cold-start time triggered early-stop
- **Performance impact**: Up to 4x slower when wrong kernel selected (0.332ms vs 0.099ms)

#### After Fix:

- **Success rate**: 10/10 runs selected optimal kernel (0% error rate)
- **Consistency**: All configurations receive fair warm-up
- **Performance**: Optimal kernel always selected, no degradation
- **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation time)

---

## Performance Impact

### Search Time Overhead

- **Additional cost**: 1 warm-up run per configuration (only for configs beyond the first)

### Accuracy Improvement

- **Before**: 60% success rate (6/10 runs correct)
- **After**: 100% success rate (10/10 runs correct)
- **Performance gain**: Eliminates 4x slowdown from selecting wrong kernel

---

## Backward Compatibility

✅ **Fully compatible**

- No API changes
- No behavior changes except for fixing the bug
- All existing tests pass
- No impact on already-cached kernels (find database not affected)

---

## Why Split into Two PRs?

Following reviewer feedback, this work has been split into two separate PRs:

### **PR1 (This PR) - Bug Fix: Warm-up Bias**

- **Risk**: Low (4 lines changed)
- **Impact**: Fixes root cause of unfair kernel comparison
- **Decision**: Ready for immediate merge
- **Rationale**: Without fair warm-up, no amount of threshold tuning can fix the problem. Cold-start penalties (30-100x slower) make any single threshold value inadequate.

### **PR2 (Separate PR) - Optimization: Early-Stop Strategy**

- **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993
- **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging
- **Risk**: Medium (affects benchmark timing)
- **Decision**: Needs more validation and benchmarking
- **Rationale**: These optimizations improve accuracy further (40% → 0% error rate) but add ~2 kernel executions per config. The performance impact needs separate evaluation.

### Why This Approach?

- **PR1 can merge immediately**: Fixes the critical bug with minimal risk
- **PR2 can be validated thoroughly**: Performance trade-offs can be evaluated independently
- **Easier to isolate regressions**: If issues arise, we know which change caused them
- **Progressive improvement**: Get the bug fix deployed while optimizations are being validated

---

## Submission Checklist

- [x] I have read and agreed with the [contributing guidelines](CONTRIBUTING.md)
- [x] The changes are minimal and focused on the bug fix
- [x] All existing tests pass
- [x] The fix has been verified on target hardware (MI355X/gfx950)
- [x] The fix eliminates the non-deterministic kernel selection issue
- [x] No performance regression introduced
- [x] Documentation (commit message) clearly explains the problem and solution
- [x] Backward compatibility confirmed
- [x] Test data and logs provided for verification
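The cold-start effect described in this PR can be reproduced with a small simulation. The sketch below is not MIOpen code: `SimulatedKernel`, `pick_best`, and `make_kernels` are hypothetical names, the timings are made-up values loosely modelled on the numbers quoted above, and the early-stop rule is a simplified assumption (reject when the first timed sample exceeds 1.1x the current best).

```python
class SimulatedKernel:
    """Hypothetical kernel config whose first invocation pays a cold-start penalty."""

    def __init__(self, name, steady_ms, cold_ms):
        self.name = name
        self.steady_ms = steady_ms
        self.cold_ms = cold_ms
        self.calls = 0

    def run(self):
        # The first call returns the cold-start time, later calls the steady time.
        self.calls += 1
        return self.cold_ms if self.calls == 1 else self.steady_ms


def pick_best(kernels, warm_up_all, samples=3):
    """Return the name of the fastest kernel under a simplified early-stop rule."""
    best_name, best_ms = None, float("inf")
    for index, kernel in enumerate(kernels):
        # Old behaviour: warm up only the first config. Fixed behaviour: all configs.
        if warm_up_all or index == 0:
            kernel.run()  # warm-up run, timing discarded
        first = kernel.run()
        if first > best_ms * 1.1:
            continue  # early-stop: rejected on the first timed sample
        timings = [first] + [kernel.run() for _ in range(samples - 1)]
        mean = sum(timings) / len(timings)
        if mean < best_ms:
            best_name, best_ms = kernel.name, mean
    return best_name


def make_kernels():
    # k1 is faster in steady state but has a large cold-start time.
    return [SimulatedKernel("k0", 0.34, 1.0), SimulatedKernel("k1", 0.10, 3.0)]


print(pick_best(make_kernels(), warm_up_all=False))  # first-config-only warm-up
print(pick_best(make_kernels(), warm_up_all=True))   # warm-up for every config
```

With warm-up limited to the first config, the faster kernel's cold first sample trips the early-stop check and the slower kernel wins. Warming up every config removes that bias in this toy model.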

* Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com>

…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (ROCm#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (ROCm#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. 
Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (ROCm#3) * gr emit fix * Emit LR + init ACCVGPR (ROCm#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

)

## Motivation

Fix a bug in the smart-build `--ctest-only` filter that was incorrectly excluding tests with numbers less than 100.

## Technical Details

The issue was caused by CTest formatting test numbers with variable spacing based on the number of digits:

- `Test   #1: name` (3 spaces for tests 1-9)
- `Test  #79: name` (2 spaces for tests 10-99)
- `Test #100: name` (1 space for tests 100+)

The previous code used `line.strip().startswith("Test #")`, which only matched tests with a single space (i.e., test numbers >= 100). This caused tests like ck_tile_unit_sequence (Test #79) to be excluded from smart-build test selection, resulting in CTest failures when the binary wasn't built.

Solution: Replace string matching with a regex pattern that handles all spacing variations: `r'^\s*Test\s+#\d+:\s*(.+)$'`

## Test Plan

Tested with test numbers from 1 to 12345.

## Test Result

- Before: 48 tests selected (only tests #100+)
- After: 146 tests selected (all CTest-registered tests)

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
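The difference between the old prefix check and the regex can be seen on a small listing. This is a minimal sketch, not the smart-build script itself: `parse_test_names`, `parse_test_names_old`, and the `SAMPLE` listing are hypothetical, and only the regex and the `startswith("Test #")` check come from the description above.

```python
import re

# Pattern quoted in the PR description: tolerates any spacing before '#'.
TEST_LINE = re.compile(r'^\s*Test\s+#\d+:\s*(.+)$')


def parse_test_names(listing):
    """Extract test names from CTest-style listing lines using the regex."""
    names = []
    for line in listing.splitlines():
        match = TEST_LINE.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


def parse_test_names_old(listing):
    """Previous behaviour: only lines with exactly one space before '#' match."""
    return [
        line.split(":", 1)[1].strip()
        for line in listing.splitlines()
        if line.strip().startswith("Test #")
    ]


# Hypothetical listing; only ck_tile_unit_sequence is a name taken from the PR.
SAMPLE = """Test project /build
  Test   #1: first_test
  Test  #79: ck_tile_unit_sequence
  Test #100: hundredth_test

Total Tests: 3
"""

print(parse_test_names_old(SAMPLE))
print(parse_test_names(SAMPLE))
```

The old check keeps only the three-digit entry, while the regex keeps all three test lines and still skips the header and the total line.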

…sibling] Pure refactor: generalize columnMajorGemm from float-only to template<AccumT, MathOpAccumT> with defaults <float, float>. When MathOpAccumT != AccumT, each A/B operand is cast through MathOpAccumT before multiply (e.g. XFloat32 truncates mantissa). With default args, casts are no-ops and codegen is identical. No behavior change — existing tests validate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add --type tf32 CLI option: uses float storage with XFloat32 math-op truncation (10-bit mantissa). The slow reference path already handles f32XdlMathOp == XFloat32 via ReferenceSolution<TypedGemm_S_S_S, float, XFloat32>. Changes: - Add isTF32 parameter to runGemm, sets f32XdlMathOp on contraction - Golden reference uses columnMajorGemm<float, XFloat32> for TF32 - TF32 validation tolerance set to 1.0f (13 mantissa bits lost) - Console output shows MathOp=XFloat32 for TF32 runs - 10 new slow-path tf32 tests (transpose combos, beta, bias, features, scaleAB) Fast path still rejects XFloat32 — fast-path tf32 tests come next. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
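To illustrate what "float storage with a 10-bit mantissa math op" means numerically, here is a minimal sketch. It assumes plain truncation, i.e. zeroing the low 13 of float32's 23 mantissa bits; whether XFloat32 truncates or rounds in the real code is not confirmed beyond the wording above. `truncate_to_tf32` and `dot_tf32` are hypothetical helpers, and the sum is accumulated in Python's double precision rather than float32.

```python
import struct


def truncate_to_tf32(value):
    """Zero the low 13 mantissa bits of a float32, keeping a 10-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_tf32(a, b):
    """Inner product where each operand is truncated before the multiply."""
    return sum(truncate_to_tf32(x) * truncate_to_tf32(y) for x, y in zip(a, b))


print(truncate_to_tf32(1.0 + 2.0 ** -10))  # bit within the kept 10 bits survives
print(truncate_to_tf32(1.0 + 2.0 ** -11))  # bit below the cut is dropped
print(dot_tf32([1.0 + 2.0 ** -11, 2.0], [3.0, 4.0]))
```

Operands are reduced before each multiply while the accumulator stays at higher precision, mirroring the "cast through MathOpAccumT before multiply" description in the neighbouring commit message.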

#8, stacked on #1, #3] Template solveCPUFastInF32 on MathOpAccumT so the inner reduction applies XFloat32 truncation when MathOpAccumT=XFloat32. Remove the XFloat32 rejection guard from isFastPathEligible. Update SolveGemmCPU dispatch to branch on f32XdlMathOp. When MathOpAccumT=float (default), casts are no-ops with identical codegen to the previous implementation. Changes: - solveCPUFastInF32 gains template<MathOpAccumT=float> - innerReduction uses float(MathOpAccumT(val)) cast chain - isFastPathEligible no longer rejects XFloat32 - SolveGemmCPU dispatches solveCPUFastInF32<XFloat32> for TF32 - 10 new fast-path tf32 tests (20 total tf32 tests now) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…, #12, stacked on #1, #2] Add double-precision GEMM support to the CPU GEMM driver: - Add TypeTraits<double> mapping to DataType::Double - Add --type f64 branch calling runGemm<double, double> - Type all auxiliary buffers (bias, scaleAlphaVec, scaleAB) as AccumulateT so the slow path's GetValue(alphaType, ...) reads the correct byte width when alphaType=Double - Set alphaType/betaType to accumDtypeEnum so TypeId dispatch matches TypedGemm_D_D_D for the slow reference solver - Register bias/scale tensors with accumDtypeEnum - Add 10 f64 slow-path tests: 4 transpose combos, Beta, Bias, AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…#3, #5, stacked on #1, #2, #5] Generalize the fast path to support double-precision accumulation: - Template ShadowBuffer<AccumT>: storage, pointer, and element access are all AccumT. Float/Double inputs zero-copy when AccumT matches; sub-float types go through float then widen. - Template loadTo<AccumT, SrcType> and storeFrom<AccumT, DstType> (renamed from loadToFloat/storeFromFloat). - Rename solveCPUFastInF32 → solveCPUFast<AccumT, MathOpAccumT>: all tile registers, inner reduction, epilogue, alpha/beta extraction, bias reading, and activation args use AccumT. - Add Double to isFastPathEligible's supported input/output types. - SolveGemmCPU dispatch: route Double to solveCPUFast<double>. - Add 10 f64 fast-path tests (transpose combos, Beta, Bias, AllFeatures, TN_AllFeatures, ScaleAB Scalar/Vector). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Records the 5 approaches considered for symbolic-vs-numeric register robustness:

1. Name-resolution table (brittle, complex)
2. Symbolic-only normalization (doesn't solve actual problem)
3. Numeric-only resolution (assembly-time dependency)
4. Equivalence-class comparison (loses precision)
5. Render-string identity (matches GPU view, robust)

And the rationale for picking #5. Documents the known limitation: same logical reg with different identifiers across captures still differs. Doesn't arise in practice because both captures consume the same writer state; future work if needed would add approach #1 (name-resolution table) on top.

Initial workflows

145c189

jayhawk-commits requested a review from Copilot April 23, 2025 18:53

jayhawk-commits self-assigned this Apr 23, 2025

Copilot AI reviewed Apr 23, 2025

Comment thread .github/workflows/update-subtrees.yml Outdated

Remove false comment

9fac04f

jayhawk-commits merged commit 6c1a640 into develop Apr 23, 2025

jayhawk-commits deleted the initialWorkflows branch April 23, 2025 18:54

assistant-librarian Bot pushed a commit that referenced this pull request May 13, 2025

Zenguyen/block sort missing unit test (#1)

83ffa6e

* finished sort keys no size tests
* completed sort pair no size functions
* completed sort pair with size functions
* updated changelog

jayhawk-commits added the github actions and migration labels May 17, 2025

idass1990 pushed a commit that referenced this pull request Jun 13, 2025

Add cmake files (#1)

0b42ee6

* Add cmake files [ROCm/hipBLAS-common commit: 9666c57]

rahulc-gh pushed a commit that referenced this pull request Jun 25, 2025

Add license (#1)

e74b2c1

ammallya pushed a commit that referenced this pull request Jun 25, 2025

Add license (#1)

9141871

[ROCm/mxDataGenerator commit: e74b2c1] [ROCm/mxDataGenerator commit: 8efb356]

This was referenced Jun 30, 2025

Invalid instruction oops for gfx1151 on Fedora rawhide #458

Open

kernel oops for gfx1201 on Fedora Rawhide #459

Open

assistant-librarian Bot mentioned this pull request Jul 14, 2025

rocsparse-test fails to run on Fedora #627

Closed

assistant-librarian Bot mentioned this pull request Jul 23, 2025

hipblas-test double free error on Fedora. #807

Closed

ammallya pushed a commit that referenced this pull request Jul 24, 2025

Add license (#1)

8efb356

[ROCm/mxDataGenerator commit: e74b2c1]

eidenyoshida pushed a commit that referenced this pull request Jul 25, 2025

Updating CODEOWNERS for gfx1250 branch (#1)

fe0970a

evbo mentioned this pull request Aug 14, 2025

[Feature] Tensile gfx1032 support #1202

Open

assistant-librarian Bot mentioned this pull request Sep 18, 2025

Kernel oops on illegal instruction for gfx1151 on Fedora Rawhide #1675

Closed

randyspauldingamd mentioned this pull request Oct 23, 2025

[MIOpen] Improve GenericSearch early-stop strategy with dual-sample testing #1993

Closed

10 tasks

bstefanuk mentioned this pull request Nov 10, 2025

[rocblas] Fix schedule name for gfx1103 logics #2539

Merged

1 task

ibrahimw1 added a commit that referenced this pull request Nov 21, 2025

Integrate PR comments #1

5b947b2

NB4444 mentioned this pull request Nov 27, 2025

[rocPRIM] Config modernization #2955

Merged

1 task

This was referenced Apr 6, 2026

Fix for type mismatch: AIHPBLAS-1465 #6177

Closed

Fix code generation when UseBeta is false #6202

Merged

Alex-Vasile added a commit that referenced this pull request Apr 29, 2026

[plan] Remove resolved known issues #1 and #6, renumber remaining

b55d84d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

NolanHannaAMD mentioned this pull request May 1, 2026

[MIOpen] [ROCM-21512] [ALMIOPEN-1814] Fix MIOpen ASan memory leaks #7008

Merged

bnemanich mentioned this pull request May 3, 2026

Add gfx950 MXFP4 and BF16 Subtile-based kernel implementation #7017

Merged

1 task

BrianHarrisonAMD mentioned this pull request May 4, 2026

[hipDNN] Heuristic Policy Framework + Default Plugins (RFC 0007 - Part 2/3) #6605

Merged

davidd-amd mentioned this pull request May 5, 2026

Fix/hipblaslt _APIAlgoIndex regression on per-arch mapping installs #7009

Merged

3 tasks

Initial workflows #1
jayhawk-commits merged 2 commits into develop from initialWorkflows

2 participants
