[Origami] Improve Origami's Test Coverage#3301
Conversation
## Testing Plan - Azure CI (deprecated): Runs all the Catch2 and new Python tests. ✅ - TheRock CI (integration): - CI workflow: https://github.com/ROCm/TheRock/actions/runs/20316557318 ✅ - TheRock branch: https://github.com/ROCm/TheRock/tree/users/neoblizz/minsukim-refactor ✅ (Some tests fail but build passed) - TheRock CI (libraries: hipblaslt, etc.) ✅ - Math CI (performance/hipblaslt) ✅ - Reviews from older PRs are addressed with the few exceptions. ✅ ## PRs History - Original PR: #1859 - Rebased PR: #2718 - Reverted PR: #3416 (comment) - Hot-fix PR: #3417 (review) ([TheRock CI Report](https://github.com/ROCm/TheRock/actions/runs/20293546191)) - **And now this!** ## Technical Details ### Refactor - [x] New file `types.hpp`, consolidates various origami types. - [x] New structs to replace the growing tuple. - [x] API updates to make it scalable, down from 20+ parameters to a few. - [x] Lots of redundant code removal. - [x] Reorganize into `types.hpp`; data types (see `hardware.hpp` for what needs to be moved) - [x] Refactor extract APIs - [x] Add enums for transpose - [x] Refactor debug/log reporting - [x] Remove mutable and statics - [x] Decouple `latency` out of `config_t` - [x] Rebase develop into this branch ### Python APIs & Unit Tests - [x] Update the tests - [x] Update rocroller - [x] hipblaslt/tensilelite/scripts to use the new API ### Testing Infrastructure - Replace YAML-based tests with Catch2 C++ tests - Replace GTest with Catch2 framework - Convert YAML-driven parameterized tests to pure C++ tests - Update common.hpp with direct C++ helper functions - Update CMakeLists.txt to use Catch2 instead of GTest - Remove Boost dependency (was only for YAML) - All test data now hardcoded in C++ for type safety - Better IDE support, debugging, and error messages ### Questions - Rocroller: coordinate on the API design, should they be reused? - Reuse dim3_t from hip_runtime.h (does HIP's dim3 have the same functionality)? - Should transpose/data-types be part of config as well for precomputing? ## Motivation This PR reapplies Origami project's refactor. 1. Make it simpler to make model updates. 2. Well-defined, scalable, clean interface. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ## Work Moved to Future PRs - Logger APIs - Runtime Options - Move instruction map into separate file - Additional tests: #3301 - Minor batch-count specific changes: #3289 - Move Origami's Azure CI -> TheRock CI @ibrahimw1 --------- Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: neoblizz <osama94@gmail.com>
There was a problem hiding this comment.
Pull request overview
This pull request adds comprehensive unit tests for the Origami library to improve code coverage and catch potential failures. The changes focus on testing functions in origami.cpp and gemm.cpp that were previously untested or under-tested.
Key changes include:
- Added unit tests for origami functions including rank_configs, select_topk_configs, select_config_mnk, and select_workgroup_mapping
- Added unit tests for gemm functions covering work/output utilization, CU occupancy, memory bandwidth, L2 hit rates, arithmetic intensity, LDS capacity checks, and cache estimation
- Refactored the make_hardware helper function to use architecture-specific defaults
- Exposed previously static functions (compute_cvt_overhead, arithmetic_intensity, etc.) for testing by adding them to the public API
- Added zero-denominator checks in arithmetic intensity calculations for safety
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 23 comments.
Show a summary per file
| File | Description |
|---|---|
| shared/origami/tests/test_origami.cpp | Added comprehensive unit tests for rank_configs, select_topk_configs, select_config_mnk, and select_workgroup_mapping; updated existing tests to use refactored make_hardware; corrected test naming from "GEMM" to "origami" |
| shared/origami/tests/test_gemm.cpp | Added extensive unit tests for gemm functions including calculate_work_utilization, calculate_output_utilization, compute_cu_occupancy, compute_mem_bw_from_occupancy, compute_l2_hit_rate_global, round_elements_to_128B, compute_cvt_overhead, arithmetic_intensity, check_lds_capacity, and estimate_l2_hit/estimate_mall_hit |
| shared/origami/tests/include/common.hpp | Refactored make_hardware to use architecture-specific default values (942 and 950) instead of parameterized defaults |
| shared/origami/src/origami/gemm.cpp | Added zero-denominator safety checks in arithmetic_intensity and emulated_tf32_arithmetic_intensity; changed compute_cvt_overhead from static to public |
| shared/origami/include/origami/gemm.hpp | Added function declarations for previously internal/static functions to enable unit testing; added comprehensive documentation for newly exposed functions; removed declarations for compute_A_loads and compute_B_loads |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
AlexBrownAMD
left a comment
There was a problem hiding this comment.
Flagged a few test cases that look like they contain typos, REQUIRE statements may be testing the wrong values.
e7248ff to
136b6b9
Compare
perfci run on commit 136b6b946a46eb0641a111b750f1a0f36fa565d9 |
32ada08 to
05b17f3
Compare
perfci run on commit 3d5542eea7b90b6850d73e263e41fa7e22f79683 |
ryanswann-amd
left a comment
There was a problem hiding this comment.
I think this is looking good. There are still some key functions a little undertested. Primarily:
- compute_memory_latency
- compute_tile_latency
Some ideas on things that are undertested. (I came up with these example on the fly so if you find they are wrong feel free to change the specific values)
Compute Memory Latency
Test 1.1: All Transpose Combinations
- Test all 4: NN, NT, TN, TT with same problem size (4096×4096×1024)
- Config: 256×256×64 tile
- Verify: Different MT_M/N/K_rounded_128bytes values based on transpose mode
- Expected: Each transpose mode produces different memory access patterns
Test 1.2: MX Block Sizes
- Problem: 4096×4096×1024
- Test 1: FP8 with a_mx_block_size=32, b_mx_block_size=32
- Test 2: FP16 with mx_block_size=0 (no MX blocks)
- Verify: FP8 adds scale bytes to Ld_CU_bytes calculation
- Expected: MX datatypes have additional overhead for scales
Test 1.3: Problem Size Scaling
- Small: 512×512×256, Medium: 2048×2048×1024, Large: 8192×8192×2048
- Config: 256×256×64 tile (constant)
- Verify: Memory latency increases with problem size
- Expected: Large > Medium > Small latency
Test 1.4: Occupancy and Bandwidth Limiting
- Problem: 4096×4096×1024
- Config: 256×256×64 tile
- Test with num_active_cus: 50, 150, 304 (gfx942)
- Verify:
bw_limitedandmem_bw_occ_limitedvariables scale correctly - Expected: Higher occupancy → better bandwidth utilization → lower latency
Test 1.5: Splitting Factor Impact
- Problem: 4096×4096×1024
- Config: 256×256×64 tile
- Test splitting_factor: 1, 2, 4, 8
- Verify: Splitting affects L2 hit rate and memory access patterns
- Expected: Different splitting factors produce measurable latency differences
Edge Cases:
Test 1.6: Zero Dimensions
- Problem: 0×4096×1024 (or other zero dimension)
- Expected: Graceful handling, no crashes
Test 1.7: Single Tile Problem
- Problem: 128×128×32 (fits in one tile)
- Config: 256×256×64 tile
- Expected: Minimal memory latency
Test 1.8: Skinny Matrices
- Test 1: 64×8192×1024 (tall and skinny)
- Test 2: 8192×64×1024 (wide and short)
- Config: 256×256×64 tile
- Expected: Different memory access patterns, reasonable latency values
Compute Tile Latency
Test 2.1: Small K (One Iteration)
- Problem: 4096×4096×64 (K = MT_K, single iteration)
- Config: 256×256×64 tile
- Verify: Total latency > single compute iteration (includes prologue/epilogue)
- Expected: L_tile_total > L_tile_single due to setup overhead
Test 2.2: K-Split Reduction Overhead
- Problem: 4096×4096×512 (small K)
- Config: 256×256×64 tile
- Test splitting_factor: 1, 2, 8
- Verify: When K is small and we split, latency increases due to reduction overhead
- Expected: Factor=8 > Factor=2 > Factor=1 (for small K problems)
Test 2.3: Work Utilization Penalty
- Problem 1: 4096×4096×1024 (perfect alignment, utilization=1.0)
- Problem 2: 4351×3839×959 (poor alignment, utilization≈0.5)
- Config: 256×256×64 tile
- Verify: Poor alignment increases latency via effective_tile_penalty
- Expected: Poor alignment → ~2× higher latency
Test 2.4: K-Dimension Zero Padding
- Problem 1: K=1024 (divisible by MT_K=64) → no penalty
- Problem 2: K=1025 (K % MT_K = 1, worst case) → maximum penalty
- Problem 3: K=1023 (K % MT_K = MT_K-1) → small penalty
- Config: 256×256×64 tile
- Verify: Padding penalty scales with K % MT_K
-
Expected: 1023=1024<1025 and
$L_{1024}-L_{1023}$ <$L_{1025}-L_{1024}$
Test 2.5: K-Splitting Performance Benefit
- Problem: 256×256×4096 (large K, single tile in M/N)
- Config: 256×256×64 tile
- Test splitting_factor: 1 vs 4
- Verify: Large K problem with splitting should be faster (parallelism benefit)
- Expected: Split=4 latency < Split=1 latency (for large K)
Test 2.6: Iteration Count Validation
- Problem 1: K=64, MT_K=64 → num_iter=1
- Problem 2: K=256, MT_K=64 → num_iter=3 (4 total, -1 for epilogue)
- Problem 3: K=512, MT_K=64, split=2 → num_iter based on k_per_split
- Config: 256×256×64 tile
- Verify: Iteration count calculation is correct
-
Expected: num_iter matches formula: ceil(k_per_split/MT_K) - 1.
$L_{K512} > L_{K256}> L_{K64}$
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
05b17f3 to
1507899
Compare
perfci run on commit 1507899 |
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## develop #3301 +/- ##
============================================
- Coverage 55.53% 43.62% -11.91%
============================================
Files 521 30 -491
Lines 67464 11442 -56022
Branches 7949 1440 -6509
============================================
- Hits 37464 4991 -32473
+ Misses 26645 5984 -20661
+ Partials 3355 467 -2888
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
ryanswann-amd
left a comment
There was a problem hiding this comment.
PR getting big. Tests requested above will be added in another PR.
perfci run on commit a1b81a2 |
perfci run on commit 75a5646 |
Motivation
Expanded unit testing and improved analytical modeling coverage for Origami GEMM module. Enhanced documentation and robustness of selection/ranking utilities.
Technical Details
origami/gemm.hppfor modeling work/output utilization, CU occupancy, memory bandwidth, L2 hit rate, and arithmetic intensity. Improved Doxygen documentation and removed obsolete API.make_hardware,make_config) to ensure robust and architecture-specific test setups.test_origami.cppto:Test Result
All unit tests pass locally.