[MIOpen] bugfix: GenericSearch warm-up bias: apply warm-up to all kernel configurations #1978
Merged
JoeLiuAMD merged 3 commits on Oct 8, 2025
BrianHarrisonAMD
approved these changes
Oct 6, 2025
Contributor
BrianHarrisonAMD
left a comment
Changes look good to me!
Problem: The original code only applied a warm-up run to the first configuration (n_current == 0), leading to unfair comparison. The first configuration always benefited from warm-up, while subsequent configurations suffered from cold-start performance penalties.

Impact: This caused up to 40% false negative rate in kernel selection, resulting in 4x performance degradation when the optimal kernel was incorrectly rejected due to cold-start bias.

Solution: Remove the 'if(n_current == 0)' condition to ensure every configuration receives a warm-up run before performance measurement. This guarantees fair comparison across all kernel configurations.

Test Results: Verified on MI355X (gfx950) with 100 test runs - optimal kernel is now consistently selected (10/10 runs vs 6/10 before the fix).
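To make the bias concrete, here is a minimal, self-contained sketch of the selection loop. It is not MIOpen code: `ToyKernel` and `PickBest` are hypothetical names, and the cold-start costs are made-up illustrative values. Only the steady-state times (0.343 ms and 0.099 ms) are taken from the PR description. It assumes a kernel pays its cold-start cost exactly once, on its first launch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one kernel configuration (not an MIOpen type).
// The first launch pays a one-time cold-start cost; later launches do not.
struct ToyKernel
{
    double steady_ms;     // steady-state time per launch
    double cold_extra_ms; // extra time added to the first launch only (illustrative)
    bool launched;        // has this kernel been launched before?

    double Run()
    {
        const double t = launched ? steady_ms : steady_ms + cold_extra_ms;
        launched       = true;
        return t;
    }
};

// Returns the index of the configuration with the smallest single timed launch.
// warm_up_all == false mimics the old condition (warm-up only when n_current == 0);
// warm_up_all == true mimics the fix (warm-up before every measurement).
// Kernels are taken by value so each call starts from a cold state.
std::size_t PickBest(std::vector<ToyKernel> kernels, bool warm_up_all)
{
    std::size_t best = 0;
    double best_ms   = 0.0;
    for(std::size_t n_current = 0; n_current < kernels.size(); ++n_current)
    {
        if(warm_up_all || n_current == 0)
            kernels[n_current].Run(); // untimed warm-up launch, result discarded

        const double ms = kernels[n_current].Run(); // timed launch
        if(n_current == 0 || ms < best_ms)
        {
            best    = n_current;
            best_ms = ms;
        }
    }
    return best;
}
```

With two configurations `{0.343, 1.0, false}` and `{0.099, 2.9, false}`, the old condition picks index 0 because the faster kernel is timed cold, while warming up every configuration picks index 1.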
54117ff to
d99fd30
Contributor
BrianHarrisonAMD
left a comment
LGTM, let's get this merged.
JonathanLichtnerAMD
approved these changes
Oct 7, 2025
assistant-librarian Bot pushed a commit to ROCm/MIOpen that referenced this pull request on Oct 8, 2025
[MIOpen] bugfix: GenericSearch warm-up bias: apply warm-up to all kernel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

> **📝 Note**: This PR has a follow-up PR, #1993.

## Motivation

MIOpen's generic search algorithm suffers from a **race condition** that causes optimal kernels to be randomly rejected, leading to 3-4x performance degradation in some cases.

### Problem Description

When running the same convolution workload multiple times, as in the sample below:

```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
    -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
    -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**

- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per operation (**3.35x slower**)

[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)
[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines 559-564):

```cpp
// Original buggy code
if(n_current == 0) // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first configuration tested:

- **First kernel** (n_current == 0): Gets warm-up → Fair performance measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:
  Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
  Kernel #1 (64,64,64,32,8,8...): 1 sample → 1.219 ms → rejected (cold-start!)
  Kernel #2 (64,64,16,32,8,8...): 1 sample → 3.0267 ms → rejected (cold-start!)
  Kernel #3 (64,16,64,32,8,8...): 1 sample → 0.482 ms → rejected (cold-start!)

Final execution: 0.332 ms (using Kernel #0)

Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):

- `DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
- Sample 1: **3.027 ms** ← Cold start! (30x slower than true performance)
- Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359, 0.347, 0.352 ms
- **True mean**: 0.354 ms (excluding cold-start outlier)
- **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
- **Wrong outcome**: Best kernel discarded due to unfair cold-start penalty
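The early-stop decision quoted above can be checked by hand. The sketch below is one reading of the log line `3.027 > 0.377 × 1.1`: a candidate is dropped when its sample exceeds the reference time by more than a tolerance factor. The helper name `RejectedByEarlyStop` and its signature are hypothetical, and the actual rule in `generic_search.hpp` may differ.

```cpp
// Hypothetical helper (not an MIOpen function): drop a candidate when its
// timing sample exceeds the reference time scaled by a tolerance factor.
// The 1.1 default is taken from the log line "3.027 > 0.377 × 1.1".
bool RejectedByEarlyStop(double sample_ms, double reference_ms, double tolerance = 1.1)
{
    return sample_ms > reference_ms * tolerance;
}
```

With a reference of 0.377 ms the threshold is 0.4147 ms. The cold first sample (3.027 ms) exceeds it and the kernel is rejected, whereas the first warm sample from the same log (0.369 ms) would have passed the check.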
Technical Details
Changes
This PR contains only the bug fix - removing the unfair warm-up condition:
File modified:
projects/miopen/src/include/miopen/generic_search.hpp (lines 559-564)
Change summary:
Remove the if(n_current == 0) condition
Why This is Low Risk
Test Plan
Test Environment
Test Command
Test Results
Before Fix (with bug):
After Fix:
Performance Impact
Search Time Overhead
Accuracy Improvement
Backward Compatibility
✅ Fully compatible
Why Split into Two PRs?
Following reviewer feedback, this work has been split into two separate PRs:
PR1 (This PR) - Bug Fix: Warm-up Bias
PR2 (Separate PR) - Optimization: Early-Stop Strategy
users/JoeLiuAMD/miopen-generic-search-optimization
[MIOpen] Improve GenericSearch early-stop strategy with dual-sample testing #1993
Why This Approach?
Submission Checklist