[MIOpen] Improve GenericSearch early-stop strategy with dual-sample testing#1993
[MIOpen] Improve GenericSearch early-stop strategy with dual-sample testing#1993JoeLiuAMD wants to merge 3 commits into
Conversation
Problem: The original code only applied a warm-up run to the first configuration (n_current == 0), leading to unfair comparison. The first configuration always benefited from warm-up, while subsequent configurations suffered from cold-start performance penalties. Impact: This caused up to 40% false negative rate in kernel selection, resulting in 4x performance degradation when the optimal kernel was incorrectly rejected due to cold-start bias. Solution: Remove the 'if(n_current == 0)' condition to ensure every configuration receives a warm-up run before performance measurement. This guarantees fair comparison across all kernel configurations. Test Results: Verified on MI355X (gfx950) with 100 test runs - optimal kernel is now consistently selected (10/10 runs vs 6/10 before the fix).
Problem: - Only the first kernel configuration received warm-up, causing cold-start performance bias for subsequent configurations - The 1.1x early-stop threshold was too aggressive, sometimes discarding potentially optimal kernels due to first-sample variance (CV=11.9% of 100 samples) Solution: - Add warm-up run for every kernel configuration to eliminate cold-start bias - Implement 2 initial tests + take minimum strategy (2nd sample CV=3.1%) - Increase early-stop threshold from 1.1x to 1.2x to reduce false negatives Impact: - Ensures fair comparison across all kernel configurations - Reduces sampling noise from 11.9% to 3.1% coefficient of variation - Better balance between search speed and accuracy - Based on 100-run stability analysis on gfx950 with ROCm 7.9.0 Testing: - Verified on gfx950 with ROCm 7.9.0 - Tested with convolution backward data workloads (NHWC layout) - Confirmed stable performance across multiple runs - command:MIOPEN_FIND_ENFORCE=4 MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=7 MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC -m conv -g 1 -t 1 -F 2 # Conflicts: # projects/miopen/src/include/miopen/generic_search.hpp
…nel configurations (#1978) # Fix GenericSearch warm-up bias: apply warm-up to all configurations >** 📝 Note**: This PR has follow up PR #1993 ## Motivation MIOpen's generic search algorithm suffers from a **race condition** that causes optimal kernels to be randomly rejected, leading to 3-4x performance degradation in some cases. ### Problem Description When running the same convolution workload multiple times as sample below: ```bash MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \ -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \ -m conv -g 1 -t 1 -F 2 ``` **Observed behavior:** - **Lucky case**: Selected optimal kernel → **0.099 ms** per operation - **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per operation (**3.35x slower**) [Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log) [Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log) ### Root Cause **Cold-start bias in warm-up logic** (`generic_search.hpp`, lines 559-564): ```cpp // Original buggy code if(n_current == 0) // ❌ Only first config gets warm-up { invoker(profile_h, invoke_ctx); profile_h.ResetKernelTime(); } ``` This condition creates an **unfair advantage** for the first configuration tested: - **First kernel** (n_current == 0): Gets warm-up → Fair performance measurement - **Subsequent kernels** (n_current > 0): No warm-up → Cold-start penalty (up to **100x slower** in extreme cases) ### Impact - **High false negative rate**: Up to 40% chance of rejecting the optimal kernel - **Performance degradation**: 4x slower execution when suboptimal kernel is selected - **Non-deterministic behavior**: Kernel selection depends on which configuration is tested first ### Example from Production Logs **Environment**: MI355X (gfx950), ROCm 7.0.2 ``` AI generated 4 kernel configurations for testing: Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best" Kernel #1 (64,64,64,32,8,8...): 1 sample → 1.219 ms → rejected (cold-start!) Kernel #2 (64,64,16,32,8,8...): 1 sample → 3.0267 ms → rejected (cold-start!) Kernel #3 (64,16,64,32,8,8...): 1 sample → 0.482 ms → rejected (cold-start!) Final execution: 0.332 ms (using Kernel #0) Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample) With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel) ``` **Detailed timing from Normal_Joe_20250930.log:** Optimal kernel (incorrectly rejected due to cold-start): - `DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1` - Sample 1: **3.027 ms** ← Cold start! (30x slower than true performance) - Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359, 0.347, 0.352 ms - **True mean**: 0.354 ms (excluding cold-start outlier) - **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1) - **Wrong outcome**: Best kernel discarded due to unfair cold-start penalty --- ## Technical Details ### Changes This PR contains **only the bug fix** - removing the unfair warm-up condition: ```diff - // Warm-up run for first time invoker is used - if(n_current == 0) - { - invoker(profile_h, invoke_ctx); - profile_h.ResetKernelTime(); - } + // Warm-up run for every configuration to eliminate cold-start bias + invoker(profile_h, invoke_ctx); + profile_h.ResetKernelTime(); ``` **File modified:** `projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564) **Change summary:** - 3 insertions(+), 6 deletions(-) - Removes `if(n_current == 0)` condition - Ensures every configuration receives one warm-up run before measurement ### Why This is Low Risk 1. **Minimal code change**: Only 4 lines changed 2. **No algorithm change**: Same sampling strategy, same early-stop logic 3. **Only ensures fairness**: All configs now receive identical warm-up treatment 4. **No performance regression**: Adds one extra kernel call per config (~0.3ms overhead per config) 5. **Negligible overhead**: For typical 4-config search, adds 1.2ms total (kernel compilation takes 10-30 seconds, so overhead is <0.01%) --- ## Test Plan ### Test Environment - **Hardware**: MI355X (gfx950) - **ROCm Version**: 7.0.2 (HIP 7.0.51831) - **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel configurations) ### Test Command ```bash export MIOPEN_LOG_LEVEL=5 export MIOPEN_FIND_MODE=1 ./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \ -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \ --in_layout NHWC --fil_layout NHWC --out_layout NHWC \ -m conv -g 1 -t 1 -F 2 ``` ### Test Results #### Before Fix (with bug): - **Success rate**: 6/10 runs selected optimal kernel (40% error rate) - **Failure pattern**: Optimal kernel rejected when its cold-start time triggered early-stop - **Performance impact**: Up to 4x slower when wrong kernel selected (0.332ms vs 0.099ms) #### After Fix: - **Success rate**: 10/10 runs selected optimal kernel (0% error rate) - **Consistency**: All configurations receive fair warm-up - **Performance**: Optimal kernel always selected, no degradation - **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation time) --- ## Performance Impact ### Search Time Overhead - **Additional cost**: 1 warm-up run per configuration (only for configs beyond the first) ### Accuracy Improvement - **Before**: 60% success rate (6/10 runs correct) - **After**: 100% success rate (10/10 runs correct) - **Performance gain**: Eliminates 4x slowdown from selecting wrong kernel --- ## Backward Compatibility ✅ **Fully compatible** - No API changes - No behavior changes except for fixing the bug - All existing tests pass - No impact on already-cached kernels (find database not affected) --- ## Why Split into Two PRs? Following reviewer feedback, this work has been split into two separate PRs: ### **PR1 (This PR) - Bug Fix: Warm-up Bias** - **Risk**: Low (4 lines changed) - **Impact**: Fixes root cause of unfair kernel comparison - **Decision**: Ready for immediate merge - **Rationale**: Without fair warm-up, no amount of threshold tuning can fix the problem. Cold-start penalties (30-100x slower) make any single threshold value inadequate. ### **PR2 (Separate PR) - Optimization: Early-Stop Strategy** - **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993 - **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging - **Risk**: Medium (affects benchmark timing) - **Decision**: Needs more validation and benchmarking - **Rationale**: These optimizations improve accuracy further (40% → 0% error rate) but add ~2 kernel executions per config. The performance impact needs separate evaluation. ### Why This Approach? - **PR1 can merge immediately**: Fixes the critical bug with minimal risk - **PR2 can be validated thoroughly**: Performance trade-offs can be evaluated independently - **Easier to isolate regressions**: If issues arise, we know which change caused them - **Progressive improvement**: Get the bug fix deployed while optimizations are being validated --- ## Submission Checklist - [x] I have read and agreed with the [contributing guidelines](CONTRIBUTING.md) - [x] The changes are minimal and focused on the bug fix - [x] All existing tests pass - [x] The fix has been verified on target hardware (MI355X/gfx950) - [x] The fix eliminates the non-deterministic kernel selection issue - [x] No performance regression introduced - [x] Documentation (commit message) clearly explains the problem and solution - [x] Backward compatibility confirmed - [x] Test data and logs provided for verification
JonathanLichtnerAMD
left a comment
There was a problem hiding this comment.
I was talking to Randy and he really wanted to look at this PR, but he won't be back until the 20th. I'll put a "request changes" on this until he gets back.
I think we also will want the changes that Nolan and Chris are doing to either go in first or at the same time.
|
These changes aren't incompatible with the work in #1536. |
@randyspauldingamd is back, so he can take a look at this PR now. I will remove my blocking change request. |
Randy is back, so he can look at this now.
|
Hi Joe, thanks for doing this, it's a great start! It does need some refinement, and the theory part needs to be softened. The first thing I'll suggest to anyone is Rule #1: always plot the data. This is a histogram of your 100 runs, samples 2-10: Now, there are a few notable features:
Next, recommendations:
I might suggest a factor of 1.8 for a first cutoff from Sample 1. If that passes, follow it with a 2nd cutoff on the minimum of Samples 1 and 2. From your data, the largest min was 16% slower than the mean, so 1.2 seems a reasonable first choice for this factor. Finally, I will add it wouldn't hurt to repeat this exercise on a few hundred or thousand Solver/shape combinations, and greatly increase the number of runs to build confidence. Let's leave that for another day though :).
|
Hi Randy, Thanks for the histogram analysis. There's a fundamental difference in our analytical approaches that needs clarification: Different Statistical DimensionsYour histogram analysis: Pooled distribution of all Sample 2-10 values (900 data points mixing different sample positions) - this shows GPU timing complexity but isn't directly relevant to early-stop decisions. Our algorithm requires: Cross-run stability for individual sample positions - "Are the initial samples (Sample 1, Sample 2) from this run stable enough to be representative for cutoff decisions?" For example:
The pooled distribution characteristics don't affect this cross-run prediction problem. Adopting Empirical ApproachThe empirical approach makes sense. My PR included both empirical validation (0/100 false negatives, 100-run stability testing) and theoretical Z-score analysis - the latter served to supplement our empirical findings with statistical explanation. The data-driven thresholds (1.8x for Sample 1, 1.2x for min) are more robust for GPU timing complexities. Additional ObservationsI've discovered an potential pattern: kernels <0.01ms show significantly higher variability (see the CV below for different shapes). It seems the shorter the kernel time, the higher the CV in samples. This suggests we may need adaptive thresholds based on kernel runtime - shorter kernels might require more conservative cutoffs due to higher measurement noise relative to signal. I'm collecting more comprehensive data across different kernel time ranges to establish whether we should implement runtime-dependent thresholds (e.g., 1.8x for >0.01ms kernels, 2.5x for <0.01ms kernels). Was this kernel-time dependency considered in the original early-stop design? Do you have data from the initial implementation that supports the original design? Implementation PlanI'll update the PR to:
Best regards, |
|
A two-level cutoff (1.8x followed by 1.2x) for early stopping is preferable. If needed, we could extend to three levels or more, but more data statistics would be required to support them. |
|
Greetings @JoeLiuAMD! Thanks for all the analysis and effort to put this together. Very appreciated! The next steps are looking good. |
As Brian also suggested, I think the analysis you're doing is better and more comprehensive than any in the past. I don't know of any other data besides what he can dig up. More fun stuff:
We may not be able to neglect this for short-running kernels. I'm afraid I don't have much time to help with this decision, and we also need more data, so please forge ahead. If you do feel that this is a problem, we can discuss a few things to try. It could be mostly power management being too aggressive at identifying idle time and shutting things down. The simplest workaround would be to just sort the runs into "slow warmup" and "fast warmup" sets. I think we could just ignore the "fast" set and use "slow" only since it is the worst-case scenario, but it may not be this simple. Alternatively, add a delay between each run to try to force "slow mode" every time. Perhaps 10-100 ms. Note that there are likely driver settings that will reduce power management or disable it entirely, which should make the results more stable. While this is better from a scientific standpoint, it is probably not desirable here though, since we need these results to be representative of what users will experience. Back to your observations: I believe that reducing tuning time is the dominant goal for this effort, but end-user runtime is ultimately more important. For tuning short-running kernels, additional iterations are similar cost to even simple CPU analysis, so we shouldn't be hesitant to be creative, but not get too fancy with the implementation. A simple runtime-dependent threshold like you suggest is a very valid option. A couple other options would include:
Cheers, |
|
I've been focused on urgent customer GEMM work recently, which caused the delay of this PR. After collecting more data across different kernel time ranges, I've identified some patterns that may lead to a more effective search strategy. I need additional time to validate the approach and will update here with results. Thanks. |
Awesome, thank you again! |
|
This pull request has been inactive for 25 days and will be marked as stale. If you would like to keep this PR open, please:
This PR will be automatically closed in 5 days if no further activity occurs. |
|
This pull request has been automatically closed due to inactivity (30 days with no updates). If you'd like to continue working on this, feel free to reopen the PR or create a new one. |




Improve GenericSearch early-stop strategy with dual-sample testing
Motivation
Problem Description
After fixing the warm-up bias bug (PR #1978), the
GenericSearchalgorithm still exhibits suboptimal behavior due to high variance in initial performance measurements and an overly aggressive early-stop threshold.Context from Production Logs:
Even with fair warm-up applied to all configurations, the same convolution workload still shows some variability:
Observed behavior:
Lucky_Joe_20250930.log
Normal_Joe_20250930.log
Root Causes
Impact
Technical Details
Solution Overview
This PR implements three coordinated improvements to reduce measurement noise and make early-stop decisions more robust:
Detailed Changes
1. Dual-Sample Initial Testing (Lines 563-579)
Rationale: Statistical analysis shows taking the minimum of 2 samples reduces variance significantly (CV drops from 11.9% to 3.1%).
2. Relaxed Early-Stop Threshold (Lines 604-607)
Rationale: The 1.2x threshold provides adequate margin for measurement noise while still effectively filtering out poor configurations.
3. Sampling Loop Adjustment (Lines 616-617)
Rationale: Maintains 10 total samples (2 initial + 8 additional) for statistical stability.
4. Enhanced Logging (Lines 658-664)
Rationale: Provides visibility into why configurations are rejected, aiding in debugging and validation.
Statistical Analysis
Data Collection Methodology
100 independent test runs on MI355X (gfx950), ROCm 7.0.2:
Key Findings
Key insight: CV drops from 11.9% → 3.1% between Sample 1 and Sample 2, justifying dual-sample strategy.
Raw Data Sample
Full dataset: 100runs_sample.log
False Negative Rate Analysis
Theoretical calculation using Z-score methodology:
Assume timing measurements follow a normal distribution N(μ, σ²) where:
For two independent samples X₁, X₂ ~ N(μ, σ²):
Strategy Comparison:
Detailed calculation for new strategy:
Empirical validation:
Conclusion: Both theory (10⁻¹² probability) and empirical data (0/100 runs) confirm that the 1.2x threshold with dual-sample strategy makes false rejections virtually impossible.
Test Plan & Results
Environment: MI355X (gfx950), ROCm 7.0.2, grouped convolution backward data
Test command:
Results:
Performance Impact
Search Time Overhead: 1 extra initial test per config (typically <1ms vs 10-30s compilation) → <0.01% overhead
Accuracy Improvement:
Backward Compatibility
✅ Fully compatible
Alternative Solutions Considered
Rationale for Strategy A:
Why Two PRs?
Following reviewer feedback, the complete fix has been split into two stages:
Progressive Impact:
Benefits of splitting:
Submission Checklist