[MIOpen] Improve GenericSearch early-stop strategy with dual-sample testing by JoeLiuAMD · Pull Request #1993 · ROCm/rocm-libraries

JoeLiuAMD · 2025-10-07T03:25:25Z

Improve GenericSearch early-stop strategy with dual-sample testing

⚠️ Note: This PR depends on #1978 (warm-up bug fix) being merged first.

Motivation

Problem Description

After fixing the warm-up bias bug (PR #1978), the GenericSearch algorithm still exhibits suboptimal behavior due to high variance in initial performance measurements and an overly aggressive early-stop threshold.

Context from Production Logs:

Even with fair warm-up applied to all configurations, the same convolution workload still shows some variability:

MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2

Observed behavior:

Without fixes: 6/10 runs select correct kernel, 4/10 select suboptimal kernel
With PR #1978 alone: Fixes cold-start bias, but measurement noise may still cause occasional misselections
Goal of this PR: Eliminate remaining measurement noise issues

Lucky_Joe_20250930.log
Normal_Joe_20250930.log

Root Causes

High variance in first measurement: Analysis of 100 test runs shows the first sample has 11.9% coefficient of variation (CV), causing unreliable early-stop decisions
Aggressive threshold: The 1.1x early-stop multiplier combined with high measurement noise leads to false negatives (discarding optimal kernels)

Impact

Residual measurement noise: Even with warm-up fix, first sample variance (CV=11.9%) can still cause occasional misselections
Non-deterministic results: Same workload may produce different kernel selections across runs due to measurement noise

Technical Details

Solution Overview

This PR implements three coordinated improvements to reduce measurement noise and make early-stop decisions more robust:

Dual-sample initial testing: Run 2 initial tests and use the minimum for early-stop evaluation
Relaxed early-stop threshold: Increase from 1.1x to 1.2x to account for measurement variance
Enhanced logging: Add visibility into early-stop decisions for debugging

Detailed Changes

1. Dual-Sample Initial Testing (Lines 563-579)

// Run 2 initial tests and take the minimum to reduce noise
// (Based on 100-run stability analysis: 1st sample CV=11.9%, 2nd CV=3.1%)
float initial_time_1 = 0.0f;
float initial_time_2 = 0.0f;

invoker(profile_h, invoke_ctx);
initial_time_1 = profile_h.GetKernelTime();
profile_h.ResetKernelTime();

invoker(profile_h, invoke_ctx);
initial_time_2 = profile_h.GetKernelTime();
profile_h.ResetKernelTime();

// Use minimum of the two initial tests for early-stop threshold check
elapsed_time = std::min(initial_time_1, initial_time_2);
samples.push_back(initial_time_1);
samples.push_back(initial_time_2);

Rationale: Statistical analysis shows taking the minimum of 2 samples reduces variance significantly (CV drops from 11.9% to 3.1%).

2. Relaxed Early-Stop Threshold (Lines 604-607)

-                if(elapsed_time / worst_time < 1.10f)
+                constexpr float EARLY_STOP_THRESHOLD = 1.20f;
+                if(elapsed_time / worst_time < EARLY_STOP_THRESHOLD)

Rationale: The 1.2x threshold provides adequate margin for measurement noise while still effectively filtering out poor configurations.

3. Sampling Loop Adjustment (Lines 616-617)

-                        for(int i = 1; i < N_RUNS; ++i)
+                        // Continue with 8 more samples (we already have 2 initial samples)
+                        for(int i = 2; i < N_RUNS; ++i)

Rationale: Maintains 10 total samples (2 initial + 8 additional) for statistical stability.

4. Enhanced Logging (Lines 658-664)

+                else
+                {
+                    MIOPEN_LOG_I2("Configuration discarded by early-stop: " << elapsed_time << " / "
+                                                                            << worst_time << " = "
+                                                                            << (elapsed_time / worst_time)
+                                                                            << " >= " << EARLY_STOP_THRESHOLD);
+                }

Rationale: Provides visibility into why configurations are rejected, aiding in debugging and validation.
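The four changes above can be read together as one flow. The following is a minimal, self-contained sketch of that flow, not the code in `generic_search.hpp`: `TimedRun`, `BenchmarkConfig`, and `reference_time` are hypothetical names, with `reference_time` standing in for `worst_time` from the diff, and the warm-up call reflects PR #1978.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in for one timed kernel launch; returns elapsed ms.
using TimedRun = std::function<float()>;

constexpr int N_RUNS                 = 10;    // total samples kept per config
constexpr float EARLY_STOP_THRESHOLD = 1.20f; // value proposed in this PR

// Returns the mean of N_RUNS samples, or std::nullopt if the configuration
// is discarded by the early-stop check. `reference_time` plays the role of
// `worst_time` in the diff above (an assumption of this sketch).
std::optional<float> BenchmarkConfig(const TimedRun& run, float reference_time)
{
    assert(reference_time > 0.0f);

    run(); // warm-up for every configuration, result discarded

    std::vector<float> samples;
    samples.push_back(run()); // initial sample 1
    samples.push_back(run()); // initial sample 2

    // Early-stop uses the faster of the two initial samples.
    const float elapsed_time = std::min(samples[0], samples[1]);
    if(!(elapsed_time / reference_time < EARLY_STOP_THRESHOLD))
        return std::nullopt; // discarded by early-stop

    // 8 more samples, for 10 in total.
    for(int i = 2; i < N_RUNS; ++i)
        samples.push_back(run());

    const float sum = std::accumulate(samples.begin(), samples.end(), 0.0f);
    return sum / static_cast<float>(samples.size());
}
```

In this sketch a configuration that survives costs 11 launches (1 warm-up + 10 samples) and a discarded one costs 3, which is the tuning-time trade-off raised later in the review.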

Statistical Analysis

Data Collection Methodology

100 independent test runs on MI355X (gfx950), ROCm 7.0.2:

11 samples per run (1 warm-up + 10 measurements)
Total: 1,100 data points analyzed

MIOPEN_FIND_ENFORCE=4 MIOPEN_LOG_LEVEL=7 \
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2

Key Findings

Sample	Mean (ms)	Std Dev (ms)	CV	Interpretation
Sample 0 (warm-up)	1.537	3.835	249.6%	Unreliable (30% cold-start)
Sample 1	0.0959	0.0114	11.9%	❌ Too high variance for early-stop
Sample 2	0.0902	0.0028	3.1%	✅ Much more stable
Samples 2-10 avg	0.0873	0.0038	4.4%	True performance baseline

Key insight: CV drops from 11.9% → 3.1% between Sample 1 and Sample 2, justifying dual-sample strategy.
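For readers who want to reproduce the CV column from their own logs, here is a minimal sketch of the calculation. The function name is hypothetical, and the use of the n-1 (sample) standard deviation is an assumption, since the analysis does not say which estimator was used.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Coefficient of variation: standard deviation divided by mean.
// Assumption: Bessel-corrected (n-1) standard deviation.
double CoefficientOfVariation(const std::vector<double>& xs)
{
    assert(xs.size() >= 2);

    double mean = 0.0;
    for(double x : xs)
        mean += x;
    mean /= static_cast<double>(xs.size());

    double sum_sq = 0.0;
    for(double x : xs)
        sum_sq += (x - mean) * (x - mean);

    const double stddev = std::sqrt(sum_sq / static_cast<double>(xs.size() - 1));
    return stddev / mean;
}
```

Feeding it the cross-run values of one sample position (for example, all 100 Sample 1 timings) gives the per-position CV reported in the table. Applied to only the six rows listed under Raw Data Sample, Sample 1 again comes out with a larger CV than Sample 2.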

Raw Data Sample

Run  | Warm-up  | Sample1   | Sample2   | Samples 3-10 (mean)
-----|----------|-----------|-----------|--------------------
0    | 9.712 ms | 0.1084 ms | 0.0992 ms | 0.0884 ms
1    | 0.084 ms | 0.0907 ms | 0.0817 ms | 0.0823 ms
2    | 8.703 ms | 0.1217 ms | 0.0854 ms | 0.0881 ms
3    | 0.099 ms | 0.0819 ms | 0.0909 ms | 0.0852 ms
4    | 8.649 ms | 0.1103 ms | 0.0942 ms | 0.0889 ms
5    | 0.088 ms | 0.0913 ms | 0.0829 ms | 0.0827 ms
...

Full dataset: 100runs_sample.log

False Negative Rate Analysis

Theoretical calculation using Z-score methodology:

Assume timing measurements follow a normal distribution N(μ, σ²) where:

μ = mean of 10 samples ≈ 0.0873 ms (from empirical data)
σ = standard deviation ≈ 0.0038 ms

For two independent samples X₁, X₂ ~ N(μ, σ²):

P(X_min > t) = P(X₁ > t) · P(X₂ > t) = [1 - Φ((t-μ)/σ)]²
Where Φ is the standard normal CDF

Strategy Comparison:

Strategy	Samples	Threshold	Threshold Value	Mean	Std Dev	Z-score	False Negative Rate	Assessment
Original	1	1.1x	0.0960 ms	0.0959 ms	0.0114 ms	0.09	46.4%	❌ Too risky
This PR	2 (min)	1.2x	0.1048 ms	0.0873 ms	0.0038 ms	4.61	4 × 10⁻¹² (<0.00000001%)	✅ Near-zero

Detailed calculation for new strategy:

Threshold: t = 1.2 × 0.0873 = 0.1048 ms
Z-score: z = (0.1048 - 0.0873) / 0.0038 ≈ 4.61
P(single sample > 0.1048) = 1 - Φ(4.61) ≈ 2 × 10⁻⁶
P(min(X₁, X₂) > 0.1048) = (2 × 10⁻⁶)² ≈ 4 × 10⁻¹²
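The arithmetic in this calculation can be checked with the standard library alone. The sketch below uses hypothetical function names and relies on the identity 1 - Φ(z) = ½·erfc(z/√2). It inherits the normality and independence assumptions stated above, so it verifies the numbers, not the model.

```cpp
#include <cassert>
#include <cmath>

// Upper-tail probability of a standard normal variable: 1 - Phi(z).
double NormalUpperTail(double z)
{
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}

// P(min(X1, X2) > t) for two independent N(mu, sigma^2) samples,
// i.e. the probability that both samples exceed the threshold t.
double MinOfTwoExceeds(double t, double mu, double sigma)
{
    assert(sigma > 0.0);
    const double p = NormalUpperTail((t - mu) / sigma);
    return p * p;
}
```

Calling `MinOfTwoExceeds(0.1048, 0.0873, 0.0038)` evaluates the last two steps of the list in one go.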

Empirical validation:

In 100 experimental runs, min(X₁, X₂) never exceeded 1.2 × mean
Observed false negative rate: 0/100 = 0%

Conclusion: Both theory (10⁻¹² probability) and empirical data (0/100 runs) confirm that the 1.2x threshold with dual-sample strategy makes false rejections virtually impossible.

Test Plan & Results

Environment: MI355X (gfx950), ROCm 7.0.2, grouped convolution backward data

Test command:

MIOPEN_LOG_LEVEL=5 MIOPEN_FIND_MODE=1 \
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC \
  --out_layout NHWC -m conv -g 1 -t 1 -F 2

Results:

Test	Success Rate	Note
Original (no fixes)	6/10 (40% error)	Baseline
PR #1978 only	Improved, occasional errors remain	Warm-up fix applied
This PR	10/10 (0% error)	Full solution
100-run stability	100/100	0.099ms ± 0.003ms (CV=3.0%)

Performance Impact

Search Time Overhead: 1 extra initial test per config (typically <1ms vs 10-30s compilation) → <0.01% overhead

Accuracy Improvement:

Success rate: 6/10 → 10/10 (0% error)
False negative rate: Theoretical 10⁻¹², empirical 0/100
Performance: 0.099ms ± 0.003ms (CV=3.0%), consistent across runs

Backward Compatibility

✅ Fully compatible

No API changes
No database format changes
Only internal algorithm improvements
Same algorithm flow, enhanced measurement strategy

Alternative Solutions Considered

Strategy	Tests/Config	Threshold	Accuracy	Search Overhead	Decision
A (This PR)	2 initial (min)	1.2x	~100% (10⁻¹²)	+1 test/config	✅ Selected - Best balance
B	3 initial (min)	1.1x	~100%	+2 tests/config	❌ Higher overhead, minimal gain
C	1 initial	1.3x	~95%	Same as baseline	❌ Still 5% false negatives
D	1 initial	1.1x	~54%	Same as baseline	❌ Unacceptable error rate

Rationale for Strategy A:

Achieves near-perfect accuracy (theoretical: 10⁻¹², empirical: 0/100)
Minimal overhead (1 extra test per config, negligible vs compilation)
The 1.2x threshold is statistically justified (Z-score = 4.61)
2 initial samples provide sufficient noise reduction without excessive cost
Mathematically proven and empirically validated

Why Two PRs?

Following reviewer feedback, the complete fix has been split into two stages:

Stage	PR	Focus	Risk	Status
1	#1978	Bug Fix: Warm-up bias	Low (4 lines)	Ready for merge
2	This PR	Optimization: Measurement noise	Medium	Review

Progressive Impact:

Original: 40% error rate (6/10 runs correct)
After PR #1978: Fixes cold-start bias, but measurement noise may still cause occasional errors
After this PR: 0% error rate (10/10 runs correct, eliminates measurement noise)

Benefits of splitting:

PR #1978 can merge immediately (low risk, critical bug fix)
This PR can be validated thoroughly (affects benchmarking timing)
Easier to isolate regressions if issues arise

Submission Checklist

Problem: The original code only applied a warm-up run to the first configuration (n_current == 0), leading to unfair comparison. The first configuration always benefited from warm-up, while subsequent configurations suffered from cold-start performance penalties.

Impact: This caused up to 40% false negative rate in kernel selection, resulting in 4x performance degradation when the optimal kernel was incorrectly rejected due to cold-start bias.

Solution: Remove the 'if(n_current == 0)' condition to ensure every configuration receives a warm-up run before performance measurement. This guarantees fair comparison across all kernel configurations.

Test Results: Verified on MI355X (gfx950) with 100 test runs - optimal kernel is now consistently selected (10/10 runs vs 6/10 before the fix).

Problem:
- Only the first kernel configuration received warm-up, causing cold-start performance bias for subsequent configurations
- The 1.1x early-stop threshold was too aggressive, sometimes discarding potentially optimal kernels due to first-sample variance (CV=11.9% of 100 samples)

Solution:
- Add warm-up run for every kernel configuration to eliminate cold-start bias
- Implement 2 initial tests + take minimum strategy (2nd sample CV=3.1%)
- Increase early-stop threshold from 1.1x to 1.2x to reduce false negatives

Impact:
- Ensures fair comparison across all kernel configurations
- Reduces sampling noise from 11.9% to 3.1% coefficient of variation
- Better balance between search speed and accuracy
- Based on 100-run stability analysis on gfx950 with ROCm 7.9.0

Testing:
- Verified on gfx950 with ROCm 7.9.0
- Tested with convolution backward data workloads (NHWC layout)
- Confirmed stable performance across multiple runs
- command: MIOPEN_FIND_ENFORCE=4 MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=7 MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC -m conv -g 1 -t 1 -F 2

# Conflicts:
#	projects/miopen/src/include/miopen/generic_search.hpp

…nel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

> **📝 Note**: This PR has follow up PR #1993

## Motivation

MIOpen's generic search algorithm suffers from a **race condition** that causes optimal kernels to be randomly rejected, leading to 3-4x performance degradation in some cases.

### Problem Description

When running the same convolution workload multiple times as sample below:

```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**
- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per operation (**3.35x slower**)

[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)
[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines 559-564):

```cpp
// Original buggy code
if(n_current == 0)  // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first configuration tested:
- **First kernel** (n_current == 0): Gets warm-up → Fair performance measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:

Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
Kernel #1 (64,64,64,32,8,8...): 1 sample → 1.219 ms → rejected (cold-start!)
Kernel #2 (64,64,16,32,8,8...): 1 sample → 3.0267 ms → rejected (cold-start!)
Kernel #3 (64,16,64,32,8,8...): 1 sample → 0.482 ms → rejected (cold-start!)

Final execution: 0.332 ms (using Kernel #0)

Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):
- `DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
- Sample 1: **3.027 ms** ← Cold start! (30x slower than true performance)
- Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359, 0.347, 0.352 ms
- **True mean**: 0.354 ms (excluding cold-start outlier)
- **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
- **Wrong outcome**: Best kernel discarded due to unfair cold-start penalty

---

## Technical Details

### Changes

This PR contains **only the bug fix** - removing the unfair warm-up condition:

```diff
- // Warm-up run for first time invoker is used
- if(n_current == 0)
- {
-     invoker(profile_h, invoke_ctx);
-     profile_h.ResetKernelTime();
- }
+ // Warm-up run for every configuration to eliminate cold-start bias
+ invoker(profile_h, invoke_ctx);
+ profile_h.ResetKernelTime();
```

**File modified:** `projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564)

**Change summary:**
- 3 insertions(+), 6 deletions(-)
- Removes `if(n_current == 0)` condition
- Ensures every configuration receives one warm-up run before measurement

### Why This is Low Risk

1. **Minimal code change**: Only 4 lines changed
2. **No algorithm change**: Same sampling strategy, same early-stop logic
3. **Only ensures fairness**: All configs now receive identical warm-up treatment
4. **No performance regression**: Adds one extra kernel call per config (~0.3ms overhead per config)
5. **Negligible overhead**: For typical 4-config search, adds 1.2ms total (kernel compilation takes 10-30 seconds, so overhead is <0.01%)

---

## Test Plan

### Test Environment

- **Hardware**: MI355X (gfx950)
- **ROCm Version**: 7.0.2 (HIP 7.0.51831)
- **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel configurations)

### Test Command

```bash
export MIOPEN_LOG_LEVEL=5
export MIOPEN_FIND_MODE=1
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

### Test Results

#### Before Fix (with bug):
- **Success rate**: 6/10 runs selected optimal kernel (40% error rate)
- **Failure pattern**: Optimal kernel rejected when its cold-start time triggered early-stop
- **Performance impact**: Up to 4x slower when wrong kernel selected (0.332ms vs 0.099ms)

#### After Fix:
- **Success rate**: 10/10 runs selected optimal kernel (0% error rate)
- **Consistency**: All configurations receive fair warm-up
- **Performance**: Optimal kernel always selected, no degradation
- **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation time)

---

## Performance Impact

### Search Time Overhead
- **Additional cost**: 1 warm-up run per configuration (only for configs beyond the first)

### Accuracy Improvement
- **Before**: 60% success rate (6/10 runs correct)
- **After**: 100% success rate (10/10 runs correct)
- **Performance gain**: Eliminates 4x slowdown from selecting wrong kernel

---

## Backward Compatibility

✅ **Fully compatible**
- No API changes
- No behavior changes except for fixing the bug
- All existing tests pass
- No impact on already-cached kernels (find database not affected)

---

## Why Split into Two PRs?

Following reviewer feedback, this work has been split into two separate PRs:

### **PR1 (This PR) - Bug Fix: Warm-up Bias**
- **Risk**: Low (4 lines changed)
- **Impact**: Fixes root cause of unfair kernel comparison
- **Decision**: Ready for immediate merge
- **Rationale**: Without fair warm-up, no amount of threshold tuning can fix the problem. Cold-start penalties (30-100x slower) make any single threshold value inadequate.

### **PR2 (Separate PR) - Optimization: Early-Stop Strategy**
- **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993
- **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging
- **Risk**: Medium (affects benchmark timing)
- **Decision**: Needs more validation and benchmarking
- **Rationale**: These optimizations improve accuracy further (40% → 0% error rate) but add ~2 kernel executions per config. The performance impact needs separate evaluation.

### Why This Approach?
- **PR1 can merge immediately**: Fixes the critical bug with minimal risk
- **PR2 can be validated thoroughly**: Performance trade-offs can be evaluated independently
- **Easier to isolate regressions**: If issues arise, we know which change caused them
- **Progressive improvement**: Get the bug fix deployed while optimizations are being validated

---

## Submission Checklist

- [x] I have read and agreed with the [contributing guidelines](CONTRIBUTING.md)
- [x] The changes are minimal and focused on the bug fix
- [x] All existing tests pass
- [x] The fix has been verified on target hardware (MI355X/gfx950)
- [x] The fix eliminates the non-deterministic kernel selection issue
- [x] No performance regression introduced
- [x] Documentation (commit message) clearly explains the problem and solution
- [x] Backward compatibility confirmed
- [x] Test data and logs provided for verification


JonathanLichtnerAMD

I was talking to Randy and he really wanted to look at this PR, but he won't be back until the 20th. I'll put a "request changes" on this until he gets back.

I think we also will want the changes that Nolan and Chris are doing to either go in first or at the same time.

cderb · 2025-10-13T23:01:05Z

These changes aren't incompatible with the work in #1536.
The increased accuracy is desirable, and I appreciate the rigor in the analysis.
The concern I have is with tuning performance, as now we would be committing to 3 runs of each kernel (the first being discarded) before we make a decision on whether to do a full analysis.
I have observed some solvers may have a particular kernel variant which is by far and away worse than the others for the given shape.
Maybe having a 2 level system to this might help? Like if the first sample is less than 1.5x and if between the 2 samples the best is less than 1.2x.

JonathanLichtnerAMD · 2025-10-21T18:03:09Z

> I was talking to Randy and he really wanted to look at this PR, but he won't be back until the 20th. I'll put a "request changes" on this until he gets back.
>
> I think we also will want the changes that Nolan and Chris are doing to either go in first or at the same time.

@randyspauldingamd is back, so he can take a look at this PR now. I will remove my blocking change request.

Randy is back, so he can look at this now.

randyspauldingamd · 2025-10-23T13:43:12Z

Hi Joe, thanks for doing this, it's a great start! It does need some refinement, and the theory part needs to be softened. The first thing I'll suggest to anyone is Rule #1: always plot the data. This is a histogram of your 100 runs, samples 2-10:

We can quickly see there's nothing "normal" about GPU kernel runtimes. This demonstrates that statistical analysis of this kind of data (Z-scores etc.) is generally anecdotal at best.

Now, there are a few notable features:

There is a hard minimum-time cutoff which represents the actual number of clock cycles required for the computation. The rest of the structure comes from many (perhaps hundreds of) stochastic processes that may or may not affect each sample individually.
There are a couple large peaks to the left, and a broader, quasi-normal peak close to center. If we increase the bin count we see that the two to the left also resolve into quasi-normal distributions (see below). We could guess at what causes these, or do more runs to attempt to further reduce the noise, but this isn't typically very fruitful so I won't :). We're also starting to see hints about correlation between the peaks, but let's not go into that now.
There is a small tail at long times. The big delays might be driver effects such as varying clock rates for power management; system interrupts; etc. Understanding them is not crucial, but visualizing it is. The final histogram, which combines samples 3-10, shows that almost all of the worst offenders came from sample 2! If this were a high-precision application, I would refine the technique to try to eliminate them, and possibly end up rejecting all the sample 2's as well. Since it is not, I think the data show that sample 2 is good enough for our needs to be included.

Next, recommendations:

At first glance I wanted to also discard Sample 1 for the sake of simplicity. However, out of your 100 runs, Sample 1 was faster in 15 of them. So I agree with your choice to take the min.
As Chris said, we should reduce total runtime if we can by adding an "earlier" stop criterion. Of your 100 runs, the slowest Sample 1 was 46% slower than the mean of Samples 2-10. So a factor of 1.5 is probably too conservative. I do have a concern here that the delays are not always proportional; part of it is additive, which will be problematic for kernels having short run times. However, the penalty for getting it wrong is not very big so I think we're safe to neglect this.

I might suggest a factor of 1.8 for a first cutoff from Sample 1. If that passes, follow it with a 2nd cutoff on the minimum of Samples 1 and 2. From your data, the largest min was 16% slower than the mean, so 1.2 seems a reasonable first choice for this factor.
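The two-level cutoff described here can be sketched as follows. This is an illustration under stated assumptions, not code from the PR: `TwoLevelCheck`, `EarlyStop`, and `reference_time` are hypothetical names, and both factors are the first-guess values from this comment.

```cpp
#include <algorithm>
#include <cassert>

// Factors suggested in the review; starting points, not tuned values.
constexpr float FIRST_CUTOFF  = 1.8f; // applied to Sample 1 alone
constexpr float SECOND_CUTOFF = 1.2f; // applied to min(Sample 1, Sample 2)

enum class EarlyStop
{
    RejectAfterOne, // Sample 1 alone was bad enough to stop
    RejectAfterTwo, // the better of two samples was still too slow
    Continue        // proceed to the full set of samples
};

// `second_sample` is a callable that launches the kernel once more and
// returns its time in ms. It is invoked only if Sample 1 passes the first
// cutoff, which is where the tuning-time saving comes from.
template <typename SecondSample>
EarlyStop TwoLevelCheck(float sample1, SecondSample&& second_sample, float reference_time)
{
    assert(reference_time > 0.0f);

    if(sample1 / reference_time >= FIRST_CUTOFF)
        return EarlyStop::RejectAfterOne;

    const float best = std::min(sample1, second_sample());
    if(best / reference_time >= SECOND_CUTOFF)
        return EarlyStop::RejectAfterTwo;

    return EarlyStop::Continue;
}
```

A kernel variant that is far worse than the rest is rejected after a single measured run, while a noisy first sample below the 1.8x cutoff still gets a second chance.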

Finally, I will add it wouldn't hurt to repeat this exercise on a few hundred or thousand Solver/shape combinations, and greatly increase the number of runs to build confidence. Let's leave that for another day though :).

JoeLiuAMD · 2025-10-24T07:43:04Z

> Hi Joe, thanks for doing this, it's a great start! It does need some refinement, and the theory part needs to be softened. The first thing I'll teach you is Rule #1: always plot the data. This is a histogram of your 100 runs, samples 2-10:

Hi Randy,

Thanks for the histogram analysis. There's a fundamental difference in our analytical approaches that needs clarification:

Different Statistical Dimensions

Your histogram analysis: Pooled distribution of all Sample 2-10 values (900 data points mixing different sample positions) - this shows GPU timing complexity but isn't directly relevant to early-stop decisions.

Our algorithm requires: Cross-run stability for individual sample positions - "Are the initial samples (Sample 1, Sample 2) from this run stable enough to be representative for cutoff decisions?"

For example:

Sample 1 cross-run: 100 independent measurements at position 1 → CV=11.9%
Sample 2 cross-run: 100 independent measurements at position 2 → CV=3.1%
This indicates Sample 1 is too noisy for reliable early-stop decisions

The pooled distribution characteristics don't affect this cross-run prediction problem.

Adopting Empirical Approach

The empirical approach makes sense. My PR included both empirical validation (0/100 false negatives, 100-run stability testing) and theoretical Z-score analysis - the latter served to supplement our empirical findings with statistical explanation. The data-driven thresholds (1.8x for Sample 1, 1.2x for min) are more robust for GPU timing complexities.

Additional Observations

I've discovered a potential pattern: kernels <0.01ms show significantly higher variability (see the CV below for different shapes). It seems the shorter the kernel time, the higher the CV in samples. This suggests we may need adaptive thresholds based on kernel runtime - shorter kernels might require more conservative cutoffs due to higher measurement noise relative to signal.

I'm collecting more comprehensive data across different kernel time ranges to establish whether we should implement runtime-dependent thresholds (e.g., 1.8x for >0.01ms kernels, 2.5x for <0.01ms kernels). Was this kernel-time dependency considered in the original early-stop design? Do you have data from the initial implementation that supports the original design?

Implementation Plan

I'll update the PR to:

Prioritize empirical thresholds over theoretical Z-score calculations
Implement the dual-cutoff approach (1.8x → 1.2x)
Reposition Z-score analysis as supporting statistical context
Add runtime-dependent threshold investigation

Best regards,
Joe

AnzhongHuang · 2025-10-24T09:15:03Z

A two-level cutoff (1.8x followed by 1.2x) for early stopping is preferable. If needed, we could extend to three levels or more, but more data statistics would be required to support them.

BrianHarrisonAMD · 2025-10-26T16:22:59Z

Greetings @JoeLiuAMD!

Thanks for all the analysis and effort to put this together. Very appreciated!
Regarding the original data, I believe we saw similar things.
Short running kernels (<0.01 ms) seemed to have very high relative variance which makes % based cut-offs less effective.
I am unable to find the individual kernel runtimes, but I'll share the data I could find.

The next steps are looking good.
Just wanted to mention there is another very similar evaluation happening at the solver level (code here).
If we decide on a better method for outlier rejection, then we should also consider updating the solver evaluation in a similar way as well.

randyspauldingamd · 2025-10-28T22:06:52Z

> Additional Observations
>
> I've discovered a potential pattern: kernels <0.01ms show significantly higher variability (see the CV below for different shapes). It seems the shorter the kernel time, the higher the CV in samples. This suggests we may need adaptive thresholds based on kernel runtime - shorter kernels might require more conservative cutoffs due to higher measurement noise relative to signal.
>
> I'm collecting more comprehensive data across different kernel time ranges to establish whether we should implement runtime-dependent thresholds (e.g., 1.8x for >0.01ms kernels, 2.5x for <0.01ms kernels). Was this kernel-time dependency considered in the original early-stop design? Do you have data from the initial implementation that supports the original design?

As Brian also suggested, I think the analysis you're doing is better and more comprehensive than any in the past. I don't know of any other data besides what he can dig up.

More fun stuff:
I noticed this initially, but didn't point it out because I was trying to keep things simple: there is also evidence that the run methodology could be improved. In your original 100 runs, the warmup times are extremely bimodal. One group (45) runs appx. 100x longer than the rest (55). This itself doesn't have an effect on your initial results, but it does have a strong measurable effect on Sample 1:

Sample 1 avg, slow warmup: 0.1114 ± 0.00058 ms
Sample 1 avg, fast warmup: 0.0896 ± 0.00037 ms (37 stdev's of the mean separation)

We may not be able to neglect this for short-running kernels. I'm afraid I don't have much time to help with this decision, and we also need more data, so please forge ahead. If you do feel that this is a problem, we can discuss a few things to try. It could be mostly power management being too aggressive at identifying idle time and shutting things down. The simplest workaround would be to just sort the runs into "slow warmup" and "fast warmup" sets. I think we could just ignore the "fast" set and use "slow" only since it is the worst-case scenario, but it may not be this simple. Alternatively, add a delay between each run to try to force "slow mode" every time. Perhaps 10-100 ms.
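The "sort the runs into slow warmup and fast warmup sets" workaround could look roughly like this. It is only a sketch: the `Run` and `GroupMeans` types, the function name, and the split boundary are assumptions, and the data are just the six rows listed under Raw Data Sample in the PR description, not the full 100 runs.

```cpp
#include <cassert>
#include <vector>

struct Run
{
    double warmup_ms;
    double sample1_ms;
};

struct GroupMeans
{
    double slow_warmup; // mean Sample 1 over runs with a slow warm-up
    double fast_warmup; // mean Sample 1 over runs with a fast warm-up
};

// Splits runs at `split_ms` (a hypothetical boundary between the two
// warm-up modes) and averages Sample 1 within each group.
GroupMeans Sample1MeanByWarmup(const std::vector<Run>& runs, double split_ms)
{
    double slow_sum = 0.0, fast_sum = 0.0;
    int slow_n = 0, fast_n = 0;
    for(const Run& r : runs)
    {
        if(r.warmup_ms >= split_ms)
        {
            slow_sum += r.sample1_ms;
            ++slow_n;
        }
        else
        {
            fast_sum += r.sample1_ms;
            ++fast_n;
        }
    }
    assert(slow_n > 0 && fast_n > 0);
    return {slow_sum / slow_n, fast_sum / fast_n};
}

// The six rows shown under "Raw Data Sample" (warm-up, Sample 1), in ms.
const std::vector<Run> kRawRows = {
    {9.712, 0.1084}, {0.084, 0.0907}, {8.703, 0.1217},
    {0.099, 0.0819}, {8.649, 0.1103}, {0.088, 0.0913},
};
```

With a split at 1.0 ms, even these six rows show the same direction as the full-dataset figures above: Sample 1 is slower on average after a slow warm-up.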

Note that there are likely driver settings that will reduce power management or disable it entirely, which should make the results more stable. While this is better from a scientific standpoint, it is probably not desirable here though, since we need these results to be representative of what users will experience.

Back to your observations:
Yes, the stochastic interruptions are not all proportional; constant and linear delays cause shorter-running operations to exhibit more chaotic runtimes. Short kernels will also show more variation in sensitivity to platform specifics such as CPU/memory architectures, driver version, ASIC revision etc. We can't test on all platforms, so we'll just do our best to build in an appropriate amount of "squishiness." Btw, the factor of 1.8 that I suggested was already pretty soft for this exact reason, but I agree that your new data show this is not enough and I'm glad you brought it up.

I believe that reducing tuning time is the dominant goal for this effort, but end-user runtime is ultimately more important. For tuning short-running kernels, additional iterations are similar cost to even simple CPU analysis, so we shouldn't be hesitant to be creative, but not get too fancy with the implementation.

A simple runtime-dependent threshold like you suggest is a very valid option. A couple other options would include:

always use 2-sample-min for early-stop if sample 1 runtime is below a TBD limit (e.g. 0.05 ms)
use an adaptive threshold factor instead of hard-coded values by defining a (simple) function, clamped at appropriate values. Something like f = 0.15 / (t + 0.05) where t is in ms produces values in a meaningful range that can be coerced into a range such as 1.6 <= f <= 2.5. (exact values TBD ofc)
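The second option can be written down directly. The function name is hypothetical; the formula and the clamp range are the example values from this comment, which are explicitly still to be determined.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Adaptive early-stop factor: f = 0.15 / (t + 0.05), with t in ms,
// clamped to [1.6, 2.5]. All constants are placeholders from the review.
double AdaptiveThreshold(double t_ms)
{
    assert(t_ms >= 0.0);
    const double raw = 0.15 / (t_ms + 0.05);
    return std::clamp(raw, 1.6, 2.5);
}
```

With these constants the upper clamp applies below t = 0.01 ms and the lower clamp applies above roughly t = 0.044 ms, so the factor only varies for kernels between those two runtimes.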

Cheers,
Randy

JoeLiuAMD · 2025-11-19T08:50:48Z

I've been focused on urgent customer GEMM work recently, which caused the delay of this PR.

After collecting more data across different kernel time ranges, I've identified some patterns that may lead to a more effective search strategy. I need additional time to validate the approach and will update here with results. Thanks.

randyspauldingamd · 2025-11-19T17:46:39Z

> I've been focused on urgent customer GEMM work recently, which caused the delay of this PR.
>
> After collecting more data across different kernel time ranges, I've identified some patterns that may lead to a more effective search strategy. I need additional time to validate the approach and will update here with results. Thanks.

Awesome, thank you again!

github-actions · 2026-04-22T21:20:48Z

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

Add new commits
Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

github-actions · 2026-04-30T01:04:35Z

This pull request has been automatically closed due to inactivity (30 days with no updates).

If you'd like to continue working on this, feel free to reopen the PR or create a new one.

JoeLiuAMD added 2 commits October 6, 2025 20:45

JoeLiuAMD requested a review from a team as a code owner October 7, 2025 03:25

github-actions Bot added the project: miopen label Oct 7, 2025

assistant-librarian Bot added the organization: ROCm label Oct 7, 2025

JoeLiuAMD mentioned this pull request Oct 7, 2025

[MIOpen] bugfix: GenericSearch warm-up bias: apply warm-up to all kernel configurations #1978

Merged

9 tasks

Merge branch 'develop' into users/JoeLiuAMD/miopen-generic-search-optimization (03d9aa9)

JonathanLichtnerAMD previously requested changes Oct 8, 2025

github-actions Bot added the Stale label (PR has no activity for 25+ days) Apr 22, 2026

github-actions Bot closed this Apr 30, 2026

JoeLiuAMD wants to merge 3 commits into develop from users/JoeLiuAMD/miopen-generic-search-optimization


6 participants
