
[MIOpen] Improve GenericSearch early-stop strategy with dual-sample testing#1993

Closed
JoeLiuAMD wants to merge 3 commits into develop from users/JoeLiuAMD/miopen-generic-search-optimization

Conversation

@JoeLiuAMD
Contributor

Improve GenericSearch early-stop strategy with dual-sample testing

⚠️ Note: This PR depends on #1978 (warm-up bug fix) being merged first.

Motivation

Problem Description

After fixing the warm-up bias bug (PR #1978), the GenericSearch algorithm still exhibits suboptimal behavior due to high variance in initial performance measurements and an overly aggressive early-stop threshold.

Context from Production Logs:

Even with fair warm-up applied to all configurations, the same convolution workload still shows some variability:

MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2

Observed behavior:

Lucky_Joe_20250930.log
Normal_Joe_20250930.log

Root Causes

  1. High variance in first measurement: Analysis of 100 test runs shows the first sample has 11.9% coefficient of variation (CV), causing unreliable early-stop decisions
  2. Aggressive threshold: The 1.1x early-stop multiplier combined with high measurement noise leads to false negatives (discarding optimal kernels)

Impact

  • Residual measurement noise: Even with warm-up fix, first sample variance (CV=11.9%) can still cause occasional misselections
  • Non-deterministic results: Same workload may produce different kernel selections across runs due to measurement noise

Technical Details

Solution Overview

This PR implements three coordinated improvements to reduce measurement noise and make early-stop decisions more robust:

  1. Dual-sample initial testing: Run 2 initial tests and use the minimum for early-stop evaluation
  2. Relaxed early-stop threshold: Increase from 1.1x to 1.2x to account for measurement variance
  3. Enhanced logging: Add visibility into early-stop decisions for debugging

Detailed Changes

1. Dual-Sample Initial Testing (Lines 563-579)

// Run 2 initial tests and take the minimum to reduce noise
// (Based on 100-run stability analysis: 1st sample CV=11.9%, 2nd CV=3.1%)
float initial_time_1 = 0.0f;
float initial_time_2 = 0.0f;

invoker(profile_h, invoke_ctx);
initial_time_1 = profile_h.GetKernelTime();
profile_h.ResetKernelTime();

invoker(profile_h, invoke_ctx);
initial_time_2 = profile_h.GetKernelTime();
profile_h.ResetKernelTime();

// Use minimum of the two initial tests for early-stop threshold check
elapsed_time = std::min(initial_time_1, initial_time_2);
samples.push_back(initial_time_1);
samples.push_back(initial_time_2);

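Rationale: Across the 100-run stability analysis the second sample is far more stable than the first (CV 3.1% vs 11.9%), so taking the minimum of the two initial samples limits the effect of a noisy first measurement on the early-stop decision.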
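
As a standalone illustration of the dual-sample step, here is a minimal sketch that can be compiled outside MIOpen. The helper name `MinOfTwoInitialSamples` and the `timed_run` callable are hypothetical stand-ins for the `invoker(...)` / `GetKernelTime()` / `ResetKernelTime()` sequence in the snippet above; they are not part of the MIOpen API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not MIOpen code): `timed_run` stands in for one timed
// kernel launch and must return the elapsed time in milliseconds.
template <typename TimedRun>
float MinOfTwoInitialSamples(TimedRun&& timed_run, std::vector<float>& samples)
{
    const float t1 = timed_run(); // first initial sample (noisier in the data above)
    const float t2 = timed_run(); // second initial sample
    assert(t1 >= 0.0f && t2 >= 0.0f);

    // Both samples are kept so they still count toward the final average.
    samples.push_back(t1);
    samples.push_back(t2);

    // Only the faster of the two is used for the early-stop comparison.
    return std::min(t1, t2);
}
```

With the run-2 values from the raw data in this description (0.1217 ms, then 0.0854 ms), the helper would return 0.0854 ms, so the slow first sample alone would not decide the early-stop check.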

2. Relaxed Early-Stop Threshold (Lines 604-607)

-                if(elapsed_time / worst_time < 1.10f)
+                constexpr float EARLY_STOP_THRESHOLD = 1.20f;
+                if(elapsed_time / worst_time < EARLY_STOP_THRESHOLD)

Rationale: The 1.2x threshold provides adequate margin for measurement noise while still effectively filtering out poor configurations.

3. Sampling Loop Adjustment (Lines 616-617)

-                        for(int i = 1; i < N_RUNS; ++i)
+                        // Continue with 8 more samples (we already have 2 initial samples)
+                        for(int i = 2; i < N_RUNS; ++i)

Rationale: Maintains 10 total samples (2 initial + 8 additional) for statistical stability.

4. Enhanced Logging (Lines 658-664)

+                else
+                {
+                    MIOPEN_LOG_I2("Configuration discarded by early-stop: " << elapsed_time << " / "
+                                                                            << worst_time << " = "
+                                                                            << (elapsed_time / worst_time)
+                                                                            << " >= " << EARLY_STOP_THRESHOLD);
+                }

Rationale: Provides visibility into why configurations are rejected, aiding in debugging and validation.


Statistical Analysis

Data Collection Methodology

100 independent test runs on MI355X (gfx950), ROCm 7.0.2:

  • 11 samples per run (1 warm-up + 10 measurements)
  • Total: 1,100 data points analyzed
MIOPEN_FIND_ENFORCE=4 MIOPEN_LOG_LEVEL=7 \
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2

Key Findings

Sample             | Mean (ms) | Std Dev (ms) | CV     | Interpretation
-------------------|-----------|--------------|--------|----------------------------------
Sample 0 (warm-up) | 1.537     | 3.835        | 249.6% | Unreliable (30% cold-start)
Sample 1           | 0.0959    | 0.0114       | 11.9%  | ❌ Too high variance for early-stop
Sample 2           | 0.0902    | 0.0028       | 3.1%   | ✅ Much more stable
Samples 2-10 avg   | 0.0873    | 0.0038       | 4.4%   | True performance baseline

Key insight: CV drops from 11.9% → 3.1% between Sample 1 and Sample 2, justifying dual-sample strategy.

Raw Data Sample

Run  | Warm-up  | Sample1   | Sample2   | Samples 3-10 (mean)
-----|----------|-----------|-----------|--------------------
0    | 9.712 ms | 0.1084 ms | 0.0992 ms | 0.0884 ms
1    | 0.084 ms | 0.0907 ms | 0.0817 ms | 0.0823 ms
2    | 8.703 ms | 0.1217 ms | 0.0854 ms | 0.0881 ms
3    | 0.099 ms | 0.0819 ms | 0.0909 ms | 0.0852 ms
4    | 8.649 ms | 0.1103 ms | 0.0942 ms | 0.0889 ms
5    | 0.088 ms | 0.0913 ms | 0.0829 ms | 0.0827 ms
...

Full dataset: 100runs_sample.log
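
For readers who want to recompute the coefficient of variation from the log, a minimal sketch follows. It assumes the sample standard deviation (n - 1 denominator); the description does not say which estimator was used, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Coefficient of variation = standard deviation / mean.
// Assumption: sample standard deviation with an (n - 1) denominator.
double CoefficientOfVariation(const std::vector<double>& xs)
{
    assert(xs.size() >= 2);

    double sum = 0.0;
    for(double x : xs)
        sum += x;
    const double mean = sum / static_cast<double>(xs.size());

    double sq_dev = 0.0;
    for(double x : xs)
        sq_dev += (x - mean) * (x - mean);
    const double stddev = std::sqrt(sq_dev / static_cast<double>(xs.size() - 1));

    return stddev / mean;
}
```

Feeding it the six Sample1 values from the excerpt above gives a visibly larger CV than the six Sample2 values. Six rows are only an excerpt, so the results will not match the 100-run figures exactly.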

False Negative Rate Analysis

Theoretical calculation using Z-score methodology:

Assume timing measurements follow a normal distribution N(μ, σ²) where:

  • μ = mean of 10 samples ≈ 0.0873 ms (from empirical data)
  • σ = standard deviation ≈ 0.0038 ms

For two independent samples X₁, X₂ ~ N(μ, σ²):

P(X_min > t) = P(X₁ > t) · P(X₂ > t) = [1 - Φ((t-μ)/σ)]²
Where Φ is the standard normal CDF

Strategy Comparison:

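Strategy | Samples | Threshold | Threshold Value | Mean      | Std Dev   | Z-score | False Negative Rate        | Assessment
---------|---------|-----------|-----------------|-----------|-----------|---------|----------------------------|------------
Original | 1       | 1.1x      | 0.0960 ms       | 0.0959 ms | 0.0114 ms | 0.09    | 46.4%                      | ❌ Too risky
This PR  | 2 (min) | 1.2x      | 0.1048 ms       | 0.0873 ms | 0.0038 ms | 4.61    | 4 × 10⁻¹² (<0.00000001%)   | ✅ Near-zero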

Detailed calculation for new strategy:

Threshold: t = 1.2 × 0.0873 = 0.1048 ms
Z-score: z = (0.1048 - 0.0873) / 0.0038 ≈ 4.61
P(single sample > 0.1048) = 1 - Φ(4.61) ≈ 2 × 10⁻⁶
P(min(X₁, X₂) > 0.1048) = (2 × 10⁻⁶)² ≈ 4 × 10⁻¹²
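
The calculation above can be reproduced numerically. The sketch below uses only the C++ standard library and the identity 1 - Φ(z) = erfc(z / √2) / 2. It keeps the same assumption as the text (two independent, normally distributed samples), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Upper-tail probability of the standard normal: 1 - Phi(z).
double UpperTail(double z) { return 0.5 * std::erfc(z / std::sqrt(2.0)); }

// Probability that BOTH of two independent N(mu, sigma^2) samples exceed t,
// i.e. P(min(X1, X2) > t) = [1 - Phi((t - mu) / sigma)]^2.
double ProbMinExceeds(double t, double mu, double sigma)
{
    assert(sigma > 0.0);
    const double p_single = UpperTail((t - mu) / sigma);
    return p_single * p_single;
}
```

Calling `ProbMinExceeds(1.2 * 0.0873, 0.0873, 0.0038)` lands in the 10⁻¹² range, the same order of magnitude as the figure quoted above. The result is only as good as the normality and independence assumptions behind it.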

Empirical validation:

  • In 100 experimental runs, min(X₁, X₂) never exceeded 1.2 × mean
  • Observed false negative rate: 0/100 = 0%

Conclusion: Both theory (10⁻¹² probability) and empirical data (0/100 runs) confirm that the 1.2x threshold with dual-sample strategy makes false rejections virtually impossible.


Test Plan & Results

Environment: MI355X (gfx950), ROCm 7.0.2, grouped convolution backward data

Test command:

MIOPEN_LOG_LEVEL=5 MIOPEN_FIND_MODE=1 \
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC \
  --out_layout NHWC -m conv -g 1 -t 1 -F 2

Results:

Test                | Success Rate                       | Note
--------------------|------------------------------------|-------------------------------
Original (no fixes) | 6/10 (40% error)                   | Baseline
PR #1978 only       | Improved, occasional errors remain | Warm-up fix applied
This PR             | 10/10 (0% error)                   | Full solution
100-run stability   | 100/100                            | 0.099ms ± 0.003ms (CV=3.0%)

Performance Impact

Search Time Overhead: 1 extra initial test per config (typically <1ms vs 10-30s compilation) → <0.01% overhead

Accuracy Improvement:

  • Success rate: 6/10 → 10/10 (0% error)
  • False negative rate: Theoretical 10⁻¹², empirical 0/100
  • Performance: 0.099ms ± 0.003ms (CV=3.0%), consistent across runs

Backward Compatibility

Fully compatible

  • No API changes
  • No database format changes
  • Only internal algorithm improvements
  • Same algorithm flow, enhanced measurement strategy

Alternative Solutions Considered

Strategy    | Tests/Config    | Threshold | Accuracy       | Search Overhead  | Decision
------------|-----------------|-----------|----------------|------------------|--------------------------------
A (This PR) | 2 initial (min) | 1.2x      | ~100% (10⁻¹²)  | +1 test/config   | Selected - Best balance
B           | 3 initial (min) | 1.1x      | ~100%          | +2 tests/config  | ❌ Higher overhead, minimal gain
C           | 1 initial       | 1.3x      | ~95%           | Same as baseline | ❌ Still 5% false negatives
D           | 1 initial       | 1.1x      | ~54%           | Same as baseline | ❌ Unacceptable error rate

Rationale for Strategy A:

  • Achieves near-perfect accuracy (theoretical: 10⁻¹², empirical: 0/100)
  • Minimal overhead (1 extra test per config, negligible vs compilation)
  • The 1.2x threshold is statistically justified (Z-score = 4.61)
  • 2 initial samples provide sufficient noise reduction without excessive cost
  • Mathematically proven and empirically validated

Why Two PRs?

Following reviewer feedback, the complete fix has been split into two stages:

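Stage | PR      | Focus                           | Risk          | Status
------|---------|---------------------------------|---------------|----------------
1     | #1978   | Bug Fix: Warm-up bias           | Low (4 lines) | Ready for merge
2     | This PR | Optimization: Measurement noise | Medium        | Review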

Progressive Impact:

Benefits of splitting:


Submission Checklist

Problem:
The original code only applied a warm-up run to the first configuration
(n_current == 0), leading to unfair comparison. The first configuration
always benefited from warm-up, while subsequent configurations suffered
from cold-start performance penalties.

Impact:
This caused up to 40% false negative rate in kernel selection, resulting
in 4x performance degradation when the optimal kernel was incorrectly
rejected due to cold-start bias.

Solution:
Remove the 'if(n_current == 0)' condition to ensure every configuration
receives a warm-up run before performance measurement. This guarantees
fair comparison across all kernel configurations.

Test Results:
Verified on MI355X (gfx950) with 100 test runs - optimal kernel is now
consistently selected (10/10 runs vs 6/10 before the fix).

Problem:
- Only the first kernel configuration received warm-up, causing cold-start
  performance bias for subsequent configurations
- The 1.1x early-stop threshold was too aggressive, sometimes discarding
  potentially optimal kernels due to first-sample variance (CV=11.9% of 100 samples)

Solution:
- Add warm-up run for every kernel configuration to eliminate cold-start bias
- Implement 2 initial tests + take minimum strategy (2nd sample CV=3.1%)
- Increase early-stop threshold from 1.1x to 1.2x to reduce false negatives

Impact:
- Ensures fair comparison across all kernel configurations
- Reduces sampling noise from 11.9% to 3.1% coefficient of variation
- Better balance between search speed and accuracy
- Based on 100-run stability analysis on gfx950 with ROCm 7.9.0

Testing:
- Verified on gfx950 with ROCm 7.9.0
- Tested with convolution backward data workloads (NHWC layout)
- Confirmed stable performance across multiple runs
- command:MIOPEN_FIND_ENFORCE=4 MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=7 MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC -m conv -g 1 -t 1 -F 2

# Conflicts:
#	projects/miopen/src/include/miopen/generic_search.hpp
@JoeLiuAMD JoeLiuAMD requested a review from a team as a code owner October 7, 2025 03:25
JoeLiuAMD added a commit that referenced this pull request Oct 8, 2025
…nel configurations (#1978)

# Fix GenericSearch warm-up bias: apply warm-up to all configurations

> **📝 Note**: This PR has a follow-up PR, #1993.

## Motivation

MIOpen's generic search algorithm suffers from a **cold-start measurement bias** that
causes optimal kernels to be rejected nondeterministically, leading to 3-4x
performance degradation in some cases.

### Problem Description

When running the same convolution workload multiple times, as in the
sample below:
```bash
MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 -p 1 -q 1 \
  -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

**Observed behavior:**
- **Lucky case**: Selected optimal kernel → **0.099 ms** per operation
- **Unlucky case**: Selected suboptimal kernel → **0.332 ms** per
operation (**3.35x slower**)


[Lucky_Joe_20250930.log](https://github.com/user-attachments/files/22714922/Lucky_Joe_20250930.log)

[Normal_Joe_20250930.log](https://github.com/user-attachments/files/22714923/Normal_Joe_20250930.log)

### Root Cause

**Cold-start bias in warm-up logic** (`generic_search.hpp`, lines
559-564):

```cpp
// Original buggy code
if(n_current == 0)  // ❌ Only first config gets warm-up
{
    invoker(profile_h, invoke_ctx);
    profile_h.ResetKernelTime();
}
```

This condition creates an **unfair advantage** for the first
configuration tested:
- **First kernel** (n_current == 0): Gets warm-up → Fair performance
measurement
- **Subsequent kernels** (n_current > 0): No warm-up → Cold-start
penalty (up to **100x slower** in extreme cases)

### Impact

- **High false negative rate**: Up to 40% chance of rejecting the
optimal kernel
- **Performance degradation**: 4x slower execution when suboptimal
kernel is selected
- **Non-deterministic behavior**: Kernel selection depends on which
configuration is tested first

### Example from Production Logs

**Environment**: MI355X (gfx950), ROCm 7.0.2

```
AI generated 4 kernel configurations for testing:

Kernel #0 (128,128,32,32,8,8...): 10 samples → avg 0.343166 ms → selected as "best"
Kernel #1 (64,64,64,32,8,8...):   1 sample  → 1.219 ms         → rejected (cold-start!)
Kernel #2 (64,64,16,32,8,8...):   1 sample  → 3.0267 ms        → rejected (cold-start!)
Kernel #3 (64,16,64,32,8,8...):   1 sample  → 0.482 ms         → rejected (cold-start!)

Final execution: 0.332 ms (using Kernel #0)

Issue: Kernel #2 suffered from cold-start bias (3.0267 ms first sample)
       With proper warm-up, its true performance is ~0.099 ms (3.4x faster than selected kernel)
```

**Detailed timing from Normal_Joe_20250930.log:**

Optimal kernel (incorrectly rejected due to cold-start):
-
`DeviceGroupedConvBwdData_Xdl_CShuffle_v1<64,64,16,32,8,8,Default,16,16,4,1,8,1,1,1>+1`
- Sample 1: **3.027 ms** ← Cold start! (30x slower than true
performance)
- Samples 2-11: 0.369, 0.349, 0.366, 0.352, 0.353, 0.365, 0.352, 0.359,
0.347, 0.352 ms
- **True mean**: 0.354 ms (excluding cold-start outlier)
- **Decision**: Rejected by early-stop (3.027 > 0.377 × 1.1)
- **Wrong outcome**: Best kernel discarded due to unfair cold-start
penalty

---

## Technical Details

### Changes

This PR contains **only the bug fix** - removing the unfair warm-up
condition:

```diff
-                // Warm-up run for first time invoker is used
-                if(n_current == 0)
-                {
-                    invoker(profile_h, invoke_ctx);
-                    profile_h.ResetKernelTime();
-                }
+                // Warm-up run for every configuration to eliminate cold-start bias
+                invoker(profile_h, invoke_ctx);
+                profile_h.ResetKernelTime();
```

**File modified:**
`projects/miopen/src/include/miopen/generic_search.hpp` (lines 559-564)

**Change summary:**
- 3 insertions(+), 6 deletions(-)
- Removes `if(n_current == 0)` condition
- Ensures every configuration receives one warm-up run before
measurement

### Why This is Low Risk

1. **Minimal code change**: Only 4 lines changed
2. **No algorithm change**: Same sampling strategy, same early-stop
logic
3. **Only ensures fairness**: All configs now receive identical warm-up
treatment
4. **No performance regression**: Adds one extra kernel call per config
(~0.3ms overhead per config)
5. **Negligible overhead**: For typical 4-config search, adds 1.2ms
total (kernel compilation takes 10-30 seconds, so overhead is <0.01%)

---

## Test Plan

### Test Environment
- **Hardware**: MI355X (gfx950)
- **ROCm Version**: 7.0.2 (HIP 7.0.51831)
- **Workload**: Grouped convolution backward data (NHWC layout, 4 kernel
configurations)

### Test Command
```bash
export MIOPEN_LOG_LEVEL=5
export MIOPEN_FIND_MODE=1
./bin/MIOpenDriver convbfp16 -n 8 -c 5 -H 225 -W 225 -k 64 -y 3 -x 3 \
  -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 \
  --in_layout NHWC --fil_layout NHWC --out_layout NHWC \
  -m conv -g 1 -t 1 -F 2
```

### Test Results

#### Before Fix (with bug):
- **Success rate**: 6/10 runs selected optimal kernel (40% error rate)
- **Failure pattern**: Optimal kernel rejected when its cold-start time
triggered early-stop
- **Performance impact**: Up to 4x slower when wrong kernel selected
(0.332ms vs 0.099ms)

#### After Fix:
- **Success rate**: 10/10 runs selected optimal kernel (0% error rate)
- **Consistency**: All configurations receive fair warm-up
- **Performance**: Optimal kernel always selected, no degradation
- **Overhead**: +1.2ms for 4 configs (negligible vs 10-30s compilation
time)

---

## Performance Impact

### Search Time Overhead
- **Additional cost**: 1 warm-up run per configuration (only for configs
beyond the first)

### Accuracy Improvement
- **Before**: 60% success rate (6/10 runs correct)
- **After**: 100% success rate (10/10 runs correct)
- **Performance gain**: Eliminates 4x slowdown from selecting wrong
kernel

---

## Backward Compatibility

✅ **Fully compatible** 
- No API changes
- No behavior changes except for fixing the bug
- All existing tests pass
- No impact on already-cached kernels (find database not affected)

---

## Why Split into Two PRs?

Following reviewer feedback, this work has been split into two separate
PRs:

### **PR1 (This PR) - Bug Fix: Warm-up Bias**
- **Risk**: Low (4 lines changed)
- **Impact**: Fixes root cause of unfair kernel comparison
- **Decision**: Ready for immediate merge
- **Rationale**: Without fair warm-up, no amount of threshold tuning can
fix the problem. Cold-start penalties (30-100x slower) make any single
threshold value inadequate.

### **PR2 (Separate PR) - Optimization: Early-Stop Strategy**
- **Branch**: `users/JoeLiuAMD/miopen-generic-search-optimization` #1993
- **Changes**: Dual-sample testing + 1.2x threshold + enhanced logging
- **Risk**: Medium (affects benchmark timing)
- **Decision**: Needs more validation and benchmarking
- **Rationale**: These optimizations improve accuracy further (40% → 0%
error rate) but add ~2 kernel executions per config. The performance
impact needs separate evaluation.

### Why This Approach?
- **PR1 can merge immediately**: Fixes the critical bug with minimal
risk
- **PR2 can be validated thoroughly**: Performance trade-offs can be
evaluated independently
- **Easier to isolate regressions**: If issues arise, we know which
change caused them
- **Progressive improvement**: Get the bug fix deployed while
optimizations are being validated


---

## Submission Checklist

- [x] I have read and agreed with the [contributing
guidelines](CONTRIBUTING.md)
- [x] The changes are minimal and focused on the bug fix
- [x] All existing tests pass
- [x] The fix has been verified on target hardware (MI355X/gfx950)
- [x] The fix eliminates the non-deterministic kernel selection issue
- [x] No performance regression introduced
- [x] Documentation (commit message) clearly explains the problem and
solution
- [x] Backward compatibility confirmed
- [x] Test data and logs provided for verification
Contributor

@JonathanLichtnerAMD JonathanLichtnerAMD left a comment


I was talking to Randy and he really wanted to look at this PR, but he won't be back until the 20th. I'll put a "request changes" on this until he gets back.

I think we also will want the changes that Nolan and Chris are doing to either go in first or at the same time.

@cderb
Contributor

cderb commented Oct 13, 2025

These changes aren't incompatible with the work in #1536.
The increased accuracy is desirable, and I appreciate the rigor in the analysis.
The concern I have is with tuning performance, as now we would be committing to 3 runs of each kernel (the first being discarded) before we make a decision on whether to do a full analysis.
I have observed some solvers may have a particular kernel variant which is by far and away worse than the others for the given shape.
Maybe having a 2 level system to this might help? Like if the first sample is less than 1.5x and if between the 2 samples the best is less than 1.2x.

@JonathanLichtnerAMD
Contributor

> I was talking to Randy and he really wanted to look at this PR, but he won't be back until the 20th. I'll put a "request changes" on this until he gets back.
>
> I think we also will want the changes that Nolan and Chris are doing to either go in first or at the same time.

@randyspauldingamd is back, so he can take a look at this PR now. I will remove my blocking change request.

@JonathanLichtnerAMD JonathanLichtnerAMD dismissed their stale review October 21, 2025 18:06

Randy is back, so he can look at this now.

@randyspauldingamd
Contributor

randyspauldingamd commented Oct 23, 2025

Hi Joe, thanks for doing this, it's a great start! It does need some refinement, and the theory part needs to be softened. The first thing I'll suggest to anyone is Rule #1: always plot the data. This is a histogram of your 100 runs, samples 2-10:
[Image: histogram of samples 2-10 from the 100 runs]
We can quickly see there's nothing "normal" about GPU kernel runtimes. This demonstrates that statistical analysis of this kind of data (Z-scores etc.) is generally anecdotal at best.

Now, there are a few notable features:

  1. There is a hard minimum-time cutoff which represents the actual number of clock cycles required for the computation. The rest of the structure comes from many (perhaps hundreds of) stochastic processes that may or may not affect each sample individually.
  2. There are a couple large peaks to the left, and a broader, quasi-normal peak close to center. If we increase the bin count we see that the two to the left also resolve into quasi-normal distributions (see below). We could guess at what causes these, or do more runs to attempt to further reduce the noise, but this isn't typically very fruitful so I won't :). We're also starting to see hints about correlation between the peaks, but let's not go into that now.
  3. There is a small tail at long times. The big delays might be driver effects such as varying clock rates for power management; system interrupts; etc. Understanding them is not crucial, but visualizing it is. The final histogram, which combines samples 3-10, shows that almost all of the worst offenders came from sample 2! If this were a high-precision application, I would refine the technique to try to eliminate them, and possibly end up rejecting all the sample 2's as well. Since it is not, I think the data show that sample 2 is good enough for our needs to be included.

Next, recommendations:

  1. At first glance I wanted to also discard Sample 1 for the sake of simplicity. However, out of your 100 runs, Sample 1 was faster in 15 of them. So I agree with your choice to take the min.
  2. As Chris said, we should reduce total runtime if we can by adding an "earlier" stop criterion. Of your 100 runs, the slowest Sample 1 was 46% slower than the mean of Samples 2-10. So a factor of 1.5 is probably too conservative. I do have a concern here that the delays are not always proportional; part of it is additive, which will be problematic for kernels having short run times. However, the penalty for getting it wrong is not very big so I think we're safe to neglect this.

I might suggest a factor of 1.8 for a first cutoff from Sample 1. If that passes, follow it with a 2nd cutoff on the minimum of Samples 1 and 2. From your data, the largest min was 16% slower than the mean, so 1.2 seems a reasonable first choice for this factor.
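
The two-level cutoff described here could look roughly like the sketch below. It is not the PR's code: `TwoStageCutoff`, the `EarlyStop` enum and the `timed_run` callable are hypothetical names, and `reference_time` stands for whatever time the existing early-stop check compares against (`worst_time` in the PR snippet). The 1.8 and 1.2 factors are the first-guess values from this comment.

```cpp
#include <algorithm>
#include <cassert>

enum class EarlyStop
{
    RejectAfterSample1, // first sample alone was far too slow
    RejectAfterSample2, // min of the first two samples was still too slow
    KeepSampling        // candidate is competitive: run the remaining samples
};

// Hypothetical sketch: `timed_run` returns one kernel time in milliseconds.
template <typename TimedRun>
EarlyStop TwoStageCutoff(TimedRun&& timed_run,
                         float reference_time,
                         float first_factor  = 1.8f,
                         float second_factor = 1.2f)
{
    assert(reference_time > 0.0f);

    // Stage 1: a loose cutoff on Sample 1 rejects clearly bad kernels
    // without paying for a second run.
    const float t1 = timed_run();
    if(t1 >= first_factor * reference_time)
        return EarlyStop::RejectAfterSample1;

    // Stage 2: a tighter cutoff on the minimum of Samples 1 and 2.
    const float t2 = timed_run();
    if(std::min(t1, t2) >= second_factor * reference_time)
        return EarlyStop::RejectAfterSample2;

    return EarlyStop::KeepSampling;
}
```

The design point is that a kernel variant that is far worse than the rest costs one measured run instead of two, while borderline candidates still get the benefit of the two-sample minimum.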

Finally, I will add it wouldn't hurt to repeat this exercise on a few hundred or thousand Solver/shape combinations, and greatly increase the number of runs to build confidence. Let's leave that for another day though :).

[Images: additional histograms referenced above (higher bin count; samples 3-10 combined)]

@JoeLiuAMD
Contributor Author

> Hi Joe, thanks for doing this, it's a great start! It does need some refinement, and the theory part needs to be softened. The first thing I'll teach you is Rule #1: always plot the data. This is a histogram of your 100 runs, samples 2-10:

Hi Randy,

Thanks for the histogram analysis. There's a fundamental difference in our analytical approaches that needs clarification:

Different Statistical Dimensions

Your histogram analysis: Pooled distribution of all Sample 2-10 values (900 data points mixing different sample positions) - this shows GPU timing complexity but isn't directly relevant to early-stop decisions.

Our algorithm requires: Cross-run stability for individual sample positions - "Are the initial samples (Sample 1, Sample 2) from this run stable enough to be representative for cutoff decisions?"

For example:

  • Sample 1 cross-run: 100 independent measurements at position 1 → CV=11.9%
  • Sample 2 cross-run: 100 independent measurements at position 2 → CV=3.1%
  • This indicates Sample 1 is too noisy for reliable early-stop decisions

The pooled distribution characteristics don't affect this cross-run prediction problem.

Adopting Empirical Approach

The empirical approach makes sense. My PR included both empirical validation (0/100 false negatives, 100-run stability testing) and theoretical Z-score analysis - the latter served to supplement our empirical findings with statistical explanation. The data-driven thresholds (1.8x for Sample 1, 1.2x for min) are more robust for GPU timing complexities.

Additional Observations

I've discovered a potential pattern: kernels <0.01ms show significantly higher variability (see the CV below for different shapes). It seems the shorter the kernel time, the higher the CV in samples. This suggests we may need adaptive thresholds based on kernel runtime - shorter kernels might require more conservative cutoffs due to higher measurement noise relative to signal.
[Image: CV for different kernel shapes]

I'm collecting more comprehensive data across different kernel time ranges to establish whether we should implement runtime-dependent thresholds (e.g., 1.8x for >0.01ms kernels, 2.5x for <0.01ms kernels). Was this kernel-time dependency considered in the original early-stop design? Do you have data from the initial implementation that supports the original design?

Implementation Plan

I'll update the PR to:

  • Prioritize empirical thresholds over theoretical Z-score calculations
  • Implement the dual-cutoff approach (1.8x → 1.2x)
  • Reposition Z-score analysis as supporting statistical context
  • Add runtime-dependent threshold investigation

Best regards,
Joe

@AnzhongHuang
Contributor

A two-level cutoff (1.8x followed by 1.2x) for early stopping is preferable. If needed, we could extend to three levels or more, but more data statistics would be required to support them.

@BrianHarrisonAMD
Contributor

Greetings @JoeLiuAMD!

Thanks for all the analysis and effort to put this together. Very appreciated!
Regarding the original data, I believe we saw similar things.
Short running kernels (<0.01 ms) seemed to have very high relative variance which makes % based cut-offs less effective.
I am unable to find the individual kernel runtimes, but I'll share the data I could find.

The next steps are looking good.
Just wanted to mention there is another very similar evaluation happening at the solver level (code here).
If we decide on a better method for outlier rejection, then we should also consider updating the solver evaluation in a similar way as well.

@randyspauldingamd
Contributor

> Additional Observations
>
> I've discovered a potential pattern: kernels <0.01ms show significantly higher variability (see the CV below for different shapes). It seems the shorter the kernel time, the higher the CV in samples. This suggests we may need adaptive thresholds based on kernel runtime - shorter kernels might require more conservative cutoffs due to higher measurement noise relative to signal.
>
> I'm collecting more comprehensive data across different kernel time ranges to establish whether we should implement runtime-dependent thresholds (e.g., 1.8x for >0.01ms kernels, 2.5x for <0.01ms kernels). Was this kernel-time dependency considered in the original early-stop design? Do you have data from the initial implementation that supports the original design?

As Brian also suggested, I think the analysis you're doing is better and more comprehensive than any in the past. I don't know of any other data besides what he can dig up.

More fun stuff:
I noticed this initially, but didn't point it out because I was trying to keep things simple: there is also evidence that the run methodology could be improved. In your original 100 runs, the warmup times are extremely bimodal. One group (45) runs appx. 100x longer than the rest (55). This itself doesn't have an effect on your initial results, but it does have a strong measurable effect on Sample 1:

  • Sample 1 avg, slow warmup: 0.1114 ± 0.00058 ms
  • Sample 1 avg, fast warmup: 0.0896 ± 0.00037 ms (37 stdev's of the mean separation)

We may not be able to neglect this for short-running kernels. I'm afraid I don't have much time to help with this decision, and we also need more data, so please forge ahead. If you do feel that this is a problem, we can discuss a few things to try. It could be mostly power management being too aggressive at identifying idle time and shutting things down. The simplest workaround would be to just sort the runs into "slow warmup" and "fast warmup" sets. I think we could just ignore the "fast" set and use "slow" only since it is the worst-case scenario, but it may not be this simple. Alternatively, add a delay between each run to try to force "slow mode" every time. Perhaps 10-100 ms.

Note that there are likely driver settings that will reduce power management or disable it entirely, which should make the results more stable. While this is better from a scientific standpoint, it is probably not desirable here though, since we need these results to be representative of what users will experience.

Back to your observations:
Yes, the stochastic interruptions are not all proportional; constant and linear delays cause shorter-running operations to exhibit more chaotic runtimes. Short kernels will also show more variation in sensitivity to platform specifics such as CPU/memory architectures, driver version, ASIC revision etc. We can't test on all platforms, so we'll just do our best to build in an appropriate amount of "squishiness." Btw, the factor of 1.8 that I suggested was already pretty soft for this exact reason, but I agree that your new data show this is not enough and I'm glad you brought it up.

I believe that reducing tuning time is the dominant goal for this effort, but end-user runtime is ultimately more important. For tuning short-running kernels, additional iterations are similar cost to even simple CPU analysis, so we shouldn't be hesitant to be creative, but not get too fancy with the implementation.

A simple runtime-dependent threshold like you suggest is a very valid option. A couple other options would include:

  • always use 2-sample-min for early-stop if sample 1 runtime is below a TBD limit (e.g. 0.05 ms)
  • use an adaptive threshold factor instead of hard-coded values by defining a (simple) function, clamped at appropriate values. Something like f = 0.15 / (t + 0.05) where t is in ms produces values in a meaningful range that can be coerced into a range such as 1.6 <= f <= 2.5. (exact values TBD ofc)
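
To make the second option concrete, here is a minimal sketch of the clamped function using the example constants from the bullet above (0.15, 0.05, and the range 1.6 to 2.5). All of these are explicitly TBD in this discussion, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Adaptive early-stop factor: f = 0.15 / (t + 0.05), with t in milliseconds,
// clamped to [1.6, 2.5]. Constants are the placeholder values from the
// discussion, not tuned results.
double AdaptiveThreshold(double t_ms)
{
    assert(t_ms >= 0.0);
    const double f = 0.15 / (t_ms + 0.05);
    return std::max(1.6, std::min(2.5, f));
}
```

With these constants the factor saturates at 2.5 for kernels at or below 0.01 ms, falls to 2.0 at 0.025 ms, and bottoms out at 1.6 from roughly 0.044 ms upward, so short kernels get the looser cutoff.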

Cheers,
Randy

@JoeLiuAMD
Contributor Author

I've been focused on urgent customer GEMM work recently, which caused the delay of this PR.

After collecting more data across different kernel time ranges, I've identified some patterns that may lead to a more effective search strategy. I need additional time to validate the approach and will update here with results. Thanks.

@randyspauldingamd
Contributor

> I've been focused on urgent customer GEMM work recently, which caused the delay of this PR.
>
> After collecting more data across different kernel time ranges, I've identified some patterns that may lead to a more effective search strategy. I need additional time to validate the approach and will update here with results. Thanks.

Awesome, thank you again!

@github-actions
Contributor

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

  • Add new commits
  • Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

@github-actions github-actions Bot added the Stale PR has no activity for 25+ days label Apr 22, 2026
@github-actions
Contributor

This pull request has been automatically closed due to inactivity (30 days with no updates).

If you'd like to continue working on this, feel free to reopen the PR or create a new one.

@github-actions github-actions Bot closed this Apr 30, 2026

Labels

organization: ROCm project: miopen Stale PR has no activity for 25+ days


6 participants