elastic · strawgate · Feb 26, 2026
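The stability check that this change adds as Step 5 falls back to a plain shell loop when the test framework has no repeat flag. As a rough sketch of that fallback, the loop could be wrapped in a small POSIX shell function. The name `repeat_test` and its messages are hypothetical and appear nowhere in the workflow; the function assumes the test command exits non-zero on failure.

```shell
# Hypothetical helper, not part of the workflow: run a command N times in
# sequence and stop at the first failing run.
# Usage: repeat_test <runs> <command> [args...]
repeat_test() {
  runs="$1"
  shift
  i=1
  while [ "$i" -le "$runs" ]; do
    # "$@" is the test command and its arguments, passed through unchanged.
    if ! "$@"; then
      echo "run $i of $runs failed: $*" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $runs runs passed: $*"
}
```

For example, `repeat_test 5 <test-command>` would run the command five times and return a non-zero status on the first failure, which mirrors the `|| exit 1` behaviour of the inline loop in the diff.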
diff --git a/.github/workflows/gh-aw-test-improvement.lock.yml b/.github/workflows/gh-aw-test-improvement.lock.yml
diff --git a/.github/workflows/gh-aw-test-improver.lock.yml b/.github/workflows/gh-aw-test-improver.lock.yml
diff --git a/.github/workflows/gh-aw-test-improver.md b/.github/workflows/gh-aw-test-improver.md
@@ -126,7 +126,21 @@ Identify under-tested code paths, add focused tests, and remove or consolidate d
 - Run the most relevant test command(s). **All tests — new and existing — must pass.** If the full suite is too heavy, run targeted tests.
 - If required commands, tests, or coverage cannot be run, call `noop`. Do not open a PR with untested test code.
 
-## Step 5: Quality Gate — Test Value Check
+## Step 5: Stability check — run new tests repeatedly
+
+New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
+
+1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
+   - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest).
+   - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
+2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
+   - Reliance on timing, sleep, or wall-clock assertions
+   - Shared mutable state between test cases
+   - Non-deterministic iteration order (e.g., map/set ordering)
+   - Dependence on external services or network
+3. If the test cannot be made reliably stable, do not include it in the PR. Call `noop` if no stable tests remain.
+
+## Step 6: Quality Gate — Test Value Check
 
 Before creating the PR, evaluate each new test:
 
@@ -138,7 +152,7 @@ Before creating the PR, evaluate each new test:
 
 If the tests don't pass this bar, call `noop`. Low-value tests are worse than no tests — they create maintenance burden and false confidence.
 
-## Step 6: Create the PR
+## Step 7: Create the PR
 
 1. Commit the changes locally.
 2. Call `create_pull_request` with: