From 6105b62b6db41fc8f997e775f6c302897534c8b5 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Feb 2026 13:48:23 +0000
Subject: [PATCH 1/3] Initial plan


From 6127ba84a71eaa4e097cdf45a7bbf105ece7a076 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Feb 2026 13:54:16 +0000
Subject: [PATCH 2/3] Add flaky test stability check step to test-improver
 workflow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a stability check as the new Step 5: the test-improver agent must
run each new or modified test at least 5 times before filing a PR. The
step gives language-agnostic guidance with examples for Go, Python,
JavaScript, and Ruby test frameworks, plus a shell loop fallback.
Renumber the subsequent steps (old Steps 5 and 6 become Steps 6 and 7).
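
The shell loop fallback can be sketched as a small helper, assuming a
POSIX shell. The name run_repeated is hypothetical and is not part of
the workflow; it only illustrates "run N times, stop at first failure".

```shell
# Hypothetical helper (not part of the workflow): run a command N times
# in sequence and stop at the first failing run.
run_repeated() {
  runs=$1
  shift
  i=1
  while [ "$i" -le "$runs" ]; do
    # "$@" is the test command and its arguments, passed through as-is.
    "$@" || { echo "run $i of $runs failed" >&2; return 1; }
    i=$((i + 1))
  done
  echo "all $runs runs passed"
}
```

Usage would be `run_repeated 5 <test-command>`; a non-zero return
means at least one run failed and the test should be treated as flaky.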

Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>
---
 .../workflows/gh-aw-test-improvement.lock.yml | 20 ++++++++++++++++---
 .../workflows/gh-aw-test-improver.lock.yml    | 20 ++++++++++++++++---
 .github/workflows/gh-aw-test-improver.md      | 18 +++++++++++++++--
 3 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/gh-aw-test-improvement.lock.yml b/.github/workflows/gh-aw-test-improvement.lock.yml
index 5a74b9fa..c2efe17f 100644
--- a/.github/workflows/gh-aw-test-improvement.lock.yml
+++ b/.github/workflows/gh-aw-test-improvement.lock.yml
@@ -41,7 +41,7 @@
 #
 # inlined-imports: true
 #
-# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"268cfc3d5e16d7108c5458825a159d4d643f02a8f574fb79afcf1ee8aba3e945"}
+# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"83801e7e9acc5954b0b6a3ac5842532c89300a09ed7a9ab7138af32251c990ec"}
 
 name: "Test Improver"
 "on":
@@ -335,7 +335,21 @@ jobs:
           - Run the most relevant test command(s). **All tests — new and existing — must pass.** If the full suite is too heavy, run targeted tests.
           - If required commands, tests, or coverage cannot be run, call `noop`. Do not open a PR with untested test code.
           
-          ## Step 5: Quality Gate — Test Value Check
+          ## Step 5: Stability check — run new tests repeatedly
+          
+          New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
+          
+          1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
+             - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest, `rspec --bisect` or loop in RSpec).
+             - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
+          2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
+             - Reliance on timing, sleep, or wall-clock assertions
+             - Shared mutable state between test cases
+             - Non-deterministic iteration order (e.g., map/set ordering)
+             - Dependence on external services or network
+          3. If the test cannot be made reliably stable, do not include it in the PR. Call `noop` if no stable tests remain.
+          
+          ## Step 6: Quality Gate — Test Value Check
           
           Before creating the PR, evaluate each new test:
           
@@ -347,7 +361,7 @@ jobs:
           
           If the tests don't pass this bar, call `noop`. Low-value tests are worse than no tests — they create maintenance burden and false confidence.
           
-          ## Step 6: Create the PR
+          ## Step 7: Create the PR
           
           1. Commit the changes locally.
           2. Call `create_pull_request` with:
diff --git a/.github/workflows/gh-aw-test-improver.lock.yml b/.github/workflows/gh-aw-test-improver.lock.yml
index 1784aabd..860b7c89 100644
--- a/.github/workflows/gh-aw-test-improver.lock.yml
+++ b/.github/workflows/gh-aw-test-improver.lock.yml
@@ -36,7 +36,7 @@
 #
 # inlined-imports: true
 #
-# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"268cfc3d5e16d7108c5458825a159d4d643f02a8f574fb79afcf1ee8aba3e945"}
+# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"83801e7e9acc5954b0b6a3ac5842532c89300a09ed7a9ab7138af32251c990ec"}
 
 name: "Test Improver"
 "on":
@@ -330,7 +330,21 @@ jobs:
           - Run the most relevant test command(s). **All tests — new and existing — must pass.** If the full suite is too heavy, run targeted tests.
           - If required commands, tests, or coverage cannot be run, call `noop`. Do not open a PR with untested test code.
           
-          ## Step 5: Quality Gate — Test Value Check
+          ## Step 5: Stability check — run new tests repeatedly
+          
+          New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
+          
+          1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
+             - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest, `rspec --bisect` or loop in RSpec).
+             - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
+          2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
+             - Reliance on timing, sleep, or wall-clock assertions
+             - Shared mutable state between test cases
+             - Non-deterministic iteration order (e.g., map/set ordering)
+             - Dependence on external services or network
+          3. If the test cannot be made reliably stable, do not include it in the PR. Call `noop` if no stable tests remain.
+          
+          ## Step 6: Quality Gate — Test Value Check
           
           Before creating the PR, evaluate each new test:
           
@@ -342,7 +356,7 @@ jobs:
           
           If the tests don't pass this bar, call `noop`. Low-value tests are worse than no tests — they create maintenance burden and false confidence.
           
-          ## Step 6: Create the PR
+          ## Step 7: Create the PR
           
           1. Commit the changes locally.
           2. Call `create_pull_request` with:
diff --git a/.github/workflows/gh-aw-test-improver.md b/.github/workflows/gh-aw-test-improver.md
index d62fd1d1..d90b4fce 100644
--- a/.github/workflows/gh-aw-test-improver.md
+++ b/.github/workflows/gh-aw-test-improver.md
@@ -126,7 +126,21 @@ Identify under-tested code paths, add focused tests, and remove or consolidate d
 - Run the most relevant test command(s). **All tests — new and existing — must pass.** If the full suite is too heavy, run targeted tests.
 - If required commands, tests, or coverage cannot be run, call `noop`. Do not open a PR with untested test code.
 
-## Step 5: Quality Gate — Test Value Check
+## Step 5: Stability check — run new tests repeatedly
+
+New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
+
+1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
+   - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest, `rspec --bisect` or loop in RSpec).
+   - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
+2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
+   - Reliance on timing, sleep, or wall-clock assertions
+   - Shared mutable state between test cases
+   - Non-deterministic iteration order (e.g., map/set ordering)
+   - Dependence on external services or network
+3. If the test cannot be made reliably stable, do not include it in the PR. Call `noop` if no stable tests remain.
+
+## Step 6: Quality Gate — Test Value Check
 
 Before creating the PR, evaluate each new test:
 
@@ -138,7 +152,7 @@ Before creating the PR, evaluate each new test:
 
 If the tests don't pass this bar, call `noop`. Low-value tests are worse than no tests — they create maintenance burden and false confidence.
 
-## Step 6: Create the PR
+## Step 7: Create the PR
 
 1. Commit the changes locally.
 2. Call `create_pull_request` with:

From 40b7fc02ea51ece28c1602ca85bb02b28bfc2dad Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Feb 2026 13:55:10 +0000
Subject: [PATCH 3/3] Remove incorrect rspec --bisect example from stability
 check step

Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>
---
 .github/workflows/gh-aw-test-improvement.lock.yml | 4 ++--
 .github/workflows/gh-aw-test-improver.lock.yml    | 4 ++--
 .github/workflows/gh-aw-test-improver.md          | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/gh-aw-test-improvement.lock.yml b/.github/workflows/gh-aw-test-improvement.lock.yml
index c2efe17f..a1d67faf 100644
--- a/.github/workflows/gh-aw-test-improvement.lock.yml
+++ b/.github/workflows/gh-aw-test-improvement.lock.yml
@@ -41,7 +41,7 @@
 #
 # inlined-imports: true
 #
-# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"83801e7e9acc5954b0b6a3ac5842532c89300a09ed7a9ab7138af32251c990ec"}
+# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"0cb0e71f5f360f41d4858c9db3ca1aa5c20d667d07d3204f5c50947d71e43a0b"}
 
 name: "Test Improver"
 "on":
@@ -340,7 +340,7 @@ jobs:
           New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
           
           1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
-             - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest, `rspec --bisect` or loop in RSpec).
+             - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest).
              - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
           2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
              - Reliance on timing, sleep, or wall-clock assertions
diff --git a/.github/workflows/gh-aw-test-improver.lock.yml b/.github/workflows/gh-aw-test-improver.lock.yml
index 860b7c89..6f54c8a9 100644
--- a/.github/workflows/gh-aw-test-improver.lock.yml
+++ b/.github/workflows/gh-aw-test-improver.lock.yml
@@ -36,7 +36,7 @@
 #
 # inlined-imports: true
 #
-# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"83801e7e9acc5954b0b6a3ac5842532c89300a09ed7a9ab7138af32251c990ec"}
+# gh-aw-metadata: {"schema_version":"v1","frontmatter_hash":"0cb0e71f5f360f41d4858c9db3ca1aa5c20d667d07d3204f5c50947d71e43a0b"}
 
 name: "Test Improver"
 "on":
@@ -335,7 +335,7 @@ jobs:
           New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
           
           1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
-             - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest, `rspec --bisect` or loop in RSpec).
+             - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest).
              - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
           2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
              - Reliance on timing, sleep, or wall-clock assertions
diff --git a/.github/workflows/gh-aw-test-improver.md b/.github/workflows/gh-aw-test-improver.md
index d90b4fce..a83ebb5d 100644
--- a/.github/workflows/gh-aw-test-improver.md
+++ b/.github/workflows/gh-aw-test-improver.md
@@ -131,7 +131,7 @@ Identify under-tested code paths, add focused tests, and remove or consolidate d
 New tests that pass once may still be flaky. Before filing a PR, verify stability by running each new or modified test multiple times.
 
 1. Run each new or modified test **at least 5 times** in sequence and confirm every run passes.
-   - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest, `rspec --bisect` or loop in RSpec).
+   - Use the test framework's built-in repeat/count flag when available (e.g., `go test -count=5`, `pytest -x --count 5` with `pytest-repeat`, `--repeat 5` in Jest/Vitest).
    - If no built-in mechanism exists, use a simple shell loop: `for i in $(seq 1 5); do <test-command> || exit 1; done`
 2. If any run fails intermittently, investigate the root cause before proceeding. Common sources of flakiness:
    - Reliance on timing, sleep, or wall-clock assertions