chore(codex): bootstrap PR for issue #1402 #1403

Merged
stranske merged 13 commits into main from codex/issue-1402 on Feb 9, 2026

Owner

@stranske stranske commented Feb 8, 2026

Source: Issue #1402

Automated Status Summary

Scope

PR #1387 addressed issue #1385, but verification returned CONCERNS because .agents/** is not fully excluded from bot review comment generation and the dismissal workflow/script integration is incomplete. This follow-up issue closes the remaining gaps by (1) enforcing .agents/** filtering at the point where review comments are created, and (2) ensuring the reusable bot comment handler dismisses only the individual comments on ignored paths (with correct pattern matching, age filtering, and structured logging), while keeping the diff tightly scoped.

Context for Agent

Related Issues/PRs

Tasks

Connector Configuration

  • Add .agents/** pattern to the connector's ignored_paths configuration file or settings object
  • Implement filter logic in the file-selection code path that runs before review comment construction
  • Apply the ignored_paths filter to exclude matching files from the review comment generation pipeline
  • Write unit tests that verify .agents/** files are excluded from the file selection results

Dismissal Script Enhancement

  • Replace string prefix checks with glob pattern matching using minimatch or equivalent library in bot-comment-dismiss.js
  • Implement pattern matching logic that correctly handles nested paths like .agents/a/b/c.yml
  • Add unit tests verifying glob patterns match nested paths under .agents/ correctly
  • Add negative test cases confirming non-matching paths are not incorrectly dismissed

Per-Comment Dismissal Logic

  • Modify dismissal logic to iterate through individual review comments rather than dismissing entire reviews
  • Implement path filtering that checks each comment's path field against ignored_paths patterns
  • Add logic to skip dismissal for comments whose paths do not match ignored patterns
  • Write integration tests for mixed-path reviews with both ignored and non-ignored file comments

Structured Logging

  • Add structured logging for each dismissed review comment by invoking formatDismissLog() for every dismissal
  • Ensure the log includes bot name and file path for each dismissed comment

Scope Cleanup

  • Identify and revert changes related to verify-compare evaluation logic modifications
  • Remove changes related to chain depth tracking functionality additions
  • Revert ledger validation caching implementation changes
  • Remove dependency version bumps unrelated to the connector filtering feature

Acceptance criteria

Connector Filtering

  • The connector's ignored-paths configuration includes an entry that matches all files under .agents/ (e.g., .agents/** or equivalent supported pattern)
  • The filter is applied in the code path that selects files for review comment generation (i.e., before any review comment is constructed/posted)
  • Given an input file list containing .agents/issue-test-ledger.yml and src/app.ts, the connector's file-selection logic returns src/app.ts and excludes .agents/issue-test-ledger.yml
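
The file-selection criterion above can be sketched roughly as follows. This is a minimal Python illustration, not the connector's actual code: `IGNORED_PATHS`, `is_ignored`, and `select_files_for_review` are hypothetical names, and the standard-library `fnmatch` module stands in for whatever matcher the connector uses (its `*` also crosses `/`, which is looser than strict glob semantics).

```python
from fnmatch import fnmatchcase
from typing import List, Sequence

# Hypothetical configuration; the connector's real ignored_paths format is not shown in this PR.
IGNORED_PATHS = (".agents/**",)


def is_ignored(path: str, patterns: Sequence[str] = IGNORED_PATHS) -> bool:
    """Return True when the path matches any ignored pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def select_files_for_review(paths: Sequence[str]) -> List[str]:
    """Drop ignored files before any review comment is constructed."""
    return [path for path in paths if not is_ignored(path)]


selected = select_files_for_review([".agents/issue-test-ledger.yml", "src/app.ts"])
print(selected)  # ['src/app.ts']
```

Filtering at selection time means no review comment is ever built for an ignored file, so the dismissal script only has to handle comments created before the filter existed.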

Script Integration

  • The script .github/scripts/bot-comment-dismiss.js accepts maxAgeSeconds as an explicit input argument
  • When maxAgeSeconds is provided, the dismissal logic only dismisses individual review comments whose created_at timestamp is newer than now - maxAgeSeconds
  • Older comments are left unchanged when maxAgeSeconds filtering is active
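
As a rough illustration of the age filter described above, the sketch below keeps only comments whose `created_at` is newer than `now - maxAgeSeconds`. It is written in Python for brevity, and `parse_github_timestamp` and `is_within_max_age` are hypothetical helpers, not functions from bot-comment-dismiss.js.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_github_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2026-02-08T22:18:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_within_max_age(created_at: str, max_age_seconds: Optional[int], now: datetime) -> bool:
    """True when no age limit is set or the comment is newer than now - max_age_seconds."""
    if max_age_seconds is None:
        return True
    cutoff = now - timedelta(seconds=max_age_seconds)
    return parse_github_timestamp(created_at) > cutoff


now = datetime(2026, 2, 8, 23, 0, 0, tzinfo=timezone.utc)
comments = [
    {"id": 1, "created_at": "2026-02-08T22:59:30Z"},  # 30 seconds old
    {"id": 2, "created_at": "2026-02-08T21:00:00Z"},  # 2 hours old
]
eligible = [c["id"] for c in comments if is_within_max_age(c["created_at"], 60, now)]
print(eligible)  # [1]
```

Passing `now` explicitly keeps the check deterministic in tests; a real caller would presumably supply the current UTC time.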

Pattern Matching

  • Ignored-path matching supports patterns that match nested paths under .agents/ (e.g., .agents/** matches .agents/a/b/c.yml)
  • Pattern matching does not rely on simple prefix-only string checks
  • Unit tests verify that .agents/nested/deep/file.yml matches .agents/** pattern
  • Unit tests verify that src/agents/file.ts does NOT match .agents/** pattern
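
The two unit-test criteria above can be exercised with a small glob-to-regex translator. This is a simplified sketch that covers only `**`, `*`, and `?`; it does not claim to reproduce minimatch or the script's own converter, and `glob_to_regex` and `matches_glob` are hypothetical names.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset: '**' crosses '/', while '*' and '?' do not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches_glob(path: str, pattern: str) -> bool:
    """Match the whole path, so a pattern is anchored at both ends."""
    return glob_to_regex(pattern).fullmatch(path) is not None


print(matches_glob(".agents/nested/deep/file.yml", ".agents/**"))  # True
print(matches_glob("src/agents/file.ts", ".agents/**"))            # False
```

Anchoring with `fullmatch` is what keeps `src/agents/file.ts` from matching: the literal `.agents/` prefix must start at the beginning of the path.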

Per-Comment Dismissal

  • For a mixed-path GitHub review containing at least two review comments—one on an ignored path .agents/issue-test-ledger.yml and one on a non-ignored path src/app.ts—the script dismisses only the ignored-path comment
  • The script does not dismiss the entire review object when only some comments match ignored paths
  • The script does not dismiss non-ignored comments in mixed reviews
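
A minimal sketch of the per-comment behaviour, assuming each review comment is a dict with `id` and `path` fields. `partition_comments` is a hypothetical helper, and `fnmatch` is again only a stand-in matcher.

```python
from fnmatch import fnmatchcase


def partition_comments(comments, ignored_patterns):
    """Split review comments into (to_dismiss, to_keep) by each comment's path."""
    to_dismiss, to_keep = [], []
    for comment in comments:
        path = comment.get("path", "")
        if any(fnmatchcase(path, pattern) for pattern in ignored_patterns):
            to_dismiss.append(comment)
        else:
            to_keep.append(comment)
    return to_dismiss, to_keep


review_comments = [
    {"id": 101, "path": ".agents/issue-test-ledger.yml"},
    {"id": 102, "path": "src/app.ts"},
]
to_dismiss, to_keep = partition_comments(review_comments, [".agents/**"])
print([c["id"] for c in to_dismiss], [c["id"] for c in to_keep])  # [101] [102]
```

Because the decision is made per comment, the review object itself is never touched; only the entries in `to_dismiss` would be acted on.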

Logging

  • Each dismissed review comment produces exactly one structured log entry via formatDismissLog()
  • Each log entry includes (a) the bot identity and (b) the exact file path of the dismissed comment
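
The logging criteria could look roughly like this. The real `formatDismissLog()` signature and field names are not shown in this thread, so `format_dismiss_log`, the `event` value, and the dict layout below are assumptions made for illustration.

```python
import json


def format_dismiss_log(bot: str, path: str, comment_id: int) -> str:
    """Build one JSON log line for a single dismissed comment."""
    entry = {
        "event": "bot_comment_dismissed",  # hypothetical event name
        "bot": bot,
        "path": path,
        "comment_id": comment_id,
    }
    return json.dumps(entry, sort_keys=True)


dismissed = [
    {"id": 101, "user": "copilot[bot]", "path": ".agents/issue-test-ledger.yml"},
    {"id": 103, "user": "copilot[bot]", "path": ".agents/a/b/c.yml"},
]
# Exactly one log line per dismissed comment.
log_lines = [format_dismiss_log(c["user"], c["path"], c["id"]) for c in dismissed]
for line in log_lines:
    print(line)
```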

End-to-End Validation

  • Given a test PR with changes to .agents/issue-test-ledger.yml, the dismissal script successfully dismisses all matching review comments when invoked
  • Querying the GitHub API for remaining non-dismissed comments on .agents/** paths returns zero results after script execution

Scope Control

  • The PR modifies only the following files: (1) files under chatgpt-codex-connector/ related to ignored_paths filtering, (2) .github/scripts/bot-comment-dismiss.js, and (3) test files with names matching **/test/**/ignore* or **/test/**/dismiss*
  • No other files are modified

@stranske stranske added the agent:codex (Agent-created issues from Codex), agents:keepalive (Use to initiate keepalive functionality with agents), and autofix (Opt-in automated formatting & lint remediation) labels Feb 8, 2026
Copilot AI review requested due to automatic review settings February 8, 2026 22:18
Contributor

stranske-keepalive bot commented Feb 8, 2026

🤖 Keepalive Loop Status

PR #1403 | Agent: Codex | Iteration 5+2 🚀 extended

Current State

| Metric | Value |
| Iteration progress | [##########] 5/5 (5 base + 2 extended = 7 total) |
| Action | stop (max-iterations-unproductive) |
| Gate | success |
| Tasks | 36/37 complete |
| Timeout | 45 min (default) |
| Timeout usage | 5m elapsed (11%, 40m remaining) |
| Keepalive | ✅ enabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | unknown |
| Suggested recovery | Capture logs and context; retry once and escalate if the issue persists. |

⚠️ Failure Tracking

| Consecutive failures | 1/3 |
| Reason | max-iterations-unproductive |

🛑 Paused – Human Attention Required

The keepalive loop has paused due to repeated failures.

To resume:

  1. Investigate the failure reason above
  2. Fix any issues in the code or prompt
  3. Remove the needs-human label from this PR
  4. The next Gate pass will restart the loop

Or manually edit this comment to reset failure: {} in the state below.

Contributor

Copilot AI left a comment

Pull request overview

Adds the standard Codex bootstrap marker file for issue #1402, consistent with existing agents/codex-*.md bootstrap entries.

Changes:

  • Create agents/codex-1402.md with the bootstrap HTML comment used by the Codex/agents workflow.

Contributor

stranske-keepalive bot commented Feb 8, 2026

✅ Codex Completion Checkpoint

Iteration: 5
Commit: 72e29f6
Recorded: 2026-02-08T22:51:20.422Z

No new completions recorded this round.

About this comment

This comment is automatically generated to track task completions.
The Automated Status Summary reads these checkboxes to update PR progress.
Do not edit this comment manually.

@agents-workflows-bot agents-workflows-bot bot added the agent:needs-attention (Agent needs human review or intervention) label Feb 8, 2026
@agents-workflows-bot agents-workflows-bot bot added the needs-human (Requires human intervention or review) label Feb 8, 2026
@stranske stranske merged commit 631e831 into main Feb 9, 2026
64 of 68 checks passed
@stranske stranske deleted the codex/issue-1402 branch February 9, 2026 09:16
@stranske stranske added the verify:compare (Compare multiple LLM evaluations) label Feb 9, 2026
Contributor

github-actions bot commented Feb 9, 2026

Provider Comparison Report

Provider Summary

| Provider | Model | Verdict | Confidence | Summary |
| openai | gpt-5.2 | CONCERNS | 72% | The PR improves ignored-path matching in bot-comment-dismiss.js and adds tests covering nested .agents/** matching, non-matches, mixed-path dismissal behavior, maxAgeSeconds filtering behavior (as... |
| anthropic | claude-sonnet-4-5-20250929 | CONCERNS | 85% | The PR successfully implements glob pattern matching for bot comment dismissal with comprehensive unit tests covering nested paths, mixed reviews, age filtering, and structured logging. The bot-com... |

📋 Full Provider Details

openai

  • Model: gpt-5.2
  • Verdict: CONCERNS
  • Confidence: 72%
  • Scores:
    • Correctness: 6.0/10
    • Completeness: 5.0/10
    • Quality: 6.0/10
    • Testing: 6.0/10
    • Risks: 5.0/10
  • Summary: The PR improves ignored-path matching in bot-comment-dismiss.js and adds tests covering nested .agents/** matching, non-matches, mixed-path dismissal behavior, maxAgeSeconds filtering behavior (as used by the JS API), and per-dismissal structured logging. However, it does not implement the connector-side ignored-path filtering required by the acceptance criteria, and it violates the stated scope constraints by changing several unrelated Python/dependency files and adding a non-test/non-script markdown file. There are also some integration uncertainties: the diff does not clearly show maxAgeSeconds being accepted as an explicit script input when invoked, and the custom glob-to-regex implementation may diverge from standard glob semantics.
  • Concerns:
    • Scope control acceptance criterion is not met: PR modifies multiple unrelated files (requirements.txt, scripts/langchain/structured_output.py, tests/test_fallback_chain_provider.py, tests/test_structured_output.py) and adds agents/codex-1395.md, which are outside the allowed file set in the stated criteria.
    • Acceptance criteria call for connector-side filtering under chatgpt-codex-connector/ before review comment construction; this PR does not include any connector changes or tests for connector file-selection behavior (e.g., excluding .agents/issue-test-ledger.yml while keeping src/app.ts).
    • bot-comment-dismiss.js acceptance requires maxAgeSeconds be an explicit input argument; the diff shown does not demonstrate any parsing/acceptance of maxAgeSeconds as an input (it is passed programmatically in tests, but there is no visible implementation change ensuring the script accepts it as an explicit argument/environment input when invoked as a script).
    • Glob/pattern matching is implemented via a custom regex converter rather than a standard glob library (minimatch). While it likely works for the tested cases, this increases risk of edge-case mismatches vs typical glob semantics (e.g., character classes, brace expansion, escaping behavior).
    • Timestamp handling was narrowed to created_at/createdAt only (dropping updated_at/updatedAt). If existing callers relied on updated timestamps for age filtering, this could change behavior.

anthropic

  • Model: claude-sonnet-4-5-20250929
  • Verdict: CONCERNS
  • Confidence: 85%
  • Scores:
    • Correctness: 7.0/10
    • Completeness: 6.0/10
    • Quality: 8.0/10
    • Testing: 9.0/10
    • Risks: 6.0/10
  • Summary: The PR successfully implements glob pattern matching for bot comment dismissal with comprehensive unit tests covering nested paths, mixed reviews, age filtering, and structured logging. The bot-comment-dismiss.js implementation is correct and well-tested. However, the PR has significant scope violations: (1) No connector filtering implementation is present despite being a core requirement, (2) Multiple unrelated files are modified (pandas downgrade, structured_output refactor, fallback chain provider tests), (3) The final acceptance criterion explicitly states 'No other files are modified' but 4 out of 8 changed files are out of scope. The dismissal script changes are production-ready, but the missing connector work and scope violations prevent a PASS verdict.
  • Concerns:
    • SCOPE VIOLATION: The PR modifies files outside the documented scope. Changes to requirements.txt (pandas downgrade), scripts/langchain/structured_output.py (invoke_repair_loop refactor), tests/test_fallback_chain_provider.py (BackupQualityProvider addition), and tests/test_structured_output.py (test refactor) are unrelated to the stated issue of filtering .agents/** paths and dismissing bot comments.
    • CONNECTOR FILTERING NOT IMPLEMENTED: The acceptance criteria require modifications to 'chatgpt-codex-connector/' for ignored_paths configuration and file-selection filtering. No such files appear in the diff. The connector filtering acceptance criteria cannot be verified from the code changes.
    • MISSING END-TO-END VALIDATION: The acceptance criteria require verification that 'querying the GitHub API for remaining non-dismissed comments on .agents/** paths returns zero results after script execution.' No test in the diff validates this end-to-end behavior.
    • INCOMPLETE SCOPE CLEANUP: The acceptance criteria explicitly require reverting changes related to verify-compare evaluation logic, chain depth tracking, and ledger validation caching. The presence of test_fallback_chain_provider.py and test_structured_output.py changes suggests these were not fully reverted.
    • DEPENDENCY DOWNGRADE RISK: The pandas downgrade from 3.0.0 to 2.3.3 in requirements.txt is unexplained and potentially introduces compatibility or security risks. This change is not mentioned in the scope or tasks.

Agreement

  • Verdict: CONCERNS (all providers)
  • Correctness: scores within 1 point (avg 6.5/10, range 6.0-7.0)
  • Completeness: scores within 1 point (avg 5.5/10, range 5.0-6.0)
  • Risks: scores within 1 point (avg 5.5/10, range 5.0-6.0)

Disagreement

| Dimension | openai | anthropic |
| Quality | 6.0/10 | 8.0/10 |
| Testing | 6.0/10 | 9.0/10 |

Owner Author

stranske commented Feb 9, 2026

📋 Follow-up issue created: #1407

Verification concerns have been analyzed and structured into a follow-up issue.

Next steps:

  1. Review the generated issue
  2. Auto-pilot will continue preparing a new PR

Or work on it manually - the choice is yours!

stranske added a commit that referenced this pull request Feb 12, 2026
…eshold

Two independent fixes for broken automation flows:

1. capability_check.py: The bare \bsecrets?\b regex matched negative
   mentions like 'no secrets' in issue constraint text, causing
   _requires_admin_access() to return true and the fallback classifier
   to BLOCK tasks that merely *describe* a no-secrets constraint.
   Replace with specific verb+secrets patterns (manage/configure/set/
   create/update/delete/add/modify/rotate secrets).
   Root cause of PAEM #1403 false-positive BLOCKED.

2. verdict_policy.py: CONCERNS_NEEDS_HUMAN_THRESHOLD lowered from 0.85
   to 0.50.  The old threshold meant any split verdict (PASS + CONCERNS)
   with <85% confidence on the concerns side triggered needs_human,
   blocking automatic follow-up issue creation.  A 72% confidence
   concerns verdict (TMP #4894) is well above chance and should produce
   a follow-up rather than require manual triage.

Both template and main copies updated; new regression tests added.
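
The regex change in item 1 of the commit message above can be illustrated as follows. This is a sketch under assumptions: the exact patterns in capability_check.py are not shown here, so `VERB_SECRETS` only approximates the "verb + secrets" shape the message describes.

```python
import re

# Old behaviour described above: any mention of "secret" or "secrets" matches.
BARE_SECRETS = re.compile(r"\bsecrets?\b", re.IGNORECASE)

# Assumed shape of the replacement: an action verb directly followed by "secret(s)".
VERB_SECRETS = re.compile(
    r"\b(?:manage|configure|set|create|update|delete|add|modify|rotate)\s+secrets?\b",
    re.IGNORECASE,
)

constraint = "Constraint: no secrets may be used in this task."
request = "Please rotate secrets for the deploy workflow."

# The bare pattern flags both texts; the verb pattern flags only the real request.
print(bool(BARE_SECRETS.search(constraint)), bool(VERB_SECRETS.search(constraint)))  # True False
print(bool(BARE_SECRETS.search(request)), bool(VERB_SECRETS.search(request)))        # True True
```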
stranske added a commit that referenced this pull request Feb 12, 2026
* fix: resolve 8 issues found in Codex run log audit

Essential fixes:
- Reporter sparse-checkout: add .github/actions to checkout so setup-api-client
  action is available (was failing 100% on Workflows repo)
- Belt Worker: re-install API client after branch checkout wipes node_modules
  (was causing @octokit/rest import failures and degraded token rotation)

High-value fixes:
- LLM analysis outputs: use print(..., end='') to strip trailing newlines from
  python extraction (confidence values had '\n' suffix e.g. '0.63\n')
- Repo variables fetch: downgrade from core.info to core.debug since the token
  permission limitation is known and the fallback to defaults works correctly

Medium fixes:
- Health 75 API Rate Diagnostic: pass secrets to 4 setup-api-client calls that
  were missing the input, causing 'No tokens were exported' warnings
- datetime.utcnow(): replace deprecated calls with timezone-aware alternative
  in both Belt Worker ledger functions

Low-salience fixes:
- error_classifier: gate entry log behind RUNNER_DEBUG to reduce log noise
- Non-artifact commit warning: downgrade from warning to notice since it is
  expected behavior when Codex produces only workflow artifacts

* fix: address review comments on belt worker re-install step

1. Use .belt-tools action path instead of ./ for setup-api-client
   after branch checkout, so the action runs from trusted Workflows
   code rather than the untrusted issue branch (security fix).

2. Pass GH_BELT_TOKEN || github.token as github_token input to
   preserve the belt token selection instead of overriding
   GITHUB_TOKEN/GH_TOKEN with the default workflow token.

* fix: capability_check false-positive on 'secrets' + lower verdict threshold

* chore(codex-autofix): apply updates (PR #1480)

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
stranske added a commit that referenced this pull request Feb 12, 2026
* fix: resolve 8 issues found in Codex run log audit

* fix: address review comments on belt worker re-install step

* fix: capability_check false-positive on 'secrets' + lower verdict threshold

* fix: prevent Codex bootstrap from overwriting vendored node_modules

Three-layer fix for the systemic issue where setup-api-client's npm install
overwrites vendored minimatch package.json, and git add -A captures the
modification into bootstrap/autofix commits.

Layer 1 (source fix): setup-api-client/action.yml
  - Snapshot vendored package.json files before npm install
  - Restore them after npm install completes
  - Applied to both .github/actions/ and templates/consumer-repo/

Layer 2 (targeted staging): reusable-agents-issue-bridge.yml
  - Replace 'git add -A' with targeted 'git add agents/${AGENT}-${ISSUE}.md'
  - Only the bootstrap file gets staged, not npm side-effects

Layer 3 (safety net): reusable-18-autofix.yml
  - Add 'git reset HEAD -- .github/scripts/node_modules ...' after git add -A
  - Matches existing pattern in reusable-codex-run.yml line 1184
  - Applied to both push-commit and patch-commit paths

Also fixes test assertions that referenced the old CONCERNS_NEEDS_HUMAN_THRESHOLD
(was 0.85, now 0.50) — confidence values in tests updated accordingly.

Fixes: Copilot review finding on PAEM PR #1417 (minimatch vendoring cycle)

* fix: flip needs_human to trigger on high-confidence CONCERNS, not low

The needs_human gate was backwards: it fired when the CONCERNS provider
had LOW confidence (LLM unsure there's a problem) instead of HIGH
confidence (LLM confident there's a real problem).

Confidence reflects the LLM's certainty in its own evaluation, not a
measure of code quality. Low-confidence CONCERNS is a weak signal that
shouldn't block follow-up automation. High-confidence CONCERNS is the
stronger signal warranting human review.

Changed: confidence_value < threshold  →  confidence_value >= threshold
Threshold set to 0.85 (high bar — a human is already in the loop and
depth-of-rounds provides an independent guard against runaway automation).

* chore(codex-autofix): apply updates (PR #1483)

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
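
The flipped needs_human gate described in the commit message above (`confidence_value >= threshold`, threshold 0.85) can be sketched like this. The function below is a simplified assumption: the real verdict_policy.py logic for split verdicts is not shown in this thread, and `needs_human` here considers a single provider verdict only.

```python
CONCERNS_NEEDS_HUMAN_THRESHOLD = 0.85


def needs_human(
    verdict: str,
    confidence: float,
    threshold: float = CONCERNS_NEEDS_HUMAN_THRESHOLD,
) -> bool:
    """Flag for human review only when a CONCERNS verdict is high-confidence."""
    return verdict == "CONCERNS" and confidence >= threshold


print(needs_human("CONCERNS", 0.72))  # False: weak signal, follow-up automation continues
print(needs_human("CONCERNS", 0.85))  # True: confident concerns warrant human review
print(needs_human("PASS", 0.95))      # False: only CONCERNS verdicts are gated
```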
stranske added a commit that referenced this pull request Feb 12, 2026
* fix: resolve 8 issues found in Codex run log audit

* fix: address review comments on belt worker re-install step

* fix: capability_check false-positive on 'secrets' + lower verdict threshold

* fix: prevent Codex bootstrap from overwriting vendored node_modules

* fix: flip needs_human to trigger on high-confidence CONCERNS, not low

* fix: harden Codex pipeline — corrupt ledger resilience, autofix limits, task-focused prompts, PR meta debounce

- ledger_migrate_base.py: skip corrupt YAML files instead of blocking all
  belt worker runs (root cause of issue #1418 stall)
- agents-autofix-loop: reduce max_attempts 3→2 (standard) and 2→1
  (escalated) to cut autofix churn observed in PR #4906
- agents-72-codex-belt-worker: emit task_title output and include
  task-focused directive in activation comment for higher first-commit
  success rate
- agents-pr-meta: add PR-number concurrency grouping with
  cancel-in-progress for pull_request events to debounce redundant runs
- All template counterparts updated in sync
- 2 new tests for corrupt ledger handling

* chore(autofix): formatting/lint

* chore(codex-autofix): apply updates (PR #1484)

* chore(codex-autofix): apply updates (PR #1484)

* chore: sync template scripts

* fix: sanitize task_title for GITHUB_OUTPUT and normalize warning annotations

Address inline review feedback on PR #1484:
- Sanitize task_title by replacing newlines/carriage returns with spaces
  before writing to $GITHUB_OUTPUT (prevents broken output parsing)
- Normalize yaml.YAMLError messages to single-line in ::warning:: annotations
  (prevents malformed GitHub Actions annotations)
- Both belt-worker copies updated in sync

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
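
The task_title sanitization mentioned in the last commit above can be sketched as below, assuming the single-line `name=value` form that `$GITHUB_OUTPUT` accepts. `sanitize_output_value` and `format_github_output` are hypothetical helpers, not the belt worker's actual code.

```python
def sanitize_output_value(value: str) -> str:
    """Replace CR/LF with spaces so the value fits on one name=value line."""
    return value.replace("\r", " ").replace("\n", " ").strip()


def format_github_output(name: str, value: str) -> str:
    """Build the single line that would be appended to the $GITHUB_OUTPUT file."""
    return f"{name}={sanitize_output_value(value)}\n"


line = format_github_output("task_title", "Fix ledger\r\nvalidation caching")
print(repr(line))
```

Without the replacement, the embedded newline would end the `task_title` entry early and leave a stray second line in the output file.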