Wire review suppression guard into keepalive workflow (issue #1414)#1417
Wire review suppression guard into keepalive workflow (issue #1414)#1417
Conversation
Automated Status SummaryHead SHA: dde88e7
Updated automatically; will refresh on subsequent CI/Docker completions. Keepalive checklistScopeNo scope information available Tasks
Acceptance criteria
|
🤖 Keepalive Loop StatusPR #1417 | Agent: Codex | Iteration 5+2 🚀 extended Current State
🔍 Failure Classification| Error type | infrastructure | |
Summary
Testing
Also completed:
|
✅ Codex Completion CheckpointIteration: 5 Tasks Completed
Acceptance Criteria Met
About this commentThis comment is automatically generated to track task completions. |
|
Status | ✅ no new diagnostics |
|
Autofix updated these files:
|
The check_api_wrapper_guard.py already excluded __tests__ directories since test files use mock github objects, not real API clients. Extend the exclusion to also cover the top-level tests/ directory, which contains YAML fixture strings with API call patterns (e.g. github.rest.issues.createComment) that are test data, not actual API usage.
Provider Comparison ReportProvider Summary
📋 Full Provider Details (click to expand)openai
anthropic
Agreement
DisagreementNo major disagreements detected. Unique Insights
|
Three-layer fix for the systemic issue where setup-api-client's npm install
overwrites vendored minimatch package.json, and git add -A captures the
modification into bootstrap/autofix commits.
Layer 1 (source fix): setup-api-client/action.yml
- Snapshot vendored package.json files before npm install
- Restore them after npm install completes
- Applied to both .github/actions/ and templates/consumer-repo/
Layer 2 (targeted staging): reusable-agents-issue-bridge.yml
- Replace 'git add -A' with targeted 'git add agents/${AGENT}-${ISSUE}.md'
- Only the bootstrap file gets staged, not npm side-effects
Layer 3 (safety net): reusable-18-autofix.yml
- Add 'git reset HEAD -- .github/scripts/node_modules ...' after git add -A
- Matches existing pattern in reusable-codex-run.yml line 1184
- Applied to both push-commit and patch-commit paths
Also fixes test assertions that referenced the old CONCERNS_NEEDS_HUMAN_THRESHOLD
(was 0.85, now 0.50) — confidence values in tests updated accordingly.
Fixes: Copilot review finding on PAEM PR #1417 (minimatch vendoring cycle)
* fix: resolve 8 issues found in Codex run log audit Essential fixes: - Reporter sparse-checkout: add .github/actions to checkout so setup-api-client action is available (was failing 100% on Workflows repo) - Belt Worker: re-install API client after branch checkout wipes node_modules (was causing @octokit/rest import failures and degraded token rotation) High-value fixes: - LLM analysis outputs: use print(..., end='') to strip trailing newlines from python extraction (confidence values had '\n' suffix e.g. '0.63\n') - Repo variables fetch: downgrade from core.info to core.debug since the token permission limitation is known and the fallback to defaults works correctly Medium fixes: - Health 75 API Rate Diagnostic: pass secrets to 4 setup-api-client calls that were missing the input, causing 'No tokens were exported' warnings - datetime.utcnow(): replace deprecated calls with timezone-aware alternative in both Belt Worker ledger functions Low-salience fixes: - error_classifier: gate entry log behind RUNNER_DEBUG to reduce log noise - Non-artifact commit warning: downgrade from warning to notice since it is expected behavior when Codex produces only workflow artifacts * fix: address review comments on belt worker re-install step 1. Use .belt-tools action path instead of ./ for setup-api-client after branch checkout, so the action runs from trusted Workflows code rather than the untrusted issue branch (security fix). 2. Pass GH_BELT_TOKEN || github.token as github_token input to preserve the belt token selection instead of overriding GITHUB_TOKEN/GH_TOKEN with the default workflow token. * fix: capability_check false-positive on 'secrets' + lower verdict threshold Two independent fixes for broken automation flows: 1. capability_check.py: The bare \bsecrets?\b regex matched negative mentions like 'no secrets' in issue constraint text, causing _requires_admin_access() to return true and the fallback classifier to BLOCK tasks that merely *describe* a no-secrets constraint. Replace with specific verb+secrets patterns (manage/configure/set/ create/update/delete/add/modify/rotate secrets). Root cause of PAEM #1403 false-positive BLOCKED. 2. verdict_policy.py: CONCERNS_NEEDS_HUMAN_THRESHOLD lowered from 0.85 to 0.50. The old threshold meant any split verdict (PASS + CONCERNS) with <85% confidence on the concerns side triggered needs_human, blocking automatic follow-up issue creation. A 72% confidence concerns verdict (TMP #4894) is well above chance and should produce a follow-up rather than require manual triage. Both template and main copies updated; new regression tests added. * fix: prevent Codex bootstrap from overwriting vendored node_modules Three-layer fix for the systemic issue where setup-api-client's npm install overwrites vendored minimatch package.json, and git add -A captures the modification into bootstrap/autofix commits. Layer 1 (source fix): setup-api-client/action.yml - Snapshot vendored package.json files before npm install - Restore them after npm install completes - Applied to both .github/actions/ and templates/consumer-repo/ Layer 2 (targeted staging): reusable-agents-issue-bridge.yml - Replace 'git add -A' with targeted 'git add agents/${AGENT}-${ISSUE}.md' - Only the bootstrap file gets staged, not npm side-effects Layer 3 (safety net): reusable-18-autofix.yml - Add 'git reset HEAD -- .github/scripts/node_modules ...' after git add -A - Matches existing pattern in reusable-codex-run.yml line 1184 - Applied to both push-commit and patch-commit paths Also fixes test assertions that referenced the old CONCERNS_NEEDS_HUMAN_THRESHOLD (was 0.85, now 0.50) — confidence values in tests updated accordingly. Fixes: Copilot review finding on PAEM PR #1417 (minimatch vendoring cycle) * fix: flip needs_human to trigger on high-confidence CONCERNS, not low The needs_human gate was backwards: it fired when the CONCERNS provider had LOW confidence (LLM unsure there's a problem) instead of HIGH confidence (LLM confident there's a real problem). Confidence reflects the LLM's certainty in its own evaluation, not a measure of code quality. Low-confidence CONCERNS is a weak signal that shouldn't block follow-up automation. High-confidence CONCERNS is the stronger signal warranting human review. Changed: confidence_value < threshold → confidence_value >= threshold Threshold set to 0.85 (high bar — a human is already in the loop and depth-of-rounds provides an independent guard against runaway automation). * chore(codex-autofix): apply updates (PR #1483) --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix: resolve 8 issues found in Codex run log audit Essential fixes: - Reporter sparse-checkout: add .github/actions to checkout so setup-api-client action is available (was failing 100% on Workflows repo) - Belt Worker: re-install API client after branch checkout wipes node_modules (was causing @octokit/rest import failures and degraded token rotation) High-value fixes: - LLM analysis outputs: use print(..., end='') to strip trailing newlines from python extraction (confidence values had '\n' suffix e.g. '0.63\n') - Repo variables fetch: downgrade from core.info to core.debug since the token permission limitation is known and the fallback to defaults works correctly Medium fixes: - Health 75 API Rate Diagnostic: pass secrets to 4 setup-api-client calls that were missing the input, causing 'No tokens were exported' warnings - datetime.utcnow(): replace deprecated calls with timezone-aware alternative in both Belt Worker ledger functions Low-salience fixes: - error_classifier: gate entry log behind RUNNER_DEBUG to reduce log noise - Non-artifact commit warning: downgrade from warning to notice since it is expected behavior when Codex produces only workflow artifacts * fix: address review comments on belt worker re-install step 1. Use .belt-tools action path instead of ./ for setup-api-client after branch checkout, so the action runs from trusted Workflows code rather than the untrusted issue branch (security fix). 2. Pass GH_BELT_TOKEN || github.token as github_token input to preserve the belt token selection instead of overriding GITHUB_TOKEN/GH_TOKEN with the default workflow token. * fix: capability_check false-positive on 'secrets' + lower verdict threshold Two independent fixes for broken automation flows: 1. capability_check.py: The bare \bsecrets?\b regex matched negative mentions like 'no secrets' in issue constraint text, causing _requires_admin_access() to return true and the fallback classifier to BLOCK tasks that merely *describe* a no-secrets constraint. Replace with specific verb+secrets patterns (manage/configure/set/ create/update/delete/add/modify/rotate secrets). Root cause of PAEM #1403 false-positive BLOCKED. 2. verdict_policy.py: CONCERNS_NEEDS_HUMAN_THRESHOLD lowered from 0.85 to 0.50. The old threshold meant any split verdict (PASS + CONCERNS) with <85% confidence on the concerns side triggered needs_human, blocking automatic follow-up issue creation. A 72% confidence concerns verdict (TMP #4894) is well above chance and should produce a follow-up rather than require manual triage. Both template and main copies updated; new regression tests added. * fix: prevent Codex bootstrap from overwriting vendored node_modules Three-layer fix for the systemic issue where setup-api-client's npm install overwrites vendored minimatch package.json, and git add -A captures the modification into bootstrap/autofix commits. Layer 1 (source fix): setup-api-client/action.yml - Snapshot vendored package.json files before npm install - Restore them after npm install completes - Applied to both .github/actions/ and templates/consumer-repo/ Layer 2 (targeted staging): reusable-agents-issue-bridge.yml - Replace 'git add -A' with targeted 'git add agents/${AGENT}-${ISSUE}.md' - Only the bootstrap file gets staged, not npm side-effects Layer 3 (safety net): reusable-18-autofix.yml - Add 'git reset HEAD -- .github/scripts/node_modules ...' after git add -A - Matches existing pattern in reusable-codex-run.yml line 1184 - Applied to both push-commit and patch-commit paths Also fixes test assertions that referenced the old CONCERNS_NEEDS_HUMAN_THRESHOLD (was 0.85, now 0.50) — confidence values in tests updated accordingly. Fixes: Copilot review finding on PAEM PR #1417 (minimatch vendoring cycle) * fix: flip needs_human to trigger on high-confidence CONCERNS, not low The needs_human gate was backwards: it fired when the CONCERNS provider had LOW confidence (LLM unsure there's a problem) instead of HIGH confidence (LLM confident there's a real problem). Confidence reflects the LLM's certainty in its own evaluation, not a measure of code quality. Low-confidence CONCERNS is a weak signal that shouldn't block follow-up automation. High-confidence CONCERNS is the stronger signal warranting human review. Changed: confidence_value < threshold → confidence_value >= threshold Threshold set to 0.85 (high bar — a human is already in the loop and depth-of-rounds provides an independent guard against runaway automation). * fix: harden Codex pipeline — corrupt ledger resilience, autofix limits, task-focused prompts, PR meta debounce - ledger_migrate_base.py: skip corrupt YAML files instead of blocking all belt worker runs (root cause of issue #1418 stall) - agents-autofix-loop: reduce max_attempts 3→2 (standard) and 2→1 (escalated) to cut autofix churn observed in PR #4906 - agents-72-codex-belt-worker: emit task_title output and include task-focused directive in activation comment for higher first-commit success rate - agents-pr-meta: add PR-number concurrency grouping with cancel-in-progress for pull_request events to debounce redundant runs - All template counterparts updated in sync - 2 new tests for corrupt ledger handling * chore(autofix): formatting/lint * chore(codex-autofix): apply updates (PR #1484) * chore(codex-autofix): apply updates (PR #1484) * chore: sync template scripts * fix: sanitize task_title for GITHUB_OUTPUT and normalize warning annotations Address inline review feedback on PR #1484: - Sanitize task_title by replacing newlines/carriage returns with spaces before writing to $GITHUB_OUTPUT (prevents broken output parsing) - Normalize yaml.YAMLError messages to single-line in ::warning:: annotations (prevents malformed GitHub Actions annotations) - Both belt-worker copies updated in sync --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Automated Status Summary
Scope
PR #1413 addressed issue #1412, but verification identified remaining gaps (verdict: CONCERNS). This follow-up issue closes those gaps by (1) enforcing suppression at the workflow level so comment/review posting cannot run when suppressed, (2) implementing missing modules required for the core logic and tests, (3) fixing output semantics to avoid duplicate
$GITHUB_OUTPUTentries, (4) bounding pagination to prevent excessive API calls, and (5) removing remaining TODO/skipped coverage so the full test suite can validate behavior end-to-end.Context for Agent
Related Issues/PRs
Tasks
.github/workflows/keepalive.ymlto add explicitif:guards on every step/job that posts a PR comment or PR review so they cannot run when the suppression output indicates posting is suppressed..github/workflows/autofix.ymlto add explicitif:guards on every step/job that posts a PR comment so they cannot run when the suppression output indicates posting is suppressed.scripts/keepalive_review_guard.jsexporting functions to load the designated review result file and evaluate it, returningfalsewhen the file is missing, JSON parsing fails, or the parsed payload is an all-empty object.scripts/should-post-review.jsto call intoscripts/keepalive_review_guard.jsand ensure the final computed decision output isfalsewhen the guard returnsfalse.scripts/should-post-review.jsto write exactly onekey=valueline per run to the file path inprocess.env.GITHUB_OUTPUT(replace anyappendFileSync-style duplication) while keeping the output key name unchanged.test/keepalive_review_guard.test.jsto cover evaluator edge cases: missing file, invalid JSON, and all-empty object payload returningfalse.test/should-post-review.test.jswith an integration-style test that runsscripts/should-post-review.jsend-to-end using a tempGITHUB_OUTPUTfile and asserts it outputsfalsewhen the guard encounters missing/invalid/all-empty payload.scripts/bot-comment-handler.jsto enforce a hard upper bound on pagination (constant/configurableN) when listing PR comments for deduplication.test/bot-comment-handler.test.jsto assert the mocked PR comment-list API call count is<= Neven when mocked responses keep returning full pages.scripts/bot-comment-dismiss.jsexporting the API expected bytest/bot-comment-dismiss.test.js, and wire deterministic mocks/fixtures as needed so the test runs without network calls.it.skip/describe.skipor TODO placeholders intest/**that bypass assertions for the implemented features so the full test suite executes.Acceptance criteria
.github/workflows/keepalive.yml, every step/job that posts a PR comment or PR review includes anif:guard that evaluates to false when the suppression output indicates posting is suppressed..github/workflows/autofix.yml, every step/job that posts a PR comment includes anif:guard that evaluates to false when the suppression output indicates posting is suppressed.scripts/keepalive_review_guard.jsexists and exports functions to load the designated review result file and evaluate it, and the evaluator returnsfalsewhen: (a) the file does not exist, (b) file contents are not valid JSON, or (c) the parsed JSON payload is an all-empty object.scripts/should-post-review.jscalls intokeepalive_review_guardsuch that when the designated review result file is missing, invalid JSON, or all-empty, the final computed decision output used by workflows isfalse.scripts/should-post-review.jswrites exactly one line for the chosen output key to the file path specified by theGITHUB_OUTPUTenvironment variable (no duplicate keys/lines for the same output per run).scripts/should-post-review.jsmatches the key consumed in.github/workflows/keepalive.ymland.github/workflows/autofix.yml(workflows reference the exact same key name).test/keepalive_review_guard.test.jsincludes explicit test cases asserting the evaluator returnsfalsefor: (1) missing review result file, (2) invalid JSON file contents, and (3) all-empty object payload.test/should-post-review.test.jsincludes at least one integration-style test that executesscripts/should-post-review.jsend-to-end and asserts the produced$GITHUB_OUTPUTvalue isfalsewhen the guard encounters (a) missing file, (b) invalid JSON, or (c) all-empty payload.scripts/bot-comment-handler.jsenforces a hard upper bound on pagination when listing PR comments for deduplication so it stops requesting further pages afterNpages even if the API continues returning full pages.test/bot-comment-handler.test.jsverifies the maximum number of comment-list API calls does not exceed the configured page limit (<= N) via mock call-count assertions.scripts/bot-comment-dismiss.jsexists and can be imported bytest/bot-comment-dismiss.test.jswithout module-not-found errors.test/bot-comment-dismiss.test.jspasses using deterministic mocks/fixtures (no network calls).test/**are skipped (it.skip,describe.skip) and no TODO placeholders remain that bypass assertions for the implemented features.