Fix/codex log issues #1484

Merged
stranske merged 13 commits into main from fix/codex-log-issues
Feb 12, 2026

@stranske
Owner

No description provided.

Essential fixes:
- Reporter sparse-checkout: add .github/actions to checkout so setup-api-client
  action is available (was failing 100% on Workflows repo)
- Belt Worker: re-install API client after branch checkout wipes node_modules
  (was causing @octokit/rest import failures and degraded token rotation)

High-value fixes:
- LLM analysis outputs: use print(..., end='') to strip trailing newlines from
  python extraction (confidence values had '\n' suffix e.g. '0.63\n')
- Repo variables fetch: downgrade from core.info to core.debug since the token
  permission limitation is known and the fallback to defaults works correctly

Medium fixes:
- Health 75 API Rate Diagnostic: pass secrets to 4 setup-api-client calls that
  were missing the input, causing 'No tokens were exported' warnings
- datetime.utcnow(): replace deprecated calls with timezone-aware alternative
  in both Belt Worker ledger functions

Low-salience fixes:
- error_classifier: gate entry log behind RUNNER_DEBUG to reduce log noise
- Non-artifact commit warning: downgrade from warning to notice since it is
  expected behavior when Codex produces only workflow artifacts
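
Two of the fixes above are small enough to sketch in isolation. The snippet below is an illustration only, not the repository's code: `emit_value` and `utc_timestamp` are hypothetical helper names. It shows how `print(..., end='')` avoids the trailing newline behind values like '0.63\n', and one timezone-aware replacement for the deprecated `datetime.utcnow()`.

```python
import io
from contextlib import redirect_stdout
from datetime import datetime, timezone


def emit_value(value: str, strip_newline: bool = True) -> str:
    """Return exactly what print() writes to stdout (hypothetical helper)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        if strip_newline:
            print(value, end="")  # no trailing newline in the extracted value
        else:
            print(value)  # default behavior appends "\n"
    return buffer.getvalue()


def utc_timestamp() -> str:
    """Timezone-aware alternative to datetime.utcnow().isoformat()."""
    return datetime.now(timezone.utc).isoformat()
```

The aware timestamp carries an explicit `+00:00` offset, so downstream consumers do not have to guess the zone.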

1. Use .belt-tools action path instead of ./ for setup-api-client
   after branch checkout, so the action runs from trusted Workflows
   code rather than the untrusted issue branch (security fix).

2. Pass GH_BELT_TOKEN || github.token as github_token input to
   preserve the belt token selection instead of overriding
   GITHUB_TOKEN/GH_TOKEN with the default workflow token.
…eshold

Two independent fixes for broken automation flows:

1. capability_check.py: The bare \bsecrets?\b regex matched negative
   mentions like 'no secrets' in issue constraint text, causing
   _requires_admin_access() to return true and the fallback classifier
   to BLOCK tasks that merely *describe* a no-secrets constraint.
   Replace with specific verb+secrets patterns (manage/configure/set/
   create/update/delete/add/modify/rotate secrets).
   Root cause of PAEM #1403 false-positive BLOCKED.

2. verdict_policy.py: CONCERNS_NEEDS_HUMAN_THRESHOLD lowered from 0.85
   to 0.50.  The old threshold meant any split verdict (PASS + CONCERNS)
   with <85% confidence on the concerns side triggered needs_human,
   blocking automatic follow-up issue creation.  A 72% confidence
   concerns verdict (TMP #4894) is well above chance and should produce
   a follow-up rather than require manual triage.

Both template and main copies updated; new regression tests added.
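
The false positive described in item 1 can be reproduced with a few lines. This is a sketch under assumptions: the verb list comes from the commit message, but the exact pattern in capability_check.py is not shown here, so `VERB_SECRETS` and `mentions_secret_admin` are hypothetical.

```python
import re

# Old, overly broad pattern: any mention of "secret" or "secrets".
BARE_SECRETS = re.compile(r"\bsecrets?\b", re.IGNORECASE)

# Verb-specific pattern (assumed shape; verbs taken from the commit message).
VERB_SECRETS = re.compile(
    r"\b(?:manage|configure|set|create|update|delete|add|modify|rotate)\s+secrets?\b",
    re.IGNORECASE,
)


def mentions_secret_admin(text: str, pattern: "re.Pattern") -> bool:
    """Return True when the pattern finds a secrets-related phrase in text."""
    return pattern.search(text) is not None


constraint = "Constraint: use no secrets in this workflow"
request = "Rotate secrets for the deploy job"
```

With these inputs the bare pattern flags the constraint text, while the verb-specific pattern flags only the actual request.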

Three-layer fix for the systemic issue where setup-api-client's npm install
overwrites vendored minimatch package.json, and git add -A captures the
modification into bootstrap/autofix commits.

Layer 1 (source fix): setup-api-client/action.yml
  - Snapshot vendored package.json files before npm install
  - Restore them after npm install completes
  - Applied to both .github/actions/ and templates/consumer-repo/

Layer 2 (targeted staging): reusable-agents-issue-bridge.yml
  - Replace 'git add -A' with targeted 'git add agents/${AGENT}-${ISSUE}.md'
  - Only the bootstrap file gets staged, not npm side-effects

Layer 3 (safety net): reusable-18-autofix.yml
  - Add 'git reset HEAD -- .github/scripts/node_modules ...' after git add -A
  - Matches existing pattern in reusable-codex-run.yml line 1184
  - Applied to both push-commit and patch-commit paths

Also fixes test assertions that referenced the old CONCERNS_NEEDS_HUMAN_THRESHOLD
(was 0.85, now 0.50) — confidence values in tests updated accordingly.

Fixes: Copilot review finding on PAEM PR #1417 (minimatch vendoring cycle)
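
The snapshot-and-restore idea in Layer 1 lives in action.yml, which is not reproduced here. As a rough Python analogue (an assumption for illustration, with `preserve_files` a hypothetical name), a context manager can capture file contents before a step that may overwrite them and write them back afterwards:

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterable, Iterator


@contextmanager
def preserve_files(paths: Iterable[Path]) -> Iterator[None]:
    """Snapshot the given files, then restore them when the block exits."""
    snapshot = {
        path: path.read_bytes() for path in map(Path, paths) if path.is_file()
    }
    try:
        yield  # the caller runs the step that may clobber the files
    finally:
        # Restore even if the wrapped step raised.
        for path, original in snapshot.items():
            path.write_bytes(original)
```

Restoring in `finally` means a failed install still leaves the vendored metadata as it was, which is the property the staging steps in Layers 2 and 3 rely on as a backstop.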

The needs_human gate was backwards: it fired when the CONCERNS provider
had LOW confidence (LLM unsure there's a problem) instead of HIGH
confidence (LLM confident there's a real problem).

Confidence reflects the LLM's certainty in its own evaluation, not a
measure of code quality. Low-confidence CONCERNS is a weak signal that
shouldn't block follow-up automation. High-confidence CONCERNS is the
stronger signal warranting human review.

Changed: confidence_value < threshold  →  confidence_value >= threshold
Threshold set to 0.85 (high bar — a human is already in the loop and
depth-of-rounds provides an independent guard against runaway automation).
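
The direction change can be sketched as a single comparison. The constant name and value come from the commit message; the function `split_verdict_needs_human` is a hypothetical simplification that looks only at the CONCERNS side of a PASS + CONCERNS split.

```python
CONCERNS_NEEDS_HUMAN_THRESHOLD = 0.85  # value stated in the commit message


def split_verdict_needs_human(concerns_confidence: float) -> bool:
    """Escalate a PASS + CONCERNS split only when CONCERNS is high-confidence.

    The old gate used `<`, which escalated weak signals; `>=` escalates
    strong ones instead.
    """
    return concerns_confidence >= CONCERNS_NEEDS_HUMAN_THRESHOLD
```

Under this sketch a 0.72 concerns verdict proceeds to automatic follow-up, while 0.85 and above is routed to a human.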

- Relax verb-to-secret regex from \s+ to .{0,30} so phrases like
  'Set repository secret TOKEN' and 'Update GitHub Actions secret FOO'
  are correctly blocked even with intervening words (addresses Codex
  inline review on capability_check.py L165)
- Add 2 regression tests for the above patterns
- Resolve merge conflicts in 4 test files (keep >= 0.85 threshold
  logic; main had lowered to 0.50 with < direction)
- Restore CONCERNS_NEEDS_HUMAN_THRESHOLD = 0.85 (auto-merge picked up
  main's 0.50 value but our >= comparison direction)

All 1907 tests pass.
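
To see what the `\s+` to `.{0,30}` relaxation changes, the sketch below compares both forms on the two phrases quoted above. The pattern shapes are assumptions reconstructed from the commit text; `STRICT` and `RELAXED` are hypothetical names, not identifiers from capability_check.py.

```python
import re

VERBS = r"(?:manage|configure|set|create|update|delete|add|modify|rotate)"

# Verb must be followed directly by whitespace and then "secret(s)".
STRICT = re.compile(rf"\b{VERBS}\s+secrets?\b", re.IGNORECASE)

# Verb may be followed by up to 30 arbitrary characters before "secret(s)".
RELAXED = re.compile(rf"\b{VERBS}\b.{{0,30}}\bsecrets?\b", re.IGNORECASE)

phrases = [
    "Set repository secret TOKEN",
    "Update GitHub Actions secret FOO",
]
results = {p: (bool(STRICT.search(p)), bool(RELAXED.search(p))) for p in phrases}
```

The wider gap is a trade-off: it catches intervening words, but it can also match a verb and an unrelated "secret" that happen to sit within 30 characters of each other.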
…s, task-focused prompts, PR meta debounce

- ledger_migrate_base.py: skip corrupt YAML files instead of blocking all
  belt worker runs (root cause of issue #1418 stall)
- agents-autofix-loop: reduce max_attempts 3→2 (standard) and 2→1
  (escalated) to cut autofix churn observed in PR #4906
- agents-72-codex-belt-worker: emit task_title output and include
  task-focused directive in activation comment for higher first-commit
  success rate
- agents-pr-meta: add PR-number concurrency grouping with
  cancel-in-progress for pull_request events to debounce redundant runs
- All template counterparts updated in sync
- 2 new tests for corrupt ledger handling
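
The skip-instead-of-block behavior for corrupt ledgers can be sketched as follows. To keep the example standard-library only, JSON stands in for YAML here (the real script parses YAML), and `load_ledgers` is a hypothetical function, not the one in ledger_migrate_base.py.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def load_ledgers(paths: Iterable[Path]) -> Tuple[Dict[str, dict], List[str]]:
    """Parse every ledger that can be parsed; record the rest as skipped."""
    loaded: Dict[str, dict] = {}
    skipped: List[str] = []
    for path in paths:
        try:
            loaded[path.name] = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            # One bad file must not stop the remaining ledgers from loading.
            skipped.append(path.name)
    return loaded, skipped
```

Returning the skipped names alongside the parsed ledgers lets the caller summarize what was ignored rather than failing the whole run.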
Copilot AI review requested due to automatic review settings February 12, 2026 15:24
@stranske stranske temporarily deployed to agent-high-privilege February 12, 2026 15:24 — with GitHub Actions Inactive
@stranske-keepalive
Contributor

⚠️ Action Required: Unable to determine source issue for PR #1484. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

@stranske-keepalive
Contributor

stranske-keepalive bot commented Feb 12, 2026

Automated Status Summary

Head SHA: e896d7e
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

| Workflow / Job | Result | Logs |
| --- | --- | --- |
| (no jobs reported) | ⏳ pending | |

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| --- | --- |
| Current | 93.12% |
| Baseline | 85.00% |
| Delta | +8.12% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| --- | --- | --- |
| src/cli_parser.py | 81.8% | 4 |
| src/percentile_calculator.py | 95.0% | 1 |
| src/aggregator.py | 95.0% | 2 |
| src/__init__.py | 100.0% | 0 |
| src/ndjson_parser.py | 100.0% | 0 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

No scope information available

Tasks

  • No tasks defined

Acceptance criteria

  • No acceptance criteria defined

@agents-workflows-bot
Contributor

agents-workflows-bot bot commented Feb 12, 2026

🤖 Keepalive Loop Status

PR #1484 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| --- | --- |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/0 complete |
| Timeout | 45 min (default) |
| Timeout usage | 3m elapsed (8%, 42m remaining) |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 371370e04b

Contributor

Copilot AI left a comment

Pull request overview

Updates automation/workflow behavior to reduce noisy vendored node_modules diffs, tighten “needs human” gating to high-confidence split verdicts, and harden capability/admin checks and ledger migration behavior.

Changes:

  • Reworked verdict policy so split PASS+CONCERNS only triggers needs_human when CONCERNS confidence is high (>= 0.85), and updated tests accordingly.
  • Improved admin capability fallback detection for “set/update … secret(s)” phrasing and added regression tests.
  • Reduced unintended node_modules staging/commits in automation workflows; added ledger migration resilience for corrupt YAML ledgers; refined keepalive/meta workflow concurrency behavior.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| scripts/langchain/verdict_policy.py | Switch split-verdict needs_human gating to high-confidence threshold (>= 0.85). |
| tests/test_verdict_policy.py | Align unit tests and messaging with updated threshold semantics. |
| tests/test_verdict_policy_integration.py | Align integration parity tests for workflow vs follow-up verdict handling. |
| tests/test_verdict_extract.py | Update expectations for needs_human with higher CONCERNS confidence. |
| tests/test_followup_issue_generator.py | Update follow-up labeling test to reflect high-confidence needs-human. |
| scripts/langchain/capability_check.py | Broaden admin-required regexes to catch “verb … secret(s)” with intervening words. |
| templates/consumer-repo/scripts/langchain/capability_check.py | Mirror capability-check regex changes in consumer template. |
| tests/scripts/test_capability_check.py | Add regression tests for “set repository secret” / “update actions secret” fallback blocking. |
| scripts/ledger_migrate_base.py | Skip corrupt ledgers (YAML/Migration errors) while continuing processing; summarize skips. |
| tests/scripts/test_ledger_migrate_base.py | Add coverage for skipping corrupt ledgers in normal and --check modes. |
| .github/actions/setup-api-client/action.yml | Snapshot/restore vendored node_modules package metadata around npm install. |
| templates/consumer-repo/.github/actions/setup-api-client/action.yml | Mirror setup-api-client vendored metadata snapshot/restore in template. |
| .github/workflows/reusable-agents-issue-bridge.yml | Avoid staging unintended files by adding only the bootstrap markdown file. |
| .github/workflows/reusable-18-autofix.yml | Unstage vendored node_modules paths before committing autofix results. |
| .github/workflows/agents-pr-meta-v4.yml | Concurrency group now distinguishes PR events; cancel-in-progress for PR runs. |
| templates/consumer-repo/.github/workflows/agents-pr-meta.yml | Same concurrency refinement for the consumer thin-caller workflow. |
| .github/workflows/agents-autofix-loop.yml | Reduce max attempts; cap escalated PR attempts to 1. |
| templates/consumer-repo/.github/workflows/agents-autofix-loop.yml | Mirror autofix-loop attempt reductions in template. |
| .github/workflows/agents-72-codex-belt-worker.yml | Emit task_title output and include a narrowed “focus task” directive in activation comment. |
| templates/consumer-repo/.github/workflows/agents-72-codex-belt-worker.yml | Mirror belt-worker task_title output and activation comment directive in template. |

Comment on lines 906 to 909
    with open(gh_output, 'a', encoding='utf-8') as handle:
        handle.write(f"task_id={start_info['task']['id'] if start_info['task'] else ''}\n")
        handle.write(f"task_title={start_info['task']['title'] if start_info['task'] else ''}\n")
        handle.write(f"task_status={start_info['task']['current_status'] if start_info['task'] else ''}\n")

Copilot AI Feb 12, 2026

Writing task_title to $GITHUB_OUTPUT as a single key=value line will break output parsing if a ledger task title contains a newline (or other characters that require the multiline output format). Consider sanitizing task_title (e.g., replace newlines) or emitting it using the official multiline <<DELIM syntax so any title is safe.

Suggested change
-    with open(gh_output, 'a', encoding='utf-8') as handle:
-        handle.write(f"task_id={start_info['task']['id'] if start_info['task'] else ''}\n")
-        handle.write(f"task_title={start_info['task']['title'] if start_info['task'] else ''}\n")
-        handle.write(f"task_status={start_info['task']['current_status'] if start_info['task'] else ''}\n")
+    # Prepare safe, single-line task fields for GitHub outputs
+    task = start_info['task'] or {}
+    task_id = task.get('id', '') if task else ''
+    task_title = task.get('title', '') if task else ''
+    # Sanitize title to avoid breaking GITHUB_OUTPUT parsing
+    task_title = task_title.replace('\r', ' ').replace('\n', ' ')
+    task_status = task.get('current_status', '') if task else ''
+    with open(gh_output, 'a', encoding='utf-8') as handle:
+        handle.write(f"task_id={task_id}\n")
+        handle.write(f"task_title={task_title}\n")
+        handle.write(f"task_status={task_status}\n")

if gh_output:
    with open(gh_output, 'a', encoding='utf-8') as handle:
        handle.write(f"task_id={start_info['task']['id'] if start_info['task'] else ''}\n")
        handle.write(f"task_title={start_info['task']['title'] if start_info['task'] else ''}\n")

Copilot AI Feb 12, 2026

Writing task_title to $GITHUB_OUTPUT as a single key=value line will break output parsing if a ledger task title contains a newline (or other characters that require the multiline output format). Consider sanitizing task_title (e.g., replace newlines) or emitting it using the official multiline <<DELIM syntax so any title is safe.

Suggested change
-        handle.write(f"task_title={start_info['task']['title'] if start_info['task'] else ''}\n")
+        title = start_info['task']['title'] if start_info['task'] else ''
+        handle.write("task_title<<TASK_TITLE_EOF\n")
+        handle.write(f"{title}\n")
+        handle.write("TASK_TITLE_EOF\n")

@github-actions github-actions bot added the autofix Opt-in automated formatting & lint remediation label Feb 12, 2026
@github-actions
Contributor

Autofix updated these files:

  • tests/scripts/test_ledger_migrate_base.py

…tations

Address inline review feedback on PR #1484:
- Sanitize task_title by replacing newlines/carriage returns with spaces
  before writing to $GITHUB_OUTPUT (prevents broken output parsing)
- Normalize yaml.YAMLError messages to single-line in ::warning:: annotations
  (prevents malformed GitHub Actions annotations)
- Both belt-worker copies updated in sync
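
The two options raised in review, flattening the value to one line or using the multiline delimiter form of $GITHUB_OUTPUT, can be sketched side by side. This is an illustrative sketch, not the belt-worker code: `single_line`, `warning_annotation`, and `write_output` are hypothetical helpers, and the random delimiter is one possible way to avoid collisions with the value.

```python
import uuid
from pathlib import Path


def single_line(text: str) -> str:
    """Replace CR and LF with spaces so the text fits on one line."""
    return text.replace("\r", " ").replace("\n", " ")


def warning_annotation(message: str) -> str:
    """Build a ::warning:: workflow command from a possibly multi-line message."""
    return f"::warning::{single_line(message)}"


def write_output(output_file: Path, name: str, value: str) -> None:
    """Append an output, switching to the delimiter form for multi-line values."""
    with open(output_file, "a", encoding="utf-8") as handle:
        if "\n" in value or "\r" in value:
            delimiter = f"ghadelimiter_{uuid.uuid4().hex}"
            handle.write(f"{name}<<{delimiter}\n{value}\n{delimiter}\n")
        else:
            handle.write(f"{name}={value}\n")
```

The commit above takes the flattening route for task_title; the delimiter form is the alternative when line breaks in the value need to be preserved.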
@stranske stranske temporarily deployed to agent-high-privilege February 12, 2026 16:04 — with GitHub Actions Inactive
@stranske stranske merged commit b4d5b1a into main Feb 12, 2026
37 checks passed
@stranske stranske deleted the fix/codex-log-issues branch February 12, 2026 16:09

Labels

autofix Opt-in automated formatting & lint remediation

2 participants