fix: prevent dual-agent conflict for codex in consumer template#701
fix: prevent dual-agent conflict for codex in consumer template#701
Conversation
- Added missing PYTHONPATH env var to agents-issue-optimizer.yml (Phase 1, 2, dedup) - Copied agents-capability-check.yml, agents-decompose.yml, agents-dedup.yml from template - Adapted all workflows for Workflows repo (removed self-checkout, fixed paths) - Fixed agents-auto-label.yml checkout and paths - Created SHORT_TERM_PLAN.md and SHORT_TERM_PLAN_SUMMARY.md Fixes: agents:optimize failing with 'No module named tools' error on issue #691
Remove ../ prefix from file paths in capability-check, decompose, and dedup workflows. Consumer template uses ../ because workflows-repo is a subdirectory, but Workflows repo itself doesn't need the prefix.
- Add agents-capability-check.yml to workflow docs - Add agents-decompose.yml to workflow docs - Add agents-dedup.yml to workflow docs Fixes remaining test failures in PR #694
Per Copilot review comment - this was missing PYTHONPATH which could cause the same ModuleNotFoundError this PR is fixing. All other workflows in this PR have PYTHONPATH set, this one was overlooked. Co-authored-by: copilot-pull-request-reviewer
- agents-capability-check.yml - agents-decompose.yml - agents-dedup.yml Fixes test_canonical_workflow_names_match_expected_mapping failure
**Problem:** - PRs stuck indefinitely when Gate cancelled due to API rate limits - Keepalive orchestrator defers with no automatic retry mechanism - All agent work stops until manual intervention **Solution:** 1. **Automatic retry after timeout**: Keepalive now retries after 5 minutes instead of deferring indefinitely 2. **State tracking**: Track last defer timestamp to calculate retry timing 3. **Automated recovery workflow**: New workflow runs every 10 minutes to find and retry deferred PRs automatically **Changes:** - keepalive_loop.js: Add retry-after-rate-limit logic with 5min timeout - keepalive_loop.js: Track last_defer_timestamp in state - agents-keepalive-retry-deferred.yml: New scheduled workflow to auto-retry Fixes systematic stoppage when GitHub API rate limits are exceeded.
…ment When PRs are created from issues labeled with agent:codex, the template was posting @codex start comments which triggered the UI agent. However, the CLI keepalive loop already handles codex PRs through the workflow system. Having both agents run simultaneously causes conflicts. This fix: - Skips post_agent_comment for agent:codex (CLI keepalive handles it) - Keeps post_agent_comment for other agents like agent:claude - Preserves workflow_dispatch behavior for manual overrides Addresses P2 issue raised in PR #700 review comment. See docs/keepalive/Agents.md for CLI vs UI agent distinction.
|
Summary
Testing
|
There was a problem hiding this comment.
Pull request overview
This PR addresses a critical dual-agent conflict issue where both CLI and UI agents were running simultaneously for codex-labeled issues, introduced in PR #695. The primary fix prevents the agents-issue-intake.yml workflow from posting @codex comments for codex agents, since the CLI keepalive loop handles agent activation directly.
Key Changes
- Modified consumer template logic to skip
post_agent_commentforagent:codexwhile preserving it for other agents and workflow_dispatch overrides - Added rate limit retry logic to the keepalive system with automatic deferral and retry mechanisms
- Introduced four new Phase 3 workflows (capability check, task decomposition, duplicate detection, and retry deferred) plus PYTHONPATH fixes
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| templates/consumer-repo/.github/workflows/agents-issue-intake.yml | Core fix: prevents dual-agent conflict by skipping comment posts for codex agents |
| .github/scripts/keepalive_loop.js | Adds rate limit detection and deferral logic with timestamp tracking |
| .github/workflows/agents-keepalive-retry-deferred.yml | New workflow to automatically retry rate-limited PRs after delay |
| .github/workflows/agents-capability-check.yml | New pre-flight check workflow to identify blockers before agent assignment |
| .github/workflows/agents-decompose.yml | New workflow for breaking down large issues into actionable sub-tasks |
| .github/workflows/agents-dedup.yml | New workflow for semantic duplicate issue detection |
| .github/workflows/agents-issue-optimizer.yml | Adds missing PYTHONPATH configuration to fix ModuleNotFoundError |
| .github/workflows/agents-auto-label.yml | Removes self-checkout logic and fixes PYTHONPATH for Workflows repo |
| topics.json | Adds scope and non_goals fields to issue template schema |
| tests/workflows/test_workflow_naming.py | Adds expected names for three new workflows |
| docs/ci/WORKFLOW_SYSTEM.md | Documents the three new Phase 3 workflows in the system inventory |
| docs/ci/WORKFLOWS.md | Adds descriptions for new agent workflows |
| docs/plans/SHORT_TERM_PLAN_SUMMARY.md | New file documenting 2-week Phase 3 testing plan |
| docs/plans/SHORT_TERM_PLAN.md | New detailed planning document for Phase 3 completion |
| docs/WORKFLOW_AUDIT_2026-01-09.md | New audit results documenting workflow issues and fixes |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1,119 @@ | |||
| # Short-Term Plan Summary | |||
There was a problem hiding this comment.
This PR includes numerous unrelated changes beyond the dual-agent conflict fix described in the title and description. The PR bundles together:
- The primary fix (agents-issue-intake.yml post_agent_comment logic)
- Four new workflow files (agents-keepalive-retry-deferred.yml, agents-capability-check.yml, agents-decompose.yml, agents-dedup.yml)
- PYTHONPATH fixes across multiple workflows
- Rate limit retry logic in keepalive_loop.js
- Three new documentation files (SHORT_TERM_PLAN.md, SHORT_TERM_PLAN_SUMMARY.md, WORKFLOW_AUDIT_2026-01-09.md)
- Schema changes to topics.json
- Test file updates
This makes the PR difficult to review and increases the risk of introducing issues. Consider either:
- Splitting these into separate, focused PRs
- Updating the PR description to comprehensively document all changes and their relationships
| "agents-capability-check.yml": "Capability Check", | ||
| "agents-decompose.yml": "Task Decomposition", | ||
| "agents-dedup.yml": "Duplicate Detection", |
There was a problem hiding this comment.
The new workflow file agents-keepalive-retry-deferred.yml is missing from the EXPECTED_NAMES mapping in this test file. The test test_workflow_names_match_filename_convention() will fail with "Missing expected name mapping for agents-keepalive-retry-deferred.yml" when it encounters this workflow.
Add an entry to the EXPECTED_NAMES dictionary with the workflow's display name (which should be "Agents Keepalive Retry Deferred" based on line 1 of the workflow file).
| const lastDeferTime = previousState?.last_defer_timestamp || 0; | ||
| const timeSinceLastDefer = now - lastDeferTime; | ||
| const RATE_LIMIT_RETRY_INTERVAL = 5 * 60 * 1000; // 5 minutes | ||
|
|
||
| if (timeSinceLastDefer >= RATE_LIMIT_RETRY_INTERVAL) { | ||
| // Enough time has passed, retry with reduced API usage | ||
| action = 'run'; | ||
| reason = 'retry-after-rate-limit'; | ||
| if (core) core.info(`Rate limit defer timeout reached (${Math.floor(timeSinceLastDefer / 60000)}min since last defer), retrying with reduced API calls`); | ||
| } else { | ||
| action = 'defer'; | ||
| reason = 'gate-cancelled-rate-limit-transient'; | ||
| if (core) core.info(`Rate limit detected, deferring. Will retry in ${Math.floor((RATE_LIMIT_RETRY_INTERVAL - timeSinceLastDefer) / 60000)}min`); |
There was a problem hiding this comment.
Logic issue: The rate limit retry logic will always execute on the first defer because last_defer_timestamp will be 0 (falsy) for the first occurrence. This means timeSinceLastDefer will equal the current timestamp (potentially millions of milliseconds), which will always exceed the 5-minute threshold (300,000ms).
This effectively bypasses the deferral mechanism on the first rate limit encounter. The code should initialize the defer timestamp on the first defer, not check against it. Consider either:
- Checking if
lastDeferTimeis 0 and treating that as the first defer (set the timestamp but don't retry yet) - Setting the defer timestamp immediately and only retry on subsequent checks
| const lastDeferTime = previousState?.last_defer_timestamp || 0; | |
| const timeSinceLastDefer = now - lastDeferTime; | |
| const RATE_LIMIT_RETRY_INTERVAL = 5 * 60 * 1000; // 5 minutes | |
| if (timeSinceLastDefer >= RATE_LIMIT_RETRY_INTERVAL) { | |
| // Enough time has passed, retry with reduced API usage | |
| action = 'run'; | |
| reason = 'retry-after-rate-limit'; | |
| if (core) core.info(`Rate limit defer timeout reached (${Math.floor(timeSinceLastDefer / 60000)}min since last defer), retrying with reduced API calls`); | |
| } else { | |
| action = 'defer'; | |
| reason = 'gate-cancelled-rate-limit-transient'; | |
| if (core) core.info(`Rate limit detected, deferring. Will retry in ${Math.floor((RATE_LIMIT_RETRY_INTERVAL - timeSinceLastDefer) / 60000)}min`); | |
| const lastDeferTime = previousState?.last_defer_timestamp ?? 0; | |
| const RATE_LIMIT_RETRY_INTERVAL = 5 * 60 * 1000; // 5 minutes | |
| if (!lastDeferTime) { | |
| // First time we've seen a rate limit in this state: defer and initialize timestamp | |
| action = 'defer'; | |
| reason = 'gate-cancelled-rate-limit-transient'; | |
| if (core) core.info('Rate limit detected for the first time in this state, deferring and initializing defer timestamp'); | |
| } else { | |
| const timeSinceLastDefer = now - lastDeferTime; | |
| if (timeSinceLastDefer >= RATE_LIMIT_RETRY_INTERVAL) { | |
| // Enough time has passed, retry with reduced API usage | |
| action = 'run'; | |
| reason = 'retry-after-rate-limit'; | |
| if (core) core.info(`Rate limit defer timeout reached (${Math.floor(timeSinceLastDefer / 60000)}min since last defer), retrying with reduced API calls`); | |
| } else { | |
| action = 'defer'; | |
| reason = 'gate-cancelled-rate-limit-transient'; | |
| if (core) core.info(`Rate limit detected, deferring. Will retry in ${Math.floor((RATE_LIMIT_RETRY_INTERVAL - timeSinceLastDefer) / 60000)}min`); | |
| } |
| const agentPRs = pulls.filter(pr => | ||
| pr.labels.some(label => | ||
| label.name.startsWith('agents:') || | ||
| label.name.startsWith('codex:') |
There was a problem hiding this comment.
The label filter includes codex: prefix but based on the codebase search, the standard label format is agent:codex not codex:*. The filter label.name.startsWith('codex:') will not match agent:codex labels.
Either remove this filter line if it's incorrect, or clarify if there are indeed codex:* labels used in the system. Based on the PR description and code review, only agent:codex and agents:* patterns are used.
| label.name.startsWith('codex:') | |
| label.name === 'agent:codex' |
…ment When PRs are created from issues labeled with agent:codex, the template was posting @codex start comments which triggered the UI agent. However, the CLI keepalive loop already handles codex PRs through the workflow system. Having both agents run simultaneously causes conflicts. This fix: - Skips post_agent_comment for agent:codex (CLI keepalive handles it) - Keeps post_agent_comment for other agents like agent:claude - Preserves workflow_dispatch behavior for manual overrides Supersedes PR #701 (which had unrelated commits mixed in) Addresses P2 issue raised in PR #700 review comment. See docs/keepalive/Agents.md for CLI vs UI agent distinction.
Closing - Superseded by PR #705This PR accumulated unrelated commits from the
The actual fix (skipping Bot Comments Addressed:
Closing in favor of #705. |
…ment (#705) When PRs are created from issues labeled with agent:codex, the template was posting @codex start comments which triggered the UI agent. However, the CLI keepalive loop already handles codex PRs through the workflow system. Having both agents run simultaneously causes conflicts. This fix: - Skips post_agent_comment for agent:codex (CLI keepalive handles it) - Keeps post_agent_comment for other agents like agent:claude - Preserves workflow_dispatch behavior for manual overrides Supersedes PR #701 (which had unrelated commits mixed in) Addresses P2 issue raised in PR #700 review comment. See docs/keepalive/Agents.md for CLI vs UI agent distinction.
Problem
PR #695 introduced a change that causes both CLI and UI agents to run simultaneously for codex-labeled issues:
agent:codexagents-issue-intake.ymlcreates PR withpost_agent_comment: 'true'@codex startcomment → triggers UI agent (chatgpt-codex-connector)agent:codexlabel → triggers CLI keepalive loopThis was identified by the P2 bot comment on PR #700:
Solution
Changed the consumer template logic to:
post_agent_commentforagent:codex(CLI keepalive handles it)post_agent_commentfor other agents likeagent:claudeworkflow_dispatchbehavior for manual overridesLogic Change
Context
See
docs/keepalive/Agents.mdfor the CLI vs UI agent architecture:@codexcomments@codexcomments (backup/manual mode)For codex, the CLI keepalive loop is the primary mechanism - posting comments would duplicate/conflict.
Impact
Consumer repos using this template will correctly:
agent:codexlabelsworkflow_dispatch