Skip to content

fix: prevent dual-agent conflict for codex in consumer template#701

Closed
stranske wants to merge 9 commits intomainfrom
fix/prevent-dual-agent-conflict
Closed

fix: prevent dual-agent conflict for codex in consumer template#701
stranske wants to merge 9 commits intomainfrom
fix/prevent-dual-agent-conflict

Conversation

@stranske
Copy link
Copy Markdown
Owner

@stranske stranske commented Jan 9, 2026

Problem

PR #695 introduced a change that causes both CLI and UI agents to run simultaneously for codex-labeled issues:

  1. Issue gets labeled with agent:codex
  2. agents-issue-intake.yml creates PR with post_agent_comment: 'true'
  3. Bridge posts @codex start comment → triggers UI agent (chatgpt-codex-connector)
  4. PR has agent:codex label → triggers CLI keepalive loop
  5. *Both agents conflict+x /tmp/create_missing_labels.sh && /tmp/create_missing_labels.sh

This was identified by the P2 bot comment on PR #700:

This change forces post_agent_comment to 'true' for all non-workflow_dispatch runs... causing the UI agent to run alongside the CLI keepalive loop and potentially conflict.

Solution

Changed the consumer template logic to:

  • Skip post_agent_comment for agent:codex (CLI keepalive handles it)
  • Keep post_agent_comment for other agents like agent:claude
  • Preserve workflow_dispatch behavior for manual overrides

Logic Change

# Before (always true for non-dispatch):
post_agent_comment: ${{ github.event_name == 'workflow_dispatch' && (...) || 'true' }}

# After (skip for codex):
post_agent_comment: >-
  ${{ github.event_name == 'workflow_dispatch'
      && (inputs.post_codex_comment && 'true' || 'false')
      || (needs.check_labels.outputs.agent != 'codex' && 'true' || 'false') }}

Context

See docs/keepalive/Agents.md for the CLI vs UI agent architecture:

  • CLI Agent: Triggered by workflows (keepalive loop), does NOT need @codex comments
  • UI Agent: Triggered by @codex comments (backup/manual mode)

For codex, the CLI keepalive loop is the primary mechanism - posting comments would duplicate/conflict.

Impact

Consumer repos using this template will correctly:

  • Use CLI-only automation for agent:codex labels
  • Use comment-based triggering for other agents
  • Allow manual override via workflow_dispatch

- Added missing PYTHONPATH env var to agents-issue-optimizer.yml (Phase 1, 2, dedup)
- Copied agents-capability-check.yml, agents-decompose.yml, agents-dedup.yml from template
- Adapted all workflows for Workflows repo (removed self-checkout, fixed paths)
- Fixed agents-auto-label.yml checkout and paths
- Created SHORT_TERM_PLAN.md and SHORT_TERM_PLAN_SUMMARY.md

Fixes: agents:optimize failing with 'No module named tools' error on issue #691
Remove ../ prefix from file paths in capability-check, decompose, and dedup workflows.
Consumer template uses ../ because workflows-repo is a subdirectory, but Workflows repo itself doesn't need the prefix.
- Add agents-capability-check.yml to workflow docs
- Add agents-decompose.yml to workflow docs
- Add agents-dedup.yml to workflow docs

Fixes remaining test failures in PR #694
Per Copilot review comment - this was missing PYTHONPATH which could
cause the same ModuleNotFoundError this PR is fixing. All other
workflows in this PR have PYTHONPATH set, this one was overlooked.

Co-authored-by: copilot-pull-request-reviewer
- agents-capability-check.yml
- agents-decompose.yml
- agents-dedup.yml

Fixes test_canonical_workflow_names_match_expected_mapping failure
**Problem:**
- PRs stuck indefinitely when Gate cancelled due to API rate limits
- Keepalive orchestrator defers with no automatic retry mechanism
- All agent work stops until manual intervention

**Solution:**
1. **Automatic retry after timeout**: Keepalive now retries after 5 minutes
   instead of deferring indefinitely
2. **State tracking**: Track last defer timestamp to calculate retry timing
3. **Automated recovery workflow**: New workflow runs every 10 minutes to
   find and retry deferred PRs automatically

**Changes:**
- keepalive_loop.js: Add retry-after-rate-limit logic with 5min timeout
- keepalive_loop.js: Track last_defer_timestamp in state
- agents-keepalive-retry-deferred.yml: New scheduled workflow to auto-retry

Fixes systematic stoppage when GitHub API rate limits are exceeded.
…ment

When PRs are created from issues labeled with agent:codex, the template
was posting @codex start comments which triggered the UI agent. However,
the CLI keepalive loop already handles codex PRs through the workflow
system. Having both agents run simultaneously causes conflicts.

This fix:
- Skips post_agent_comment for agent:codex (CLI keepalive handles it)
- Keeps post_agent_comment for other agents like agent:claude
- Preserves workflow_dispatch behavior for manual overrides

Addresses P2 issue raised in PR #700 review comment.
See docs/keepalive/Agents.md for CLI vs UI agent distinction.
Copilot AI review requested due to automatic review settings January 9, 2026 15:54
@chatgpt-codex-connector
Copy link
Copy Markdown

Summary

Testing

  • Not run (workflow template change).

View task →

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a critical dual-agent conflict issue where both CLI and UI agents were running simultaneously for codex-labeled issues, introduced in PR #695. The primary fix prevents the agents-issue-intake.yml workflow from posting @codex comments for codex agents, since the CLI keepalive loop handles agent activation directly.

Key Changes

  • Modified consumer template logic to skip post_agent_comment for agent:codex while preserving it for other agents and workflow_dispatch overrides
  • Added rate limit retry logic to the keepalive system with automatic deferral and retry mechanisms
  • Introduced four new Phase 3 workflows (capability check, task decomposition, duplicate detection, and retry deferred) plus PYTHONPATH fixes

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
templates/consumer-repo/.github/workflows/agents-issue-intake.yml Core fix: prevents dual-agent conflict by skipping comment posts for codex agents
.github/scripts/keepalive_loop.js Adds rate limit detection and deferral logic with timestamp tracking
.github/workflows/agents-keepalive-retry-deferred.yml New workflow to automatically retry rate-limited PRs after delay
.github/workflows/agents-capability-check.yml New pre-flight check workflow to identify blockers before agent assignment
.github/workflows/agents-decompose.yml New workflow for breaking down large issues into actionable sub-tasks
.github/workflows/agents-dedup.yml New workflow for semantic duplicate issue detection
.github/workflows/agents-issue-optimizer.yml Adds missing PYTHONPATH configuration to fix ModuleNotFoundError
.github/workflows/agents-auto-label.yml Removes self-checkout logic and fixes PYTHONPATH for Workflows repo
topics.json Adds scope and non_goals fields to issue template schema
tests/workflows/test_workflow_naming.py Adds expected names for three new workflows
docs/ci/WORKFLOW_SYSTEM.md Documents the three new Phase 3 workflows in the system inventory
docs/ci/WORKFLOWS.md Adds descriptions for new agent workflows
docs/plans/SHORT_TERM_PLAN_SUMMARY.md New file documenting 2-week Phase 3 testing plan
docs/plans/SHORT_TERM_PLAN.md New detailed planning document for Phase 3 completion
docs/WORKFLOW_AUDIT_2026-01-09.md New audit results documenting workflow issues and fixes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,119 @@
# Short-Term Plan Summary
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR includes numerous unrelated changes beyond the dual-agent conflict fix described in the title and description. The PR bundles together:

  1. The primary fix (agents-issue-intake.yml post_agent_comment logic)
  2. Four new workflow files (agents-keepalive-retry-deferred.yml, agents-capability-check.yml, agents-decompose.yml, agents-dedup.yml)
  3. PYTHONPATH fixes across multiple workflows
  4. Rate limit retry logic in keepalive_loop.js
  5. Three new documentation files (SHORT_TERM_PLAN.md, SHORT_TERM_PLAN_SUMMARY.md, WORKFLOW_AUDIT_2026-01-09.md)
  6. Schema changes to topics.json
  7. Test file updates

This makes the PR difficult to review and increases the risk of introducing issues. Consider either:

  • Splitting these into separate, focused PRs
  • Updating the PR description to comprehensively document all changes and their relationships

Copilot uses AI. Check for mistakes.
Comment on lines +167 to +169
"agents-capability-check.yml": "Capability Check",
"agents-decompose.yml": "Task Decomposition",
"agents-dedup.yml": "Duplicate Detection",
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new workflow file agents-keepalive-retry-deferred.yml is missing from the EXPECTED_NAMES mapping in this test file. The test test_workflow_names_match_filename_convention() will fail with "Missing expected name mapping for agents-keepalive-retry-deferred.yml" when it encounters this workflow.

Add an entry to the EXPECTED_NAMES dictionary with the workflow's display name (which should be "Agents Keepalive Retry Deferred" based on line 1 of the workflow file).

Copilot uses AI. Check for mistakes.
Comment on lines +1140 to +1152
const lastDeferTime = previousState?.last_defer_timestamp || 0;
const timeSinceLastDefer = now - lastDeferTime;
const RATE_LIMIT_RETRY_INTERVAL = 5 * 60 * 1000; // 5 minutes

if (timeSinceLastDefer >= RATE_LIMIT_RETRY_INTERVAL) {
// Enough time has passed, retry with reduced API usage
action = 'run';
reason = 'retry-after-rate-limit';
if (core) core.info(`Rate limit defer timeout reached (${Math.floor(timeSinceLastDefer / 60000)}min since last defer), retrying with reduced API calls`);
} else {
action = 'defer';
reason = 'gate-cancelled-rate-limit-transient';
if (core) core.info(`Rate limit detected, deferring. Will retry in ${Math.floor((RATE_LIMIT_RETRY_INTERVAL - timeSinceLastDefer) / 60000)}min`);
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic issue: The rate limit retry logic will always execute on the first defer because last_defer_timestamp will be 0 (falsy) for the first occurrence. This means timeSinceLastDefer will equal the current timestamp (potentially millions of milliseconds), which will always exceed the 5-minute threshold (300,000ms).

This effectively bypasses the deferral mechanism on the first rate limit encounter. The code should initialize the defer timestamp on the first defer, not check against it. Consider either:

  1. Checking if lastDeferTime is 0 and treating that as the first defer (set the timestamp but don't retry yet)
  2. Setting the defer timestamp immediately and only retry on subsequent checks
Suggested change
const lastDeferTime = previousState?.last_defer_timestamp || 0;
const timeSinceLastDefer = now - lastDeferTime;
const RATE_LIMIT_RETRY_INTERVAL = 5 * 60 * 1000; // 5 minutes
if (timeSinceLastDefer >= RATE_LIMIT_RETRY_INTERVAL) {
// Enough time has passed, retry with reduced API usage
action = 'run';
reason = 'retry-after-rate-limit';
if (core) core.info(`Rate limit defer timeout reached (${Math.floor(timeSinceLastDefer / 60000)}min since last defer), retrying with reduced API calls`);
} else {
action = 'defer';
reason = 'gate-cancelled-rate-limit-transient';
if (core) core.info(`Rate limit detected, deferring. Will retry in ${Math.floor((RATE_LIMIT_RETRY_INTERVAL - timeSinceLastDefer) / 60000)}min`);
const lastDeferTime = previousState?.last_defer_timestamp ?? 0;
const RATE_LIMIT_RETRY_INTERVAL = 5 * 60 * 1000; // 5 minutes
if (!lastDeferTime) {
// First time we've seen a rate limit in this state: defer and initialize timestamp
action = 'defer';
reason = 'gate-cancelled-rate-limit-transient';
if (core) core.info('Rate limit detected for the first time in this state, deferring and initializing defer timestamp');
} else {
const timeSinceLastDefer = now - lastDeferTime;
if (timeSinceLastDefer >= RATE_LIMIT_RETRY_INTERVAL) {
// Enough time has passed, retry with reduced API usage
action = 'run';
reason = 'retry-after-rate-limit';
if (core) core.info(`Rate limit defer timeout reached (${Math.floor(timeSinceLastDefer / 60000)}min since last defer), retrying with reduced API calls`);
} else {
action = 'defer';
reason = 'gate-cancelled-rate-limit-transient';
if (core) core.info(`Rate limit detected, deferring. Will retry in ${Math.floor((RATE_LIMIT_RETRY_INTERVAL - timeSinceLastDefer) / 60000)}min`);
}

Copilot uses AI. Check for mistakes.
const agentPRs = pulls.filter(pr =>
pr.labels.some(label =>
label.name.startsWith('agents:') ||
label.name.startsWith('codex:')
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The label filter includes codex: prefix but based on the codebase search, the standard label format is agent:codex not codex:*. The filter label.name.startsWith('codex:') will not match agent:codex labels.

Either remove this filter line if it's incorrect, or clarify if there are indeed codex:* labels used in the system. Based on the PR description and code review, only agent:codex and agents:* patterns are used.

Suggested change
label.name.startsWith('codex:')
label.name === 'agent:codex'

Copilot uses AI. Check for mistakes.
stranske added a commit that referenced this pull request Jan 9, 2026
…ment

When PRs are created from issues labeled with agent:codex, the template
was posting @codex start comments which triggered the UI agent. However,
the CLI keepalive loop already handles codex PRs through the workflow
system. Having both agents run simultaneously causes conflicts.

This fix:
- Skips post_agent_comment for agent:codex (CLI keepalive handles it)
- Keeps post_agent_comment for other agents like agent:claude
- Preserves workflow_dispatch behavior for manual overrides

Supersedes PR #701 (which had unrelated commits mixed in)
Addresses P2 issue raised in PR #700 review comment.
See docs/keepalive/Agents.md for CLI vs UI agent distinction.
@stranske
Copy link
Copy Markdown
Owner Author

stranske commented Jan 9, 2026

Closing - Superseded by PR #705

This PR accumulated unrelated commits from the fix/workflows-pythonpath-phase3 branch, causing confusion:

The actual fix (skipping post_agent_comment for codex) has been cleanly extracted to PR #705.

Bot Comments Addressed:

  1. Unrelated changes - Fixed by clean extraction to fix: prevent dual-agent conflict for codex by skipping post_agent_comment #705
  2. Missing workflow mapping test - Not needed in fix: prevent dual-agent conflict for codex by skipping post_agent_comment #705 (no new workflows)
  3. Rate limit retry logic issue - Superseded by fix: bypass rate-limit-only Gate cancellations - proceed with work #702 (bypass instead of defer/retry)
  4. Label filter issue - Removed in fix: prevent dual-agent conflict for codex by skipping post_agent_comment #705 (no new workflow file)

Closing in favor of #705.

@stranske stranske closed this Jan 9, 2026
stranske added a commit that referenced this pull request Jan 9, 2026
…ment (#705)

When PRs are created from issues labeled with agent:codex, the template
was posting @codex start comments which triggered the UI agent. However,
the CLI keepalive loop already handles codex PRs through the workflow
system. Having both agents run simultaneously causes conflicts.

This fix:
- Skips post_agent_comment for agent:codex (CLI keepalive handles it)
- Keeps post_agent_comment for other agents like agent:claude
- Preserves workflow_dispatch behavior for manual overrides

Supersedes PR #701 (which had unrelated commits mixed in)
Addresses P2 issue raised in PR #700 review comment.
See docs/keepalive/Agents.md for CLI vs UI agent distinction.
@stranske stranske deleted the fix/prevent-dual-agent-conflict branch January 9, 2026 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants