Fix/workflow startup failure real fix #606

Merged
stranske merged 19 commits into main from fix/workflow-startup-failure-real-fix
Jan 6, 2026

Owner

@stranske stranske commented Jan 6, 2026

No description provided.

* fix: correct YAML syntax in agents-issue-intake.yml template

The 'if' condition in the check_labels job was improperly formatted,
causing the line to wrap incorrectly with 'runs-on' ending up on the
same line. This resulted in startup_failure errors when the workflow
was deployed to consumer repos.

Changes:
- Use multiline scalar (|) for complex if condition
- Properly indent continuation lines
- Ensure runs-on is on its own line

Fixes workflow failures in stranske/Travel-Plan-Permission and other
consumer repositories using this template.

* fix: add validation safeguards for template changes

Problem: Template changes sync to 4+ consumer repos. A syntax error
in agents-issue-intake.yml caused startup_failure in all consumer
repos because there was no validation preventing bad templates.

Changes:
1. Fix YAML syntax error in check_labels job (multiline if condition)
2. Add validate_workflow_yaml.py script to catch YAML/style issues
3. Add pre-commit hook to validate templates before commit
4. Add CRITICAL section to CLAUDE.md about template changes

Safeguards added:
- Pre-commit hook blocks template commits with validation errors
- Script checks: YAML syntax, line length (100), runs-on placement
- Clear warning in CLAUDE.md with validation commands
- Enforces repo standards before sync
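
The checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not the actual validate_workflow_yaml.py, whose contents are not shown in this PR. It flags lines longer than 100 characters and any line where `runs-on:` appears after other content. The function name `find_style_issues`, the message wording, and the `BROKEN` / `FIXED` template fragments (including the `if` condition) are hypothetical.

```python
from __future__ import annotations

MAX_LINE_LENGTH = 100  # limit quoted in the commit message


def find_style_issues(text: str, max_length: int = MAX_LINE_LENGTH) -> list[str]:
    """Flag over-long lines and lines where `runs-on:` follows other content."""
    issues: list[str] = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_length:
            issues.append(f"line {number}: exceeds {max_length} characters")
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # comments may mention runs-on freely
        if "runs-on:" in stripped and not stripped.startswith("runs-on:"):
            issues.append(f"line {number}: runs-on must start its own line")
    return issues


# Hypothetical fragments: the real condition in the template is not shown here.
BROKEN = """jobs:
  check_labels:
    if: github.event_name == 'issues' runs-on: ubuntu-latest
"""

FIXED = """jobs:
  check_labels:
    if: |
      github.event_name == 'issues' &&
      contains(github.event.issue.labels.*.name, 'agent:codex')
    runs-on: ubuntu-latest
"""

if __name__ == "__main__":
    for label, template in (("broken", BROKEN), ("fixed", FIXED)):
        print(label, find_style_issues(template))
```

A textual check like this is deliberately narrow. It would presumably sit alongside a real YAML parse, which this sketch omits to stay within the standard library.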

Related: Travel-Plan-Permission#253, Workflows#602

The workflow now uses the CODESPACES_WORKFLOWS secret which has
merge permissions, falling back to GITHUB_TOKEN if not available.

Successfully merged sync PRs in Manager-Database, Template, and
trip-planner using this token.

- Parse multiline REGISTERED_CONSUMER_REPOS env var instead of hardcoded list
- Add stale PR cleanup: close and delete branches for older sync PRs
- Process repos in order from REGISTERED_CONSUMER_REPOS (7 repos total)
- Increase per_page to 20 to catch multiple stale PRs
- Add stale_closed status tracking in summary

- Extract consumer repo list from maint-68-sync-consumer-repos.yml at runtime
- Use yq to parse the authoritative REGISTERED_CONSUMER_REPOS env var
- Remove duplicated hardcoded list to maintain single source of truth
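
The two ideas in these commits, reading the repo list from one multiline value and treating every sync PR except the newest as stale, can be sketched in Python. The real workflow does this with yq and the GitHub API. The helper names, the repo names in the usage example, and the rule that "older" means "lower PR number" are assumptions made for illustration only.

```python
from __future__ import annotations


def parse_registered_repos(raw: str) -> list[str]:
    """Split a multiline value into repo names, keeping order and dropping blanks."""
    repos: list[str] = []
    for line in raw.splitlines():
        name = line.strip()
        # Skip empty lines, comment lines, and repeated entries.
        if name and not name.startswith("#") and name not in repos:
            repos.append(name)
    return repos


def split_stale_prs(open_sync_prs: list[dict]) -> tuple[dict | None, list[dict]]:
    """Return (newest PR, older PRs), assuming a higher number means newer."""
    ordered = sorted(open_sync_prs, key=lambda pr: pr["number"], reverse=True)
    if not ordered:
        return None, []
    return ordered[0], ordered[1:]


if __name__ == "__main__":
    # Placeholder repo names, not the real consumer list.
    raw_value = """
      example-org/repo-a
      example-org/repo-b
    """
    print(parse_registered_repos(raw_value))
    newest, stale = split_stale_prs([{"number": 12}, {"number": 30}])
    print(newest, stale)
```

Keeping the list in one place means a new consumer repo only has to be registered once, which is the single-source-of-truth point the commit makes.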
- Change default max_length from 150 to 100 to match repo standards (black, ruff, isort)
- Add explicit encoding='utf-8' to all file operations for cross-platform compatibility
- Remove redundant condition check (already verified by elif condition)

- Add critical section to CLAUDE.md about checking new workflows for file artifacts
- Create comprehensive WORKFLOW_ARTIFACT_CHECKLIST.md with decision trees and examples
- Document common artifact patterns that cause merge conflicts in consumer repos
- Provide recovery procedures for artifact pollution
- Emphasize template workflows sync to 7+ repos (one mistake = 7+ conflicts)

- Require addressing ALL bot comments before merging PRs
- Document that bot comments are mandatory fixes, not suggestions
- Provide process for evaluating and resolving bot feedback
- Emphasize impact: ignored comments → bugs in 7+ consumer repos
- Add examples of critical issues bots catch (encoding, defaults, logic)

- Add workflow to EXPECTED_NAMES test mapping
- Document in docs/ci/WORKFLOWS.md with description
- Add to docs/ci/WORKFLOW_SYSTEM.md workflow table
- Fixes test failures: test_canonical_workflow_names_match_expected_mapping, test_workflow_names_match_filename_convention, test_inventory_docs_list_all_workflows

- Quote $repos variable in yq pipeline to prevent word splitting (SC2086)
- Quote $GITHUB_OUTPUT and $GITHUB_STEP_SUMMARY variables
- Fixes shellcheck warnings in actionlint

The fallback to GITHUB_TOKEN causes merge failures since GITHUB_TOKEN
lacks merge permissions. Require CODESPACES_WORKFLOWS secret explicitly.

Keep CODESPACES_WORKFLOWS without fallback to fix merge permissions.

Prevents merge conflicts and wasted CI resources by requiring
git fetch/merge before gh pr create.

- Consumer repos should automatically create bootstrap PRs when issues
  are labeled with agent:codex or similar labels
- Previously used 'invite' mode which only waits for humans to create PRs
- Changed template to 'create' mode to enable automatic PR creation
- This will propagate to all consumer repos via sync workflow

Also fixed line length issues to pass validation.

Root cause: The reusable issue bridge workflow was hardcoded to always
use 'invite' mode for issue events, ignoring the mode input parameter.
This prevented automatic PR creation when issues are labeled.

The logic at lines 257-268 always overrides mode to 'invite' when
eventName === 'issues', with the rationale that 'the human post lands
on the issue'. However, this breaks the desired workflow of auto-creating
bootstrap PRs when issues are labeled with agent:codex.

Solution:
- Add force_mode boolean input to reusable workflow
- When force_mode=true, respect the mode input regardless of event type
- Update consumer template to pass force_mode: true
- This allows mode: create to work for issue events while maintaining
  backward compatibility (default force_mode=false preserves old behavior)

This is the correct fix after 5 attempts - the previous attempts only
changed the mode input but didn't account for the hardcoded override.
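
The override and its escape hatch can be restated as a small decision function. The sketch below is a Python rendering of the behaviour described in this commit message, not the workflow's actual script (the `eventName === 'issues'` quote suggests the real logic is JavaScript). The function name `resolve_mode` is hypothetical, and passing the mode through unchanged for non-issue events is an assumption.

```python
def resolve_mode(event_name: str, requested_mode: str, force_mode: bool = False) -> str:
    """Pick the effective bridge mode for an event.

    force_mode=True respects the caller's mode for every event type.
    force_mode=False keeps the old behaviour: issue events always use 'invite'.
    """
    if force_mode:
        return requested_mode
    if event_name == "issues":
        return "invite"
    # Assumption: other events were never overridden.
    return requested_mode


if __name__ == "__main__":
    print(resolve_mode("issues", "create"))                   # old behaviour
    print(resolve_mode("issues", "create", force_mode=True))  # consumer template after this fix
```

Defaulting `force_mode` to false is what preserves backward compatibility: callers that do not pass the new input see no change.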
Root cause: PR #89 removed the permissions block from the sync job,
breaking the chatgpt_sync workflow that processes topic files to create
issues.

The sync job calls agents-63-issue-intake.yml which needs:
- contents: read (to checkout repo and read topic files)
- issues: write (to create/update issues)
- id-token: write (for GitHub OIDC token)
- models: read (for LangChain formatting with GitHub Models)

Without these permissions, the workflow cannot process files or create
issues from topic files.

This fixes the actual issue - file processing in chatgpt_sync mode.
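
A regression like the one described here (a permissions block silently removed) could be caught by comparing what a job declares against what it needs. The following is a rough Python sketch under stated assumptions: the required scopes are taken from the list above, an undeclared scope is treated as `none`, and the helper name `missing_permissions` is hypothetical. It is not part of the repository's tooling as far as this PR shows.

```python
from __future__ import annotations

# Scopes the sync job needs, as listed in the commit message.
REQUIRED_PERMISSIONS = {
    "contents": "read",
    "issues": "write",
    "id-token": "write",
    "models": "read",
}

_LEVELS = {"none": 0, "read": 1, "write": 2}


def missing_permissions(
    declared: dict[str, str],
    required: dict[str, str] = REQUIRED_PERMISSIONS,
) -> dict[str, str]:
    """Return the required scopes that `declared` does not grant at a sufficient level."""
    missing: dict[str, str] = {}
    for scope, needed in required.items():
        granted = declared.get(scope, "none")  # assumption: undeclared means none
        if _LEVELS.get(granted, 0) < _LEVELS[needed]:
            missing[scope] = needed
    return missing


if __name__ == "__main__":
    # An empty mapping stands in for a job whose permissions block was removed.
    print(missing_permissions({}))
    print(missing_permissions(dict(REQUIRED_PERMISSIONS)))
```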
COMPLETE ROOT CAUSE ANALYSIS:

The format_created_issues job in agents-63-issue-intake.yml uses:
  GH_TOKEN: ${{ github.token }}
  GITHUB_TOKEN: ${{ github.token }}

When a reusable workflow is called:
- Without secrets: inherit → github.token has NO permissions
- With explicit secrets → github.token still has NO permissions
- With secrets: inherit → github.token gets caller's permissions

The consumer template was passing explicit secrets (SERVICE_BOT_PAT,
OWNER_PR_PAT) but NOT using 'secrets: inherit'. This meant:
1. The sync job in the reusable workflow couldn't use github.token
2. gh CLI and GitHub API calls failed with permission errors
3. Files were processed but issues couldn't be created/updated

The permissions block on the sync job sets what github.token CAN have,
but secrets: inherit is what actually PASSES that token to the reusable
workflow with those permissions.

This is the actual fix. Testing flow:
1. User triggers workflow_dispatch with chatgpt_sync mode
2. route job determines mode → should_run_sync=true
3. sync job calls agents-63-issue-intake.yml with secrets: inherit
4. chatgpt_sync job has contents:read, issues:write permissions
5. format_created_issues job has those + id-token:write + models:read
6. Both jobs can use github.token with proper permissions
7. Files are processed, issues created, LangChain formatting applied
Copilot AI review requested due to automatic review settings January 6, 2026 17:06
@agents-workflows-bot
Contributor

⚠️ Action Required: Unable to determine source issue for PR #606. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

@stranske stranske temporarily deployed to agent-high-privilege January 6, 2026 17:08 — with GitHub Actions Inactive
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses workflow startup failures by modifying the agent issue intake workflow behavior and adding process documentation. The primary changes involve switching from "invite" mode to "create" mode with a force mode override, adjusting secret handling, and documenting best practices for branch synchronization.

Key Changes

  • Changed agent bridge workflow from "invite" to "create" mode with force_mode: true to override event-based mode selection
  • Added force_mode input parameter to the reusable agent bridge workflow to allow bypassing event-driven mode logic
  • Modified secret handling in sync job from explicit secrets to secrets: inherit and added permissions block

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| templates/consumer-repo/.github/workflows/agents-issue-intake.yml | Updated bridge job mode to "create" with force_mode, reformatted comments, added permissions to sync job, changed to inherited secrets |
| .github/workflows/reusable-agents-issue-bridge.yml | Added force_mode input parameter and conditional logic to override event-based mode selection |
| .github/workflows/maint-71-merge-sync-prs.yml | Removed fallback to GITHUB_TOKEN, now only uses CODESPACES_WORKFLOWS secret |
| CLAUDE.md | Added documentation section emphasizing the importance of syncing with main before creating PRs |

Comments suppressed due to low confidence (1)

templates/consumer-repo/.github/workflows/agents-issue-intake.yml:168

  • The mode has been changed from "invite" to "create" with force_mode enabled. This represents a significant behavior change that will affect how agent assignment works. Ensure this is the intended behavior and that all calling workflows and downstream systems expect this mode change. The "create" mode may have different side effects compared to "invite" mode.
      mode: "create"
      post_agent_comment: ${{ inputs.post_codex_comment && 'true' || 'false' }}

@github-actions
Contributor

github-actions bot commented Jan 6, 2026

Automated Status Summary

Head SHA: 54cc3ec
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

| Workflow / Job | Result | Logs |
| --- | --- | --- |
| (no jobs reported) | ⏳ pending | |

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| --- | --- |
| Current | 92.21% |
| Baseline | 85.00% |
| Delta | +7.21% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| --- | --- | --- |
| scripts/workflow_health_check.py | 62.6% | 28 |
| scripts/classify_test_failures.py | 62.9% | 37 |
| scripts/ledger_validate.py | 65.3% | 63 |
| scripts/mypy_return_autofix.py | 82.6% | 11 |
| scripts/ledger_migrate_base.py | 85.5% | 13 |
| scripts/fix_cosmetic_aggregate.py | 92.3% | 1 |
| scripts/coverage_history_append.py | 92.8% | 2 |
| scripts/workflow_validator.py | 93.3% | 4 |
| scripts/update_autofix_expectations.py | 93.9% | 1 |
| scripts/pr_metrics_tracker.py | 95.7% | 3 |
| scripts/generate_residual_trend.py | 96.6% | 1 |
| scripts/build_autofix_pr_comment.py | 97.0% | 2 |
| scripts/aggregate_agent_metrics.py | 97.2% | 0 |
| scripts/fix_numpy_asserts.py | 98.1% | 0 |
| scripts/sync_test_dependencies.py | 98.3% | 1 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

No scope information available

Tasks

  • No tasks defined

Acceptance criteria

  • No acceptance criteria defined

@github-actions
Contributor

github-actions bot commented Jan 6, 2026

🤖 Keepalive Loop Status

PR #606 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| --- | --- |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/0 complete |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

@stranske stranske merged commit bd48949 into main Jan 6, 2026
37 checks passed
@stranske stranske deleted the fix/workflow-startup-failure-real-fix branch January 6, 2026 17:10
stranske added a commit that referenced this pull request Jan 7, 2026
Root cause: Consumer repos were using mode: 'invite' without force_mode,
causing the reusable workflow to ignore the mode and prevent automatic
bootstrap PR creation when issues are labeled with agent:codex.

Changes:
- Change mode from 'invite' to 'create' in bridge job template
- Add force_mode: true to override issue event defaults

This template change will sync to all consumer repos:
- Travel-Plan-Permission
- Trend_Model_Project
- Manager-Database
- trip-planner
- Template
- And others

When synced, all consumer repos will support automatic PR creation when
issues are labeled with agent:* labels, fixing startup_failure issues.

Related: Trend_Model_Project#4185, PR #606
2 participants