feat: Add dynamic token load balancer for API rate limit management #1008

Merged
stranske merged 19 commits into main from feature/token-load-balancer
Jan 21, 2026

Conversation

@stranske
Owner

Summary

Implement intelligent token rotation across multiple PATs and GitHub Apps to prevent API rate limit exhaustion during keepalive operations.

Problem

The keepalive loop was failing due to API rate limit exhaustion, even though multiple token sources (PATs and Apps) were available with unused capacity. GITHUB_TOKEN was 100% exhausted while WORKFLOWS_APP had used only 3.2% of its quota.

Solution

New token_load_balancer.js Module

A comprehensive token management system that:

  1. Token Registry - Tracks all available tokens and their rate limit status
  2. Capacity-based Selection - Selects token with highest remaining capacity
  3. Task Specialization - Routes specific tasks to appropriate tokens:

     | Token | Primary Tasks | Exclusive? |
     | --- | --- | --- |
     | KEEPALIVE_APP | keepalive-loop | ✅ Yes |
     | OWNER_PR_PAT | PR creation as owner | ✅ Yes |
     | SERVICE_BOT_PAT | bot comments, labels | Primary |
     | ACTIONS_BOT_PAT | workflow dispatch | Primary |
     | GH_APP | comment handling | Primary |
     | WORKFLOWS_APP | general ops | Fallback |

  4. Proactive Rotation - Switches tokens before hitting limits
  5. Graceful Degradation - Defers when all tokens exhausted
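
The selection rules listed above could be sketched roughly as follows. This is a minimal illustration and not the code in token_load_balancer.js: the `pickToken` function, the shape of the `SPECIALIZATION` map, the task names other than `keepalive-loop`, and the fallback to shared tokens when an exclusive token is unusable are all assumptions made for the example.

```javascript
// Hypothetical specialization map, loosely mirroring the table in the description.
const SPECIALIZATION = {
  KEEPALIVE_APP: { tasks: ['keepalive-loop'], exclusive: true },
  OWNER_PR_PAT: { tasks: ['pr-creation'], exclusive: true },
  SERVICE_BOT_PAT: { tasks: ['bot-comments', 'labels'], exclusive: false },
};

// tokens: array of { id, remaining } where remaining = API calls left in the window.
// minRemaining acts as the proactive-rotation threshold.
// Returns the chosen token id, or null to signal "defer".
function pickToken(tokens, task, minRemaining = 1) {
  const usable = tokens.filter((t) => t.remaining >= minRemaining);

  // 1. An exclusive token wins outright for its own task.
  const exclusive = usable.find((t) => {
    const spec = SPECIALIZATION[t.id];
    return Boolean(spec && spec.exclusive && spec.tasks.includes(task));
  });
  if (exclusive) return exclusive.id;

  // 2. Exclusive tokens are never lent to other tasks.
  const shared = usable.filter((t) => {
    const spec = SPECIALIZATION[t.id];
    return !(spec && spec.exclusive);
  });
  if (shared.length === 0) return null;

  // 3. Prefer a primary token for the task, then the highest remaining capacity.
  const isPrimary = (t) => {
    const spec = SPECIALIZATION[t.id];
    return Boolean(spec && spec.tasks.includes(task));
  };
  shared.sort((a, b) => (isPrimary(b) - isPrimary(a)) || (b.remaining - a.remaining));
  return shared[0].id;
}
```

Raising `minRemaining` makes the sketch rotate away from a token before it is fully exhausted, which is one simple way to read "proactive rotation".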

Integration

  • checkRateLimitStatus() added to keepalive_loop.js
  • Early rate limit check at evaluation start
  • Automatic deferral when capacity exhausted
  • New workflow outputs: rate_limit_remaining, rate_limit_recommendation

Enhanced Diagnostics

Health-75 workflow now checks all 6 token types:

  • GITHUB_TOKEN
  • CODESPACES_WORKFLOWS
  • SERVICE_BOT_PAT
  • WORKFLOWS_APP
  • KEEPALIVE_APP
  • GH_APP

Files Changed

  • .github/scripts/token_load_balancer.js (new)
  • .github/scripts/keepalive_loop.js (rate limit integration)
  • .github/workflows/agents-keepalive-loop.yml (new outputs)
  • .github/workflows/health-75-api-rate-diagnostic.yml (all 6 tokens)
  • Templates synced

Testing

  • All JavaScript files pass syntax check
  • All YAML files valid
  • Pre-commit hooks pass
  • Ready for integration testing via Health-75 workflow

github-actions bot and others added 11 commits January 21, 2026 04:43
Implement intelligent token rotation across multiple PATs and GitHub Apps
to prevent API rate limit exhaustion during keepalive operations.

Key features:
- Token registry with capacity tracking for all available tokens
- Dynamic selection based on remaining capacity and task requirements
- Token specialization support (exclusive/primary assignments)
- Proactive rotation before limits are hit
- Graceful degradation when all tokens low

Token specializations:
- KEEPALIVE_APP: Exclusive for keepalive-loop (isolated pool)
- OWNER_PR_PAT: Exclusive for PR creation as owner
- SERVICE_BOT_PAT: Primary for bot comments/labels
- ACTIONS_BOT_PAT: Primary for workflow dispatch
- GH_APP: Primary for comment handling

Integration:
- checkRateLimitStatus() added to keepalive_loop.js
- Early rate limit check with automatic deferral
- Rate limit outputs added to workflow

Diagnostics:
- Health-75 updated to check all 6 token types
- Aggregate totals across all token pools
Copilot AI review requested due to automatic review settings January 21, 2026 06:20
@stranske stranske temporarily deployed to agent-high-privilege January 21, 2026 06:20 — with GitHub Actions Inactive
@github-actions
Contributor

❌ Sync Manifest Validation Failed

This PR modifies files that should be synced to consumer repos, but the sync manifest is incomplete.

Required action: Update .github/sync-manifest.yml to include the new/modified files.

Why this matters

Files not declared in the manifest won't be synced to consumer repos (Template, Manager-Database, trip-planner, Travel-Plan-Permission), causing features to silently not work in those repos.

How to fix

  1. Open .github/sync-manifest.yml
  2. Add your new files to the appropriate section (workflows, prompts, scripts, etc.)
  3. Include a description explaining what the file does
  4. If the file should NOT be synced, add it to excluded: with a reason

See the workflow logs for specific files that need to be added.

@agents-workflows-bot
Contributor

⚠️ Action Required: Unable to determine source issue for PR #1008. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bbaea73d5

Comment on lines +1480 to +1482
// Determine if we should defer
result.shouldDefer = tokenLoadBalancer.shouldDefer(minRequired);
result.canProceed = !result.shouldDefer && result.totalRemaining >= minRequired;

P1: Initialize token registry before deferring on limits

Because checkRateLimitStatus unconditionally calls tokenLoadBalancer.shouldDefer(minRequired) when the module is present, the keepalive loop will defer even if the primary token has quota whenever the registry is empty (the new module is required but initializeTokenRegistry is never called in this script). shouldDefer returns true when tokenRegistry.tokens is empty (token_load_balancer.js lines 631–637), so this path will always hit the early action: 'defer' return on every run unless forceRetry is set. This effectively stalls the loop in the default configuration.
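
One way to avoid the always-defer path is to consult the registry only once it actually holds tokens. The sketch below is a hedged illustration: `tokenRegistry` is a stand-in for the module's internal state, `shouldDefer` only imitates the behaviour described in this comment, and `canProceed` with its `primaryRemaining` argument is a hypothetical helper. An `isInitialized()` helper is mentioned in a later commit on this PR, but its real implementation may differ.

```javascript
// Hypothetical stand-in for the registry inside token_load_balancer.js.
const tokenRegistry = { tokens: new Map() };

// True once at least one token has been registered.
const isInitialized = () => tokenRegistry.tokens.size > 0;

// Imitates the behaviour described in the review: an empty registry defers.
function shouldDefer(minRequired) {
  if (tokenRegistry.tokens.size === 0) return true;
  for (const info of tokenRegistry.tokens.values()) {
    if (info.remaining >= minRequired) return false;
  }
  return true;
}

// Only trust shouldDefer() when the registry has been populated; otherwise
// fall back to the primary token's own quota (an assumed input).
function canProceed(minRequired, primaryRemaining) {
  if (!isInitialized()) return primaryRemaining >= minRequired;
  return !shouldDefer(minRequired);
}
```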

Contributor

Copilot AI left a comment

Pull request overview

This PR implements a comprehensive token load balancer system to prevent API rate limit exhaustion during keepalive operations by intelligently rotating across multiple Personal Access Tokens (PATs) and GitHub Apps.

Changes:

  • Introduces new token_load_balancer.js module for dynamic token selection based on rate limit capacity
  • Integrates rate limit checking into the keepalive loop with early deferral when all tokens are exhausted
  • Enhances diagnostic workflows to monitor all 6 token types (GITHUB_TOKEN, CODESPACES_WORKFLOWS, SERVICE_BOT_PAT, WORKFLOWS_APP, KEEPALIVE_APP, GH_APP)
  • Improves task completion analysis to support numbered checklists and issue-only tasks
  • Refactors Source section extraction to properly handle nested headings and code blocks

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

Show a summary per file
| File | Description |
| --- | --- |
| templates/consumer-repo/.github/scripts/token_load_balancer.js | New module implementing token registry, rate limit tracking, and optimal token selection with task specialization |
| .github/scripts/token_load_balancer.js | Duplicate of template version for repository use |
| templates/consumer-repo/.github/scripts/keepalive_loop.js | Adds rate limit status checking, enhanced Source section parsing, numbered checklist support, and issue-only task matching |
| .github/scripts/keepalive_loop.js | Duplicate of template version with same enhancements |
| templates/consumer-repo/.github/workflows/agents-keepalive-loop.yml | Adds rate_limit_remaining and rate_limit_recommendation outputs, improves YAML formatting |
| .github/workflows/agents-keepalive-loop.yml | Duplicate of template version with same outputs |
| .github/workflows/health-75-api-rate-diagnostic.yml | Expands token monitoring from 3 to 6 tokens, calculates total remaining capacity |
| .github/workflows/health-72-template-sync.yml | Uses HEAD_REF variable for consistency |
| .github/scripts/tests/keepalive-loop.test.js | Adds comprehensive tests for new features (Source section parsing, numbered checklists, issue-only tasks) |
| agents/codex-1001.md | Bootstrap file for codex on issue #1001 |

Comment on lines +297 to +309
const percentUsed = core_limit.limit > 0
? ((core_limit.used / core_limit.limit) * 100).toFixed(1)
: 0;

return {
limit: core_limit.limit,
remaining: core_limit.remaining,
used: core_limit.used,
reset: core_limit.reset * 1000,
checked: Date.now(),
percentUsed: parseFloat(percentUsed),
percentRemaining: 100 - parseFloat(percentUsed),
};

Copilot AI Jan 21, 2026

The percentUsed calculation uses toFixed which returns a string, but this string is then parsed back to a number. The percentRemaining calculation on line 308 will subtract a string from 100, which in JavaScript will work due to type coercion but could lead to unexpected behavior. Consider using the numeric value directly instead of converting to string and back.
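
For context, the snippet above already passes the `toFixed` string through `parseFloat` before subtracting, so the arithmetic there is numeric. A variant that never leaves number-land might look like the sketch below; `roundTo1` and `percentages` are hypothetical names, and the `{ limit, used }` shape simply follows the quoted code.

```javascript
// Round to one decimal place without going through a string.
function roundTo1(value) {
  return Math.round(value * 10) / 10;
}

// Hypothetical helper: both results are plain numbers.
function percentages({ limit, used }) {
  const percentUsed = limit > 0 ? roundTo1((used / limit) * 100) : 0;
  return { percentUsed, percentRemaining: roundTo1(100 - percentUsed) };
}

console.log(typeof (3.2).toFixed(1)); // toFixed returns a string
console.log(percentages({ limit: 5000, used: 160 }));
```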

Comment on lines +492 to +495
tokenInfo.rateLimit.percentUsed = tokenInfo.rateLimit.limit > 0
? ((tokenInfo.rateLimit.used / tokenInfo.rateLimit.limit) * 100).toFixed(1)
: 0;
tokenInfo.rateLimit.percentRemaining = 100 - tokenInfo.rateLimit.percentUsed;

Copilot AI Jan 21, 2026

The percentUsed calculation uses toFixed which returns a string, but this string is then used in arithmetic operations. The percentRemaining assignment will perform string-to-number coercion. This is similar to the issue in checkTokenRateLimit. Consider calculating both as numbers directly to avoid type inconsistencies.

Comment on lines +124 to +164
async function initializeTokenRegistry({ secrets, github, core, githubToken }) {
tokenRegistry.tokens.clear();

// Register GITHUB_TOKEN (always available)
if (githubToken) {
registerToken({
id: 'GITHUB_TOKEN',
token: githubToken,
type: 'GITHUB_TOKEN',
source: 'github.token',
capabilities: TOKEN_CAPABILITIES.GITHUB_TOKEN,
priority: 0, // Lowest priority (most restricted)
});
}

// Register PATs (check for PAT1, PAT2, etc. pattern as well as named PATs)
const patSources = [
{ id: 'SERVICE_BOT_PAT', env: secrets.SERVICE_BOT_PAT, account: 'stranske-automation-bot' },
{ id: 'ACTIONS_BOT_PAT', env: secrets.ACTIONS_BOT_PAT, account: 'stranske-automation-bot' },
{ id: 'CODESPACES_WORKFLOWS', env: secrets.CODESPACES_WORKFLOWS, account: 'stranske' },
{ id: 'OWNER_PR_PAT', env: secrets.OWNER_PR_PAT, account: 'stranske' },
{ id: 'AGENTS_AUTOMATION_PAT', env: secrets.AGENTS_AUTOMATION_PAT, account: 'unknown' },
// Numbered PATs for future expansion
{ id: 'PAT_1', env: secrets.PAT_1, account: 'pool' },
{ id: 'PAT_2', env: secrets.PAT_2, account: 'pool' },
{ id: 'PAT_3', env: secrets.PAT_3, account: 'pool' },
];

for (const pat of patSources) {
if (pat.env) {
registerToken({
id: pat.id,
token: pat.env,
type: 'PAT',
source: pat.id,
account: pat.account,
capabilities: TOKEN_CAPABILITIES.PAT,
priority: 5, // Medium priority
});
}
}

Copilot AI Jan 21, 2026

The initializeTokenRegistry function does not validate that the secrets object exists before accessing its properties. If secrets is undefined or null, this will throw an error. Consider adding a guard check at the beginning of the function.

.filter(line => /^\s*[-*+]\s*\[\s*\]/.test(line))
.map(line => {
- const match = line.match(/^\s*[-*+]\s*\[\s*\]\s*(.+)$/);
+ const match = line.match(/^\s*(?:[-*+]|\d+[.)])\s*\[\s*\]\s*(.+)$/);

Copilot AI Jan 21, 2026

The regex pattern used to match numbered checklist items includes both '1.' and '1)' formats. However, the pattern requires whitespace after the number/marker with '\s*', which means '1.[ ]' (without space) won't match. Consider if this is the intended behavior or if the pattern should be more flexible.
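
Since `\s*` matches zero or more whitespace characters, the updated pattern appears to accept `1.[ ]` without a space as well. A quick check like the following, using only the pattern quoted above plus made-up sample lines, is one way to confirm the behaviour before deciding whether it is intended.

```javascript
// Pattern copied from the updated line quoted above.
const uncheckedItem = /^\s*(?:[-*+]|\d+[.)])\s*\[\s*\]\s*(.+)$/;

// Sample lines are invented for this check.
const samples = [
  '- [ ] bullet task',
  '1. [ ] dotted task',
  '2) [ ] paren task',
  '3.[ ] no space after marker',
  '- [x] already done',
];

for (const line of samples) {
  const match = line.match(uncheckedItem);
  // Prints the captured task text, or null when the line is not an unchecked item.
  console.log(JSON.stringify(line), '->', match ? match[1] : null);
}
```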

Comment on lines +2784 to +2787
const strippedIssueTask = task
.replace(/\[[^\]]*\]\(([^)]+)\)/g, '$1')
.replace(/https?:\/\/\S+/gi, '')
.replace(/[#\d]/g, '')

Copilot AI Jan 21, 2026

The strippedIssueTask calculation removes all digits with .replace(/[#\d]/g, ''), which could inadvertently remove legitimate digits that are part of the task description, not just the issue number. This might cause false positives for isIssueOnlyTask. Consider a more targeted approach that only removes the specific issue reference pattern.

Suggested change
- const strippedIssueTask = task
- .replace(/\[[^\]]*\]\(([^)]+)\)/g, '$1')
- .replace(/https?:\/\/\S+/gi, '')
- .replace(/[#\d]/g, '')
+ let strippedIssueTask = task
+ .replace(/\[[^\]]*\]\(([^)]+)\)/g, '$1')
+ .replace(/https?:\/\/\S+/gi, '');
+ if (issuePattern) {
+ strippedIssueTask = strippedIssueTask.replace(issuePattern, '');
+ }
+ strippedIssueTask = strippedIssueTask
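
To see why the broad digit removal matters, compare it with a targeted removal on a task that legitimately contains a number. The issue number, the sample task text, and both helper names below are assumptions for illustration; the real `issuePattern` is built elsewhere in keepalive_loop.js and may look different.

```javascript
// Hypothetical pattern for issue 1001.
const issuePattern = /#1001\b/g;

// Broad approach from the original diff: drops every '#' and every digit.
function stripBroad(task) {
  return task.replace(/[#\d]/g, '').trim();
}

// Targeted approach: drops only the specific issue reference.
function stripTargeted(task, pattern) {
  return task.replace(pattern, '').trim();
}

const task = 'Upgrade to Node 20 (#1001)';
console.log(stripBroad(task)); // the "20" is lost along with the issue number
console.log(stripTargeted(task, issuePattern)); // the "20" survives
```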

Comment on lines +460 to +477
let token = best.tokenInfo.token;
if (best.tokenInfo.type === 'APP' && !token) {
token = await mintAppToken({ tokenInfo: best.tokenInfo, core });
best.tokenInfo.token = token;
}

core?.info?.(`Selected token: ${best.id} (${best.remaining} remaining, ${best.percentRemaining.toFixed(1)}% capacity)${best.isPrimary ? ' [primary]' : ''}`);

return {
token,
source: best.id,
type: best.tokenInfo.type,
remaining: best.remaining,
percentRemaining: best.percentRemaining,
percentUsed: best.tokenInfo.rateLimit?.percentUsed ?? 0,
isPrimary: best.isPrimary,
task,
};

Copilot AI Jan 21, 2026

After minting an app token, if the token is null (minting failed), the function continues and returns a response with token: null. This could cause issues for callers expecting a valid token. Consider adding a check after token minting and returning null from getOptimalToken if the token couldn't be minted, or trying the next candidate.
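
The "try the next candidate" option could be sketched as a loop over the ranked candidates. This is an illustration under assumptions: `firstUsableToken` is a hypothetical name, `mint` stands in for `mintAppToken` and is assumed to resolve to a token string or `null`, and the candidate shape is reduced to `{ id, type, token }`.

```javascript
// candidates: ordered best-first, each { id, type, token }.
// mint: async function returning a token string or null (stand-in for mintAppToken).
async function firstUsableToken(candidates, mint) {
  for (const candidate of candidates) {
    let token = candidate.token;
    if (candidate.type === 'APP' && !token) {
      token = await mint(candidate);
      if (token) candidate.token = token; // cache only successful mints
    }
    if (token) {
      return { token, source: candidate.id, type: candidate.type };
    }
    // Minting failed or no token was stored: fall through to the next candidate.
  }
  return null; // nothing usable; the caller can treat this as "defer"
}
```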

Comment on lines +383 to +399
if (tokenInfo.type === 'APP' && !token) {
token = await mintAppToken({ tokenInfo, core });
tokenInfo.token = token;
}
if (token) {
return {
token,
source: id,
type: tokenInfo.type,
remaining: tokenInfo.rateLimit?.remaining ?? 0,
percentRemaining: tokenInfo.rateLimit?.percentRemaining ?? 0,
percentUsed: tokenInfo.rateLimit?.percentUsed ?? 0,
exclusive: true,
task,
};
}
}

Copilot AI Jan 21, 2026

Similarly to the main token selection path, after minting an app token for an exclusive task (line 384-385), if the token is null (minting failed), the code checks 'if (token)' but this means no token is returned for that exclusive task. The function will then fall through to general token selection, which may violate the exclusivity contract. Consider handling the failed minting case more explicitly.
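
One explicit way to handle this is to distinguish "this task has no exclusive token" from "the exclusive token could not be obtained", so the caller only continues to general selection in the first case. The function name, the `exclusiveOwners` map, and the status strings below are hypothetical; `mint` again stands in for `mintAppToken`.

```javascript
// exclusiveOwners: map from task name to its exclusive token info { id, type, token }.
// mint: async function returning a token string or null.
async function resolveExclusive(task, exclusiveOwners, mint) {
  const owner = exclusiveOwners[task];
  if (!owner) return { status: 'not-exclusive' };

  let token = owner.token;
  if (owner.type === 'APP' && !token) token = await mint(owner);

  return token
    ? { status: 'ok', token, source: owner.id }
    : { status: 'unavailable', source: owner.id };
}
```

A caller would proceed to general selection only on `'not-exclusive'` and would defer or report an error on `'unavailable'`, which keeps the exclusivity contract intact.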

Comment on lines +516 to +524
tokenInfo.rateLimit = {
limit,
remaining,
used: used || (limit - remaining),
reset: reset ? reset * 1000 : tokenInfo.rateLimit.reset,
checked: Date.now(),
percentUsed: ((limit - remaining) / limit * 100).toFixed(1),
percentRemaining: (remaining / limit * 100).toFixed(1),
};

Copilot AI Jan 21, 2026

The updateFromHeaders function has inconsistent type handling. Lines 522-523 calculate percentUsed and percentRemaining using toFixed which returns strings, while the rest of the rateLimit object uses numbers. This type inconsistency could cause issues in comparisons and calculations elsewhere in the code.
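
A header-based update that keeps every field numeric might look like the sketch below. It assumes a plain object of lower-cased response headers and uses GitHub's `x-ratelimit-limit`, `x-ratelimit-remaining`, `x-ratelimit-used`, and `x-ratelimit-reset` headers; `rateLimitFromHeaders` is a hypothetical name and is not the module's `updateFromHeaders`.

```javascript
// headers: plain object with lower-cased header names and string values.
// previous: the last known rateLimit object, returned unchanged if headers are unusable.
function rateLimitFromHeaders(headers, previous = {}) {
  const num = (name) => {
    const parsed = Number.parseInt(headers[name], 10);
    return Number.isFinite(parsed) ? parsed : null;
  };

  const limit = num('x-ratelimit-limit');
  const remaining = num('x-ratelimit-remaining');
  if (limit === null || remaining === null || limit <= 0) return previous;

  const used = num('x-ratelimit-used') ?? (limit - remaining);
  const reset = num('x-ratelimit-reset'); // epoch seconds

  return {
    limit,
    remaining,
    used,
    reset: reset !== null ? reset * 1000 : (previous.reset ?? null), // epoch ms
    checked: Date.now(),
    // One-decimal rounding done numerically, so both fields stay numbers.
    percentUsed: Math.round(((limit - remaining) / limit) * 1000) / 10,
    percentRemaining: Math.round((remaining / limit) * 1000) / 10,
  };
}
```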

if (pattern.test(updatedBody)) {
- updatedBody = updatedBody.replace(pattern, '$1[x]$2');
+ updatedBody = updatedBody.replace(pattern, '$1$2[x]$3');

Copilot AI Jan 21, 2026

The regex pattern on line 3000 attempts to match numbered checklist items for replacement. However, the capture groups have changed from the previous version. The pattern now captures '(^|\n)(\s*(?:[-+]|\d+[.)])\s)\\s*\', but the replacement '$1$2[x]$3' may not preserve the original formatting correctly. Specifically, if a match occurs at the start of a line (not after \n), the \n won't be in $1, which could affect the output formatting.

Suggested change
- updatedBody = updatedBody.replace(pattern, '$1$2[x]$3');
+ updatedBody = updatedBody.replace(
+ pattern,
+ (fullMatch, lineStart, prefix, taskText) => `${lineStart}${prefix}[x]${taskText}`,
+ );
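
As a self-contained illustration of the callback form, the sketch below builds a three-group pattern and marks one task as done. The pattern is an assumption reconstructed for the example (line start, list marker, task text) and `markDone` is a hypothetical helper; the actual pattern in keepalive_loop.js may differ.

```javascript
// Marks the first unchecked item whose text equals taskText as done.
function markDone(body, taskText) {
  // Escape regex metacharacters in the task text.
  const escaped = taskText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Group 1: start of string or newline. Group 2: indentation plus list marker.
  // Group 3: whitespace plus the task text that follows the checkbox.
  const pattern = new RegExp(
    `(^|\\n)(\\s*(?:[-*+]|\\d+[.)])\\s*)\\[\\s*\\](\\s*${escaped})`,
  );
  return body.replace(
    pattern,
    (fullMatch, lineStart, prefix, text) => `${lineStart}${prefix}[x]${text}`,
  );
}

console.log(markDone('1. [ ] first\n- [ ] second', 'second'));
```

When the match is at the very start of the body, `lineStart` is simply the empty string, so the output keeps the original line layout in both cases.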

Comment on lines +2844 to +2846
const prTitle = pr?.title;
const prRef = pr?.head?.ref;
const prMatch = issueMatchesText(issuePattern, prTitle) || issueMatchesText(issuePattern, prRef);

Copilot AI Jan 21, 2026

In the analyzeTaskCompletion function, the pr parameter is added but there's no null/undefined check before accessing pr.title and pr.head.ref on lines 2844-2845. If pr is not provided (which is optional based on the function signature), this will throw an error.

…anifest

- Add SERVICE_BOT_PAT to DISPATCH_TOKEN_KEYS for fallback support
- Add token_load_balancer.js to sync manifest
- Fixes failing keepalive-runner.test.js test
- Fixes sync manifest validation
@stranske stranske temporarily deployed to agent-high-privilege January 21, 2026 06:29 — with GitHub Actions Inactive
- Initialize token registry before shouldDefer check to prevent always-defer bug (P1)
- Fix percentUsed/percentRemaining type inconsistency (use numbers instead of strings)
- Add secrets validation in initializeTokenRegistry
- Handle failed app token minting for exclusive tasks (don't fall through)
- Handle failed app token minting in general selection (try next candidate)
- Add pr parameter null check before accessing properties
- Use targeted issue number removal (only #number patterns, not all digits)

Addresses 8 code review comments from Copilot and Codex bots.
Adds isInitialized() helper to check if token registry contains tokens
before attempting to use shouldDefer(). Required by keepalive_loop.js
P1 fix.
Add .flake8 config with max-line-length=100 to match black/ruff/isort configuration across all repos.
Fixed line length violations in workflows added/modified in main:
- agents-80-pr-event-hub.yml: Split long if conditions (2 long JS lines in script block are unavoidable)
- agents-pr-meta.yml: Multiline if conditions
- agents-81-gate-followups.yml: Split long env var expressions
- agents-verify-to-issue-v2.yml: Multiline if condition
@stranske stranske temporarily deployed to agent-high-privilege January 21, 2026 06:47 — with GitHub Actions Inactive
@github-actions github-actions bot added the autofix (Opt-in automated formatting & lint remediation) and autofix:patch labels Jan 21, 2026
@github-actions
Contributor

github-actions bot commented Jan 21, 2026

Automated Status Summary

Head SHA: d6a6935
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

| Workflow / Job | Result | Logs |
| --- | --- | --- |
| (no jobs reported) | ⏳ pending | |

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| --- | --- |
| Current | 93.12% |
| Baseline | 85.00% |
| Delta | +8.12% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| --- | --- | --- |
| src/cli_parser.py | 81.8% | 4 |
| src/percentile_calculator.py | 95.0% | 1 |
| src/aggregator.py | 95.0% | 2 |
| src/__init__.py | 100.0% | 0 |
| src/ndjson_parser.py | 100.0% | 0 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

No scope information available

Tasks

  • No tasks defined

Acceptance criteria

  • No acceptance criteria defined

@github-actions
Contributor

github-actions bot commented Jan 21, 2026

🤖 Keepalive Loop Status

PR #1008 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| --- | --- |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/5 complete |
| Timeout | 45 min (default) |
| Timeout usage | 2m elapsed (5%, 43m remaining) |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

@github-actions
Contributor

github-actions bot commented Jan 21, 2026

Status | ✅ autofix updates applied
History points | 1
Timestamp | 2026-01-21 07:06:06 UTC
Report artifact | autofix-report-pr-1008
Remaining | 0
New | 0
No additional artifacts

@github-actions
Contributor

Autofix updated these files:

  • tests/test_workflow_validator.py

- Removed leftover <<<<<<< HEAD marker at line 2823
- Fixed duplicate strippedIssueTask declaration
- Fixed missing semicolon and improper if statement placement
- Removed duplicate isIssueOnlyTask block

All JavaScript syntax now valid.
@stranske stranske force-pushed the feature/token-load-balancer branch from 2e1be17 to 472fed1 Compare January 21, 2026 06:55
@stranske stranske temporarily deployed to agent-high-privilege January 21, 2026 06:56 — with GitHub Actions Inactive
When resolving merge conflicts, accidentally removed the variable declarations
for 'confidence' and 'reason' that are used throughout analyzeTaskCompletion.
This caused 'ReferenceError: confidence/reason is not defined' in tests.
@stranske stranske force-pushed the feature/token-load-balancer branch from ab8225c to fb804c8 Compare January 21, 2026 07:04
@stranske stranske temporarily deployed to agent-high-privilege January 21, 2026 07:04 — with GitHub Actions Inactive
@stranske stranske merged commit 8a1f5ca into main Jan 21, 2026
39 checks passed
@stranske stranske deleted the feature/token-load-balancer branch January 21, 2026 07:07
stranske added a commit that referenced this pull request Feb 2, 2026
This commit adds:

1. New unified setup-api-client action (.github/actions/setup-api-client/action.yml)
   - Combines npm install + token export into one reusable action
   - Pins @octokit/* versions for consistency (20.0.2, 6.0.1, 9.1.5, 6.0.3)
   - Supports both JSON secrets and individual inputs
   - Reports token count for debugging

2. Comprehensive remediation plan (docs/fixes/RATE_LIMIT_REMEDIATION_PLAN.md)
   - Detailed PR history from #1008 to #1182
   - Root cause analysis
   - Implementation phases
   - Testing strategy
   - Handoff protocol

Next steps: Apply the action to high-frequency workflows
stranske added a commit that referenced this pull request Feb 2, 2026
* feat: add unified setup-api-client action and remediation plan

* refactor: apply setup-api-client action to keepalive workflow

This commit updates agents-keepalive-loop.yml to use the new unified
setup-api-client action instead of separate npm install + export steps.

Changes:
- Replace 4 instances of 'npm install' + 'export-load-balancer-tokens'
  with single 'setup-api-client' action
- Remove duplicate export block in evaluate job
- Add setup-api-client to summary job

Benefits:
- Single point of dependency management (pinned versions)
- Consistent token export across all jobs
- Reduced workflow file from 965 to 887 lines
- Easier maintenance - one action to update, not multiple blocks

Jobs updated:
- evaluate: lines 78-82
- mark-running: lines 332-336
- run-codex: lines 427-431
- summary: lines 639-643

* chore: add setup-api-client action to sync manifest

This ensures the new unified setup action will be synced to consumer repos.
Also marks export-load-balancer-tokens as deprecated.

* fix: add explicit 'Agent Stopped: API capacity depleted' status

When rate limits are exhausted, the summary comment now shows:

### 🛑 Agent Stopped: API capacity depleted

This replaces the misleading '🔄 Agent Running' status that previously
appeared when the agent was actually blocked by rate limits.

The new status clearly indicates:
- All token pools are exhausted
- This is NOT a code/prompt problem
- Automatic recovery when limits reset (~1 hour)

Detection logic:
- reason === 'rate-limit-exhausted'
- action === 'defer' with rate-related reason
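
The two detection rules above could be expressed as a small predicate. This is only a sketch: `isRateLimitStop` is a hypothetical name, and interpreting "rate-related reason" as a case-insensitive match on "rate limit" (with an optional hyphen, underscore, or space) is an assumption.

```javascript
// Returns true when the summary should show "Agent Stopped: API capacity depleted".
function isRateLimitStop({ action, reason } = {}) {
  if (reason === 'rate-limit-exhausted') return true;
  // Assumed meaning of "rate-related": the reason mentions a rate limit.
  return action === 'defer' && /rate[-_ ]?limit/i.test(reason || '');
}

console.log(isRateLimitStop({ action: 'defer', reason: 'rate-limit-low' }));
```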

* chore: sync template scripts

* chore(codex-autofix): apply updates (PR #1183)

* fix: update API wrapper guard to accept setup-api-client action

The check_api_wrapper_guard.py script now accepts either:
- export-load-balancer-tokens (old pattern)
- setup-api-client (new unified action)

This allows workflows to use the new setup-api-client action
without triggering lint errors.

* chore(autofix): formatting/lint

* chore(autofix): formatting/lint

* fix: skip node_modules in API guard scan

The _collect_all_files fallback was scanning node_modules directories
when the git diff failed, causing false positives from third-party
library code (e.g., @octokit type definitions containing api.github.com).

This fix explicitly skips any path containing 'node_modules' in its
parts, which is the correct behavior since we only want to lint
project code, not dependencies.

* fix: recognize ensureRateLimitWrapped as valid wrapper pattern

The github-rate-limited-wrapper.js wrapper internally uses createTokenAwareRetry
from github-api-with-retry.js to provide rate limit protection. This change
recognizes both wrappers as valid, avoiding false positives for files like
keepalive_loop.js that use the higher-level wrapper.

* fix: exclude node_modules from _is_target_file check

The previous commit added node_modules exclusion only to _collect_all_files().
When running with --base-ref (diff mode), the files come from git diff
which can include node_modules paths if lockfiles were updated. Adding the
node_modules check to _is_target_file ensures it applies to both code paths.

* fix: address code review feedback from Copilot

Changes to setup-api-client action:
- Fix output description: 'tokens exported to environment' (not 'load balancer')
- Create .github/scripts dir if it doesn't exist (avoid clutter at workspace root)
- Capture npm stderr for debugging instead of suppressing it
- Add jq availability check with warning and fallback to individual inputs
- Fix double-counting of GITHUB_TOKEN/GH_TOKEN (same value, count once)
- Add clarifying comment about empty values not being counted

Documentation updates:
- Mark Task A.1 and A.2 as DONE
- Update example to match actual implementation
- Note that package.json is not needed (inline version pinning)
- Update 'to be created' to 'created in this PR'

* fix: sync setup-api-client and updated keepalive to templates

Critical fix: The sync system copies from templates/consumer-repo/, not from
.github/workflows/. Without this commit:
- setup-api-client action wouldn't sync to consumer repos
- agents-keepalive-loop.yml updates wouldn't sync to consumer repos

Consumer repos would continue failing with 'Token registry initialized with 0 tokens'

Remaining work (documented in RATE_LIMIT_REMEDIATION_PLAN.md):
- Update other workflows to use setup-api-client (belt dispatcher, worker, conveyor, etc.)
- Currently 9 other workflows still use export-load-balancer-tokens

* docs: add remaining work section to remediation plan

Documents:
- 9 workflows still using old export-load-balancer-tokens pattern
- Priority order for updates
- Template sync architecture (templates/consumer-repo/ is the source)
- Update pattern to follow
- Verification steps after merge

* refactor: migrate ALL workflows to setup-api-client with simplified params

This comprehensive update migrates all 70 workflow files from the legacy
export-load-balancer-tokens action to the new setup-api-client action.

Changes:
- Replace all sparse-checkout paths from export-load-balancer-tokens to setup-api-client
- Replace all action usages to use simplified parameter interface:
  - secrets: ${{ toJSON(secrets) }}
  - github_token: ${{ github.token }}
- Remove individual secret parameters (service_bot_pat, token_rotation_json, etc)
- Update consumer-repo templates to match

The new setup-api-client action:
- Accepts secrets via toJSON(secrets) for automatic discovery
- Properly counts tokens without double-counting GH_TOKEN
- Includes jq availability check
- Creates .github/scripts directory if needed
- Preserves backward compatibility via optional individual secret inputs

Workflow files updated: 60+ workflows
Template files updated: 10+ consumer templates
Action synced: templates/consumer-repo/.github/actions/setup-api-client/

* fix: escape template expressions in action description

GitHub Actions doesn't allow template expressions in the description field.
Changed from ${{ }} to '{{ }}' for the usage examples.

* fix: sync agents-auto-pilot.yml template from main workflow

The consumer template was significantly out of date and still had direct
gh api calls. Synced from the main workflow which uses the rate-limit
wrapped helpers.

* chore(codex-autofix): apply updates (PR #1183)

* chore: sync template scripts

* docs: add Rate Limiting Architecture section to CLAUDE.md

Documents the relationship between:
- setup-api-client action (exports tokens to GITHUB_ENV)
- github-api-with-retry.js (reads env vars, creates token-aware wrapper)
- token_load_balancer.js (token registry and selection)

This ensures future changes to rate limiting understand the full component chain.

* ci: add template drift check workflow

Fails CI when templates in templates/consumer-repo/ drift more than 50 lines
from their main workflow counterparts in .github/workflows/.

This prevents the situation where consumers receive outdated versions
because templates weren't updated when main workflows changed.

Triggered on:
- Push/PR touching agents-*.yml workflows
- Push/PR touching consumer templates

* docs: add copilot-instructions.md with mandatory read-first rule

Explicitly instructs Copilot to read CLAUDE.md before any work.
Also documents the template sync requirement.

* fix: rename template drift check to follow naming convention

- Renamed ci-template-drift.yml → health-74-template-drift.yml
- Added to EXPECTED_NAMES in test_workflow_naming.py
- Changed to warn-only mode (pre-existing drift shouldn't block PRs)

* docs: add health-74-template-drift to workflow inventory

Required for test_inventory_docs_list_all_workflows test

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels

autofix:patch autofix Opt-in automated formatting & lint remediation

2 participants