
fix: add retry with backoff for setup-api-client npm install #1656

Merged
stranske merged 3 commits into main from claude/fix-task-completion-concerns-I1gRT
Feb 25, 2026

stranske (Owner) commented Feb 25, 2026

Source: Issue #249

Automated Status Summary

Scope

Scope section missing from source issue.

Tasks

  • Tasks section missing from source issue.

Acceptance criteria

  • Acceptance criteria section missing from source issue.

Head SHA: 85edc46
Latest Runs: ✅ success — Gate
Required: gate: ✅ success

| Workflow / Job | Result |
| --- | --- |
| .github/workflows/autofix.yml | ❌ failure |
| Agents PR meta manager | ❔ in progress |
| Gate | ✅ success |
| Health 40 Sweep | ✅ success |
| Health 44 Gate Branch Protection | ✅ success |
| Health 45 Agents Guard | ✅ success |
| Health 50 Security Scan | ✅ success |
| Health 72 Template Sync | ✅ success |
| Maint 52 Validate Workflows | ✅ success |
| PR 11 - Minimal invariant CI | ✅ success |
| Selftest CI | ✅ success |
| Validate Sync Manifest | ✅ success |

…query

ERROR_CATEGORIES.RESOURCE was undefined because the error_classifier
exports lowercase keys. This caused the isNotFound check to always fail,
so 404/resource errors from missing workflows were reported as "api_error"
instead of being handled as expected not-found responses.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
Transient npm registry 403 errors (e.g. on safe-buffer) can kill the
keepalive loop chain with no recovery. Replace the single-retry logic
with a 3-attempt loop using exponential backoff (5s, 10s, 20s). The
first failure still tries --legacy-peer-deps as before.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
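
The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in action.yml: `INSTALL_CMD`, `SLEEP_CMD`, `retry_install`, and the fake helpers are hypothetical names introduced here so the loop can be exercised without the npm registry or real delays.

```shell
# Sketch only: a 3-attempt install loop with exponential backoff.
# INSTALL_CMD and SLEEP_CMD are hypothetical hooks, not part of the PR.
INSTALL_CMD=${INSTALL_CMD:-npm_install}
SLEEP_CMD=${SLEEP_CMD:-sleep}

npm_install() { npm install "$@"; }

retry_install() {
  max_attempts=3
  backoff=5            # seconds; doubled after each sleep
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    "$INSTALL_CMD" && return 0
    # Only the first failure falls back to --legacy-peer-deps.
    if [ "$attempt" -eq 1 ]; then
      "$INSTALL_CMD" --legacy-peer-deps && return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      "$SLEEP_CMD" "$backoff"
      backoff=$((backoff * 2))
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Demo with stand-ins: the fake installer fails three times, then succeeds.
calls=0
slept=""
fake_install() { calls=$((calls + 1)); [ "$calls" -gt 3 ]; }
fake_sleep() { slept="${slept:+$slept }$1"; }

INSTALL_CMD=fake_install
SLEEP_CMD=fake_sleep
if retry_install; then result=ok; else result=failed; fi
echo "result=$result calls=$calls slept=$slept"
```

In this sketch the fake installer is called four times (three plain attempts plus one `--legacy-peer-deps` fallback) and the loop sleeps for 5 and then 10 seconds. Whether a 20-second delay is ever reached depends on how the real loop counts attempts.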
Copilot AI review requested due to automatic review settings February 25, 2026 20:20

agents-workflows-bot (bot, Contributor) commented Feb 25, 2026

Automated Status Summary

Head SHA: 43324c3
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job Result Logs
(no jobs reported) ⏳ pending

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| --- | --- |
| Current | 93.12% |
| Baseline | 85.00% |
| Delta | +8.12% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| --- | --- | --- |
| src/cli_parser.py | 81.8% | 4 |
| src/percentile_calculator.py | 95.0% | 1 |
| src/aggregator.py | 95.0% | 2 |
| src/__init__.py | 100.0% | 0 |
| src/ndjson_parser.py | 100.0% | 0 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

Scope section missing from source issue.

Tasks

  • Tasks section missing from source issue.

Acceptance criteria

  • Acceptance criteria section missing from source issue.

stranske-keepalive (bot, Contributor) commented Feb 25, 2026

🤖 Keepalive Loop Status

PR #1656 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| --- | --- |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/2 complete |
| Timeout | 45 min (default) |
| Timeout usage | 3m elapsed (7%, 42m remaining) |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

stranske-keepalive (bot, Contributor) commented Feb 25, 2026

Keepalive Work Log
# Time (UTC) Agent Action Result Files Tasks Progress Commit Gate
0 2026-02-25 20:24:04 Codex wait (missing-agent-label-transient) skipped 0 0/2 success
0 2026-02-25 20:35:16 Codex wait (missing-agent-label-transient) skipped 0 0/2 success

Copilot AI (Contributor) left a comment

Pull request overview

Improves workflow resilience and CI signal quality by hardening the setup-api-client dependency install against transient npm failures and aligning verifier CI error-category casing with the shared error classifier.

Changes:

  • Add an exponential-backoff retry loop (plus --legacy-peer-deps fallback) around npm install in setup-api-client.
  • Fix ERROR_CATEGORIES constant casing from RESOURCE to resource in verifier CI query logic (in both main + template scripts).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/actions/setup-api-client/action.yml | Adds multi-attempt retry + backoff to npm install for pinned Octokit dependencies. |
| .github/scripts/verifier_ci_query.js | Uses ERROR_CATEGORIES.resource to correctly classify 404/workflow-not-found as “resource”. |
| templates/consumer-repo/.github/scripts/verifier_ci_query.js | Mirrors the same ERROR_CATEGORIES.resource fix in the consumer template script. |

Comment on lines +254 to +256
NPM_MAX_RETRIES=3
NPM_BACKOFF=5 # seconds; doubles each retry (5, 10, 20)
npm_installed=false

Copilot AI Feb 25, 2026

The backoff comment/behavior doesn’t match the stated plan: with NPM_MAX_RETRIES=3 the loop only sleeps twice (5s then 10s) and never reaches 20s, even though the comment says "(5, 10, 20)" and the PR description mentions 5/10/20. Either increase the retry count (e.g. 4 attempts / 3 retries) or adjust the comment/plan so it accurately reflects the actual delays.
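
The arithmetic behind this comment can be checked with a small helper. `backoff_delays` is a hypothetical function written for illustration only; it assumes the loop sleeps between attempts, never after the last one, and doubles the delay each time.

```shell
# Hypothetical helper, not part of the action: lists the sleeps performed
# by a loop that waits only between attempts and doubles the delay each time.
backoff_delays() {
  attempts=$1
  delay=${2:-5}        # initial backoff in seconds
  i=1
  out=""
  while [ "$i" -lt "$attempts" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "$out"
}

backoff_delays 3   # prints: 5 10
backoff_delays 4   # prints: 5 10 20
```

Under those assumptions, three attempts yield two sleeps, and a fourth attempt is needed before the 20-second delay is used.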

Comment on lines +279 to +280
fi
rm -f "$npm_output"

Copilot AI Feb 25, 2026

If the --legacy-peer-deps fallback install fails, its stderr is currently discarded (the temp file is removed without being logged). That makes diagnosing persistent failures harder, especially when the first failure was transient but the legacy attempt fails for a different reason. Capture and log the legacy attempt’s stderr (at least once) before continuing to the backoff retries.

Suggested change
- fi
- rm -f "$npm_output"
+ fi
+ npm_err_legacy=$(cat "$npm_output")
+ rm -f "$npm_output"
+ echo "::warning::npm install with --legacy-peer-deps failed: $npm_err_legacy"
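
One way to keep stderr from a failed command before the temp file is deleted is sketched below. `run_and_log_stderr` and `failing_install` are hypothetical names used only for this illustration; the surrounding logic in action.yml is not shown in this conversation, so its exact structure is an assumption.

```shell
# Hypothetical helper, not code from the PR: run a command, capture its
# stderr in a temp file, and emit it as a warning only if the command fails.
run_and_log_stderr() {
  label=$1
  shift
  err_file=$(mktemp)
  status=0
  "$@" 2>"$err_file" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "::warning::$label failed: $(cat "$err_file")"
  fi
  rm -f "$err_file"
  return "$status"
}

# Stand-in for an install that fails with a registry error.
failing_install() { echo "E403 Forbidden" >&2; return 1; }

run_and_log_stderr "npm install with --legacy-peer-deps" failing_install || true
```

Guarding the warning on the exit status avoids logging a failure message when the fallback actually succeeded. Multi-line stderr may need escaping before it is passed to a `::warning::` workflow command.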

Copilot AI (Contributor) commented Feb 25, 2026

@stranske I've opened a new pull request, #1657, to work on those changes. Once the pull request is ready, I'll request review from you.

…deps stderr

- Fix backoff comment: with 3 retries, only 2 sleeps occur (5s, 10s),
  not 3 (5, 10, 20) as previously stated
- Log stderr from --legacy-peer-deps fallback when it fails, so
  persistent failures are diagnosable in CI logs

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

stranske merged commit 2d20110 into main on Feb 25, 2026
180 of 181 checks passed
stranske deleted the claude/fix-task-completion-concerns-I1gRT branch on February 25, 2026 20:39

4 participants