Skip to content

ci: retry cosign sign on transient GHCR/Rekor failures#2100

Merged
Aureliolo merged 3 commits into
mainfrom
fix/cosign-sign-retry
May 24, 2026
Merged

ci: retry cosign sign on transient GHCR/Rekor failures#2100
Aureliolo merged 3 commits into
mainfrom
fix/cosign-sign-retry

Conversation

@Aureliolo
Copy link
Copy Markdown
Owner

Summary

The Sign image step in publish-apko-base and publish-image-loaded was the lone GHCR-touching call in the Docker pipeline without a transient-error retry. Every other GHCR interaction is wrapped (docker login via nick-fields/retry, docker push / docker manifest push via docker_push_with_retry.sh, cosign-installer via 3 continue-on-error attempts), but a cosign sign upload was a bare shell block whose only failure-handling branch was the createLogEntryConflict idempotency check.

This gap surfaced on the merge of #2098 (commit aedbba5a5), where the Publish Sandbox Base (apko) job failed mid-cosign sign with:

Error: signing [ghcr.io/aureliolo/synthorg-sandbox-base@sha256:f77d...]:
signing digest: failed to upload manifest:
GET https://github.com/-/v2/packages/container/.../referrers/sha256:
unexpected status code 502 Bad Gateway: <!DOCTYPE html>...Unicorn!

Note: the bytes bad gateway and 502 were both already in our shared transient regex; the regex just wasn't checking cosign output.

Changes

  • New helper script .github/scripts/cosign_sign_with_retry.sh (87 LOC). Wraps cosign sign --yes <ref> with the same 4-attempt / 15→30→60s exponential backoff budget as docker_push_with_retry.sh. Sources the shared transient regex via docker_push_with_retry.sh --print-transient-re so both helpers stay in lockstep. Preserves the createLogEntryConflict idempotency branch (checked BEFORE the transient check, so attempt-1 5xx → attempt-2 conflict resolves as success). Non-transient errors (auth, malformed digest, schema rejection) fail immediately, no retry.
  • publish-apko-base/action.yml Sign image step: 12-line inline block replaced with a 2-line helper call.
  • publish-image-loaded/action.yml Sign image step: 16-line inline block replaced with a 2-line helper call.
  • publish-image-loaded/action.yml comment cleanup (separate commit): stripped 3 pre-existing comment-rot violations surfaced by comment-quality-rot (date-stamped incident reference, PR back-ref, migration framing) per feedback_preexisting_fixes_always_in_scope.md.

Test plan

  • The original failing run (Publish Sandbox Base (apko) job, run 26362972921) was rerun via gh run rerun --failed; attempt 2 completed green, confirming the failure was a transient GHCR flake.
  • bash -n .github/scripts/cosign_sign_with_retry.sh clean.
  • The shared-regex sourcing pattern matches the existing docker_push_with_retry.sh callers (e.g. publish-image-loaded/action.yml:126); proven on every previous Docker run.
  • Coverage of other sign call sites verified: cli.yml:438 uses cosign sign-blob (different command, doesn't hit GHCR referrers); publish-image-retag/action.yml does not sign (digest carries the main-run signature forward).
  • Pre-commit + pre-push hooks green (ruff/mypy/pytest/eslint/go skipped: no files in those scopes; zizmor, em-dash, secret-scan, yaml-check, commitizen all passed).

Review coverage

Pre-PR pipeline ran the minimal viable agent set (per the user's "very NEEDED only" steer):

  • infra-reviewer: clean. All findings Minor or non-issues. Script logic, security, YAML correctness, and side-effect analysis all pass.
  • comment-quality-rot: 3 pre-existing violations in publish-image-loaded/action.yml (date-stamped incident, PR back-ref, migration framing) - all fixed in the second commit. My own additions are clean.

Commits

  1. 21238381d ci: retry cosign sign on transient GHCR/Rekor failures
  2. edb4a83e8 ci: strip pre-existing comment rot in publish-image-loaded action

Both commits land cleanly under squash-merge; PR title is the first-commit summary.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the reliability of the CI pipeline by implementing a robust retry mechanism for cosign image signing. By centralizing transient error handling into a reusable script that shares logic with existing docker push retries, the pipeline is now better equipped to handle intermittent GHCR and Rekor service interruptions without failing the entire workflow.

Highlights

  • New retry helper script: Introduced .github/scripts/cosign_sign_with_retry.sh to wrap cosign sign calls with a 4-attempt exponential backoff strategy for transient GHCR/Rekor failures.
  • Pipeline resilience: Updated publish-apko-base and publish-image-loaded actions to utilize the new retry helper, ensuring consistency with existing docker push retry logic.
  • Comment maintenance: Cleaned up pre-existing comment rot in publish-image-loaded/action.yml to improve documentation clarity and maintainability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 24, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 9ec17569-017b-4ade-bcd7-929fea2f868e

📥 Commits

Reviewing files that changed from the base of the PR and between edb4a83 and 67877f8.

📒 Files selected for processing (1)
  • .github/scripts/cosign_sign_with_retry.sh
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: Build Backend
  • GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
  • GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
  • GitHub Check: Dashboard Build
  • GitHub Check: Dashboard Test
  • GitHub Check: Dashboard Type Check
  • GitHub Check: Build Sidecar
  • GitHub Check: Build Web Assets (melange)
  • GitHub Check: Build Sandbox Base (apko)
  • GitHub Check: Test Unit (shard 4)
  • GitHub Check: Test Conformance (SQLite)
  • GitHub Check: Test Unit (shard 3)
  • GitHub Check: Test Unit (shard 2)
  • GitHub Check: Test Unit (shard 1)
  • GitHub Check: Test E2E
  • GitHub Check: Dashboard Lint
  • GitHub Check: Test Integration (shard 3)
  • GitHub Check: Type Check
  • GitHub Check: Test Integration (shard 4)
  • GitHub Check: Test Integration (shard 2)
  • GitHub Check: Test Integration (shard 1)
  • GitHub Check: Runtime Stats Freshness Gate
  • GitHub Check: Lint
  • GitHub Check: Build Preview
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (actions)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
.github/scripts/cosign_sign_with_retry.sh (1)

40-45: LGTM!

Also applies to: 66-66, 78-81


Walkthrough

This PR adds resilience to image signing workflows by introducing a dedicated cosign_sign_with_retry.sh script that wraps cosign sign calls with bounded retries and exponential backoff for transient failures, while treating already-signed digests (createLogEntryConflict) as idempotent success. The script shares transient-failure detection logic with docker_push_with_retry.sh. Two GitHub Actions (publish-apko-base and publish-image-loaded) are updated to call the new wrapper and documentation comments are clarified about digest pinning and retry semantics.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding retry logic for cosign sign operations on transient GHCR/Rekor failures, which aligns with the primary purpose of the PR.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, providing clear context about the gap being addressed, specific failure example, detailed change list, test plan, and review coverage.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 24, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a centralized retry mechanism for image signing by replacing inline cosign commands with a new helper script, cosign_sign_with_retry.sh. This script implements exponential backoff and handles transient registry or transparency log errors, such as createLogEntryConflict, to improve CI reliability. The review feedback suggests several improvements for shell script robustness: explicitly invoking helper scripts with bash to handle missing execute bits, using safer printf patterns to prevent misinterpretation of hyphenated output, and adding a guard clause to ensure the transient error regex is not empty before execution.

# both scripts in lockstep so a new transient signature added in one
# place automatically protects every signing call too.
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
TRANSIENT_RE="$("$SCRIPT_DIR/docker_push_with_retry.sh" --print-transient-re)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For robustness in CI environments where file permissions might not be preserved or correctly set in the repository, it is safer to explicitly invoke the helper script using bash. This ensures the script runs even if the execute bit is missing.

Suggested change
TRANSIENT_RE="$("$SCRIPT_DIR/docker_push_with_retry.sh" --print-transient-re)"
TRANSIENT_RE="$(bash "$SCRIPT_DIR/docker_push_with_retry.sh" --print-transient-re)"

# Idempotency branch: a re-sign that lost the createLogEntry race is
# success, not a transient error. Check this BEFORE the regex so
# attempt 1 -> 5xx -> attempt 2 -> conflict resolves cleanly.
if printf '%s' "$out" | grep -q 'createLogEntryConflict'; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using printf -- '%s\n' is more robust than printf '%s'. The -- prevents printf from interpreting the content of $out as options if it happens to start with a hyphen, and the trailing newline ensures better compatibility with grep across different implementations.

Suggested change
if printf '%s' "$out" | grep -q 'createLogEntryConflict'; then
if printf -- '%s\n' "$out" | grep -q 'createLogEntryConflict'; then

exit "$rc"
fi

if printf '%s' "$out" | grep -qiE "$TRANSIENT_RE"; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If TRANSIENT_RE is empty (e.g., if the sourcing from docker_push_with_retry.sh fails silently or returns an empty string), grep -E "" will match every line, causing the script to retry on non-transient errors like authentication failures. Adding a check to ensure the regex is non-empty prevents this behavior.

Suggested change
if printf '%s' "$out" | grep -qiE "$TRANSIENT_RE"; then
if [[ -n "$TRANSIENT_RE" ]] && printf -- '%s\n' "$out" | grep -qiE "$TRANSIENT_RE"; then

@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 24, 2026 15:16 — with GitHub Actions Inactive
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 24, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.12%. Comparing base (f187b31) to head (67877f8).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2100      +/-   ##
==========================================
- Coverage   87.13%   87.12%   -0.01%     
==========================================
  Files        2251     2251              
  Lines      130311   130311              
==========================================
- Hits       113541   113535       -6     
- Misses      16755    16761       +6     
  Partials       15       15              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

G1 (CRITICAL): invoke sibling docker_push_with_retry.sh via bash (script ships 100644, bare invocation would fail Permission denied on Linux runner).
G2: printf -- '%s\n' for hyphen-safety and grep compat.
G3: guard grep -E against empty TRANSIENT_RE (would otherwise match every line and retry non-transient errors).
@Aureliolo Aureliolo merged commit da9422a into main May 24, 2026
89 checks passed
@Aureliolo Aureliolo deleted the fix/cosign-sign-retry branch May 24, 2026 15:35
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 24, 2026 15:35 — with GitHub Actions Inactive
Aureliolo pushed a commit that referenced this pull request May 24, 2026
<!-- HIGHLIGHTS_START -->
## Highlights

> _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub
Models). Commit-based changelog below._

### What you'll notice
- New brownfield codebase intake mode supports merger and acquisition
scenarios.
- Added deep CEO interview feature to improve project charter creation.
- Introduced mission control and flight recorder operator cockpit for
better operational oversight.
- Research mode added for enhanced exploratory work.
- Runtime services now log safety-spine state at boot for clearer
diagnostics.

### What's new
- Research mode feature enables deeper data exploration.
- CEO interview integration helps shape project charters.
- Mission control and flight recorder cockpit introduced for operational
tracking.

### Under the hood
- Improved codebase modularity with module-size gates and lint
tightening.
- Added __init__.py to 21 test directories for better test discovery.
- Promoted six transitive dependencies to direct dependencies for
clarity.
- Split codespell ignore list into vocabulary and source renames.
- Decomposed oversized web utilities, hooks, and libraries for
maintainability.
- Enhanced CI with Lychee link checker integration and retry logic for
cosign signing.
- Sharded unit and integration tests and added Postgres service
container in CI.
- Updated infrastructure and web dependencies; maintained lock files.

<!-- HIGHLIGHTS_END -->

:robot: I have created a release *beep* *boop*
---


##
[0.8.8](v0.8.7...v0.8.8)
(2026-05-24)


### Features

* brownfield codebase intake (merger/acquisition entry mode)
([#2042](#2042))
([e287621](e287621)),
closes [#1975](#1975)
* deep CEO interview to project charter
([#2045](#2045))
([904f2fb](904f2fb))
* mission control + flight recorder operator cockpit
([#2044](#2044))
([1c2660b](1c2660b))
* research mode
([#2041](#2041))
([f81a5ac](f81a5ac)),
closes [#1989](#1989)
* surface safety-spine state in runtime-services boot log (closes
[#2096](#2096))
([#2097](#2097))
([f187b31](f187b31))


### Refactoring

* add __init__.py to 21 leaf test directories (INP001)
([#2081](#2081))
([2592118](2592118)),
closes [#2064](#2064)
* codebase modularity (1/4) - module-size gates + lint tightening +
tools ([#2078](#2078))
([556fbd9](556fbd9)),
closes [#2047](#2047)
[#2040](#2040)
* promote 6 transitive deps to direct deps
([#2083](#2083))
([adedc6a](adedc6a))
* split codespell ignore-words-list into vocab + source renames
([#2085](#2085))
([917d98a](917d98a)),
closes [#2074](#2074)
* **web:** PR A foundation, decompose oversized utils/hooks/lib
([#2092](#2092))
([#2098](#2098))
([aedbba5](aedbba5))


### CI/CD

* exclude slsa.dev from lychee (transient timeout on canonical badge)
([#2090](#2090))
([346c51d](346c51d))
* fix paths-filter shallow-clone race and scorecard allowlist
([#2089](#2089))
([7cd7ce8](7cd7ce8))
* refresh .test_durations.{unit,integration}
([#2087](#2087))
([ddf2d86](ddf2d86))
* retry cosign sign on transient GHCR/Rekor failures
([#2100](#2100))
([da9422a](da9422a))
* shard test-unit + test-integration, sysmon coverage, Postgres service
container ([#2080](#2080))
([0768787](0768787))
* wire Lychee link-checker (workflow + installer + pre-push hook)
([#2084](#2084))
([1c0694a](1c0694a))


### Maintenance

* Lock file maintenance
([#2086](#2086))
([a78810a](a78810a))
* Update Infrastructure dependencies
([#2055](#2055))
([041ad8b](041ad8b))
* Update Web dependencies
([#2054](#2054))
([4d57b9a](4d57b9a))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Aureliolo added a commit that referenced this pull request May 24, 2026
…2102)

## Why

The `chore(main): release 0.8.8` push (commit `e2a23707`) failed because
`CLI Test (macos-latest)` could not fetch
`github.com/klauspost/compress@v1.18.6` from `proxy.golang.org`:

```
github.com/klauspost/compress@v1.18.6: Get "https://storage.googleapis.com/proxy-golang-org-prod/...":
  dial tcp [2607:f8b0:4005:816::201b]:443: connect: no route to host
```

`proxy.golang.org` 307-redirects module ZIPs to GCS; the GCS hop
occasionally fails at the TCP layer. These are NOT 404/410, so Go's
built-in `GOPROXY=...,direct` fallback does not fire and the whole
command aborts. Downstream, `Finalize Release` preflighted, saw `CLI`
failure, and short-circuited (`Triggering workflow failed; skipping`),
so `v0.8.8` stayed in Draft until a manual rerun.

## What

New composite `.github/actions/go-mod-download-retry` wraps `go mod
download -x` in a retry loop with the project's standard posture (8
attempts, 15s base, 120s cap; ~10 min budget; matches
`cla.yml::gh_api_retry` and the `docker.yml` cosign retry from #2100).

Wired into all 7 jobs in `cli.yml` (lint, test, build, bench, vuln,
fuzz, release): one prefetch step after each `setup-go`. Once modules
are on disk, the real `go test` / `go build` / `go vet` runs once with
zero network traffic, so the retry can never mask a real test or build
failure.

`cli-vuln` also got an inline retry around `go install
golang.org/x/vuln/cmd/govulncheck@v1.3.0`, since govulncheck pulls a
separate module graph not covered by the cli prefetch.

## Scope verification

Only `cli.yml` runs Go commands across the entire `.github/workflows/`
tree (`grep` confirmed), so this fully closes the blast radius.

## Tests

actionlint, yamllint, and zizmor all pass locally on both files. No
runtime tests exist for workflow files; correctness is validated by the
next push exercising every job.

## Review

Pre-reviewed by `infra-reviewer`. One Minor finding (inline `break` vs
composite `exit 0` divergence) addressed in the second commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant