ci: retry cosign sign on transient GHCR/Rekor failures#2100
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the reliability of the CI pipeline by implementing a robust retry mechanism for cosign image signing. By centralizing transient error handling into a reusable script that shares logic with existing docker push retries, the pipeline is now better equipped to handle intermittent GHCR and Rekor service interruptions without failing the entire workflow. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI (base), Organization UI (inherited) Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (1)
WalkthroughThis PR adds resilience to image signing workflows by introducing a dedicated 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Code Review
This pull request introduces a centralized retry mechanism for image signing by replacing inline cosign commands with a new helper script, cosign_sign_with_retry.sh. This script implements exponential backoff and handles transient registry or transparency log errors, such as createLogEntryConflict, to improve CI reliability. The review feedback suggests several improvements for shell script robustness: explicitly invoking helper scripts with bash to handle missing execute bits, using safer printf patterns to prevent misinterpretation of hyphenated output, and adding a guard clause to ensure the transient error regex is not empty before execution.
| # both scripts in lockstep so a new transient signature added in one | ||
| # place automatically protects every signing call too. | ||
| SCRIPT_DIR="$(dirname "$(readlink -f "$0")")" | ||
| TRANSIENT_RE="$("$SCRIPT_DIR/docker_push_with_retry.sh" --print-transient-re)" |
There was a problem hiding this comment.
For robustness in CI environments where file permissions might not be preserved or correctly set in the repository, it is safer to explicitly invoke the helper script using bash. This ensures the script runs even if the execute bit is missing.
| TRANSIENT_RE="$("$SCRIPT_DIR/docker_push_with_retry.sh" --print-transient-re)" | |
| TRANSIENT_RE="$(bash "$SCRIPT_DIR/docker_push_with_retry.sh" --print-transient-re)" |
| # Idempotency branch: a re-sign that lost the createLogEntry race is | ||
| # success, not a transient error. Check this BEFORE the regex so | ||
| # attempt 1 -> 5xx -> attempt 2 -> conflict resolves cleanly. | ||
| if printf '%s' "$out" | grep -q 'createLogEntryConflict'; then |
There was a problem hiding this comment.
Using printf -- '%s\n' is more robust than printf '%s'. The -- prevents printf from interpreting the content of $out as options if it happens to start with a hyphen, and the trailing newline ensures better compatibility with grep across different implementations.
| if printf '%s' "$out" | grep -q 'createLogEntryConflict'; then | |
| if printf -- '%s\n' "$out" | grep -q 'createLogEntryConflict'; then |
| exit "$rc" | ||
| fi | ||
|
|
||
| if printf '%s' "$out" | grep -qiE "$TRANSIENT_RE"; then |
There was a problem hiding this comment.
If TRANSIENT_RE is empty (e.g., if the sourcing from docker_push_with_retry.sh fails silently or returns an empty string), grep -E "" will match every line, causing the script to retry on non-transient errors like authentication failures. Adding a check to ensure the regex is non-empty prevents this behavior.
| if printf '%s' "$out" | grep -qiE "$TRANSIENT_RE"; then | |
| if [[ -n "$TRANSIENT_RE" ]] && printf -- '%s\n' "$out" | grep -qiE "$TRANSIENT_RE"; then |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2100 +/- ##
==========================================
- Coverage 87.13% 87.12% -0.01%
==========================================
Files 2251 2251
Lines 130311 130311
==========================================
- Hits 113541 113535 -6
- Misses 16755 16761 +6
Partials 15 15 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
G1 (CRITICAL): invoke sibling docker_push_with_retry.sh via bash (script ships 100644, bare invocation would fail Permission denied on Linux runner). G2: printf -- '%s\n' for hyphen-safety and grep compat. G3: guard grep -E against empty TRANSIENT_RE (would otherwise match every line and retry non-transient errors).
<!-- HIGHLIGHTS_START --> ## Highlights > _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub Models). Commit-based changelog below._ ### What you'll notice - New brownfield codebase intake mode supports merger and acquisition scenarios. - Added deep CEO interview feature to improve project charter creation. - Introduced mission control and flight recorder operator cockpit for better operational oversight. - Research mode added for enhanced exploratory work. - Runtime services now log safety-spine state at boot for clearer diagnostics. ### What's new - Research mode feature enables deeper data exploration. - CEO interview integration helps shape project charters. - Mission control and flight recorder cockpit introduced for operational tracking. ### Under the hood - Improved codebase modularity with module-size gates and lint tightening. - Added __init__.py to 21 test directories for better test discovery. - Promoted six transitive dependencies to direct dependencies for clarity. - Split codespell ignore list into vocabulary and source renames. - Decomposed oversized web utilities, hooks, and libraries for maintainability. - Enhanced CI with Lychee link checker integration and retry logic for cosign signing. - Sharded unit and integration tests and added Postgres service container in CI. - Updated infrastructure and web dependencies; maintained lock files. <!-- HIGHLIGHTS_END --> :robot: I have created a release *beep* *boop* --- ## [0.8.8](v0.8.7...v0.8.8) (2026-05-24) ### Features * brownfield codebase intake (merger/acquisition entry mode) ([#2042](#2042)) ([e287621](e287621)), closes [#1975](#1975) * deep CEO interview to project charter ([#2045](#2045)) ([904f2fb](904f2fb)) * mission control + flight recorder operator cockpit ([#2044](#2044)) ([1c2660b](1c2660b)) * research mode ([#2041](#2041)) ([f81a5ac](f81a5ac)), closes [#1989](#1989) * surface safety-spine state in runtime-services boot log (closes [#2096](#2096)) ([#2097](#2097)) ([f187b31](f187b31)) ### Refactoring * add __init__.py to 21 leaf test directories (INP001) ([#2081](#2081)) ([2592118](2592118)), closes [#2064](#2064) * codebase modularity (1/4) - module-size gates + lint tightening + tools ([#2078](#2078)) ([556fbd9](556fbd9)), closes [#2047](#2047) [#2040](#2040) * promote 6 transitive deps to direct deps ([#2083](#2083)) ([adedc6a](adedc6a)) * split codespell ignore-words-list into vocab + source renames ([#2085](#2085)) ([917d98a](917d98a)), closes [#2074](#2074) * **web:** PR A foundation, decompose oversized utils/hooks/lib ([#2092](#2092)) ([#2098](#2098)) ([aedbba5](aedbba5)) ### CI/CD * exclude slsa.dev from lychee (transient timeout on canonical badge) ([#2090](#2090)) ([346c51d](346c51d)) * fix paths-filter shallow-clone race and scorecard allowlist ([#2089](#2089)) ([7cd7ce8](7cd7ce8)) * refresh .test_durations.{unit,integration} ([#2087](#2087)) ([ddf2d86](ddf2d86)) * retry cosign sign on transient GHCR/Rekor failures ([#2100](#2100)) ([da9422a](da9422a)) * shard test-unit + test-integration, sysmon coverage, Postgres service container ([#2080](#2080)) ([0768787](0768787)) * wire Lychee link-checker (workflow + installer + pre-push hook) ([#2084](#2084)) ([1c0694a](1c0694a)) ### Maintenance * Lock file maintenance ([#2086](#2086)) ([a78810a](a78810a)) * Update Infrastructure dependencies ([#2055](#2055)) ([041ad8b](041ad8b)) * Update Web dependencies ([#2054](#2054)) ([4d57b9a](4d57b9a)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…2102) ## Why The `chore(main): release 0.8.8` push (commit `e2a23707`) failed because `CLI Test (macos-latest)` could not fetch `github.com/klauspost/compress@v1.18.6` from `proxy.golang.org`: ``` github.com/klauspost/compress@v1.18.6: Get "https://storage.googleapis.com/proxy-golang-org-prod/...": dial tcp [2607:f8b0:4005:816::201b]:443: connect: no route to host ``` `proxy.golang.org` 307-redirects module ZIPs to GCS; the GCS hop occasionally fails at the TCP layer. These are NOT 404/410, so Go's built-in `GOPROXY=...,direct` fallback does not fire and the whole command aborts. Downstream, `Finalize Release` preflighted, saw `CLI` failure, and short-circuited (`Triggering workflow failed; skipping`), so `v0.8.8` stayed in Draft until a manual rerun. ## What New composite `.github/actions/go-mod-download-retry` wraps `go mod download -x` in a retry loop with the project's standard posture (8 attempts, 15s base, 120s cap; ~10 min budget; matches `cla.yml::gh_api_retry` and the `docker.yml` cosign retry from #2100). Wired into all 7 jobs in `cli.yml` (lint, test, build, bench, vuln, fuzz, release): one prefetch step after each `setup-go`. Once modules are on disk, the real `go test` / `go build` / `go vet` runs once with zero network traffic, so the retry can never mask a real test or build failure. `cli-vuln` also got an inline retry around `go install golang.org/x/vuln/cmd/govulncheck@v1.3.0`, since govulncheck pulls a separate module graph not covered by the cli prefetch. ## Scope verification Only `cli.yml` runs Go commands across the entire `.github/workflows/` tree (`grep` confirmed), so this fully closes the blast radius. ## Tests actionlint, yamllint, and zizmor all pass locally on both files. No runtime tests exist for workflow files; correctness is validated by the next push exercising every job. ## Review Pre-reviewed by `infra-reviewer`. One Minor finding (inline `break` vs composite `exit 0` divergence) addressed in the second commit.
Summary
The
Sign imagestep inpublish-apko-baseandpublish-image-loadedwas the lone GHCR-touching call in the Docker pipeline without a transient-error retry. Every other GHCR interaction is wrapped (docker loginvianick-fields/retry,docker push/docker manifest pushviadocker_push_with_retry.sh, cosign-installer via 3continue-on-errorattempts), but acosign signupload was a bare shell block whose only failure-handling branch was thecreateLogEntryConflictidempotency check.This gap surfaced on the merge of #2098 (commit
aedbba5a5), where thePublish Sandbox Base (apko)job failed mid-cosign signwith:Note: the bytes
bad gatewayand502were both already in our shared transient regex; the regex just wasn't checking cosign output.Changes
.github/scripts/cosign_sign_with_retry.sh(87 LOC). Wrapscosign sign --yes <ref>with the same 4-attempt / 15→30→60s exponential backoff budget asdocker_push_with_retry.sh. Sources the shared transient regex viadocker_push_with_retry.sh --print-transient-reso both helpers stay in lockstep. Preserves thecreateLogEntryConflictidempotency branch (checked BEFORE the transient check, so attempt-1 5xx → attempt-2 conflict resolves as success). Non-transient errors (auth, malformed digest, schema rejection) fail immediately, no retry.publish-apko-base/action.ymlSign image step: 12-line inline block replaced with a 2-line helper call.publish-image-loaded/action.ymlSign image step: 16-line inline block replaced with a 2-line helper call.publish-image-loaded/action.ymlcomment cleanup (separate commit): stripped 3 pre-existing comment-rot violations surfaced bycomment-quality-rot(date-stamped incident reference, PR back-ref, migration framing) perfeedback_preexisting_fixes_always_in_scope.md.Test plan
Publish Sandbox Base (apko)job, run26362972921) was rerun viagh run rerun --failed; attempt 2 completed green, confirming the failure was a transient GHCR flake.bash -n .github/scripts/cosign_sign_with_retry.shclean.docker_push_with_retry.shcallers (e.g.publish-image-loaded/action.yml:126); proven on every previous Docker run.cli.yml:438usescosign sign-blob(different command, doesn't hit GHCR referrers);publish-image-retag/action.ymldoes not sign (digest carries the main-run signature forward).Review coverage
Pre-PR pipeline ran the minimal viable agent set (per the user's "very NEEDED only" steer):
publish-image-loaded/action.yml(date-stamped incident, PR back-ref, migration framing) - all fixed in the second commit. My own additions are clean.Commits
21238381dci: retry cosign sign on transient GHCR/Rekor failuresedb4a83e8ci: strip pre-existing comment rot in publish-image-loaded actionBoth commits land cleanly under squash-merge; PR title is the first-commit summary.