fix(ci): retry Docker push on Go net/http deadline + cancellation errors #1877
Merged
The retry wrapper at .github/scripts/docker_push_with_retry.sh was classifying GHCR HTTP-request timeouts as non-transient and bailing on attempt 1. PR 1876 hit this on the arm64 leg of Publish Backend Base (amd64 had pushed cleanly seconds earlier, same job, same credentials):

    Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

That is the canonical Go net/http client-side timeout string Docker / buildx emit when the registry does not return headers within the per-request deadline. It is the same class of transient as i/o timeout, which the wrapper already retries; the wrapper just was not matching the specific phrasing.

Add four Go-net/http deadline + cancellation signatures to TRANSIENT_RE:

- context deadline exceeded
- Client.Timeout exceeded (Client\.Timeout to keep the dot literal)
- timeout awaiting response headers
- request canceled

Verified with the actual failure string from PR 1876 and with a battery of fail-fast inputs (denied: insufficient_scope, tls: bad certificate, denied: name is reserved) that still classify as non-transient.

The retag-inspect retry loop in .github/actions/publish-image-retag/action.yml sources the regex via --print-transient-re, so the same coverage now applies there too.
Adds ``timeout awaiting response body`` to TRANSIENT_RE alongside the headers variant. Go ``net/http`` emits the body form when the upload has already started streaming but the registry stops responding before the body finishes. ``docker push`` is idempotent (the wrapper docstring covers this), so re-pushing on a body-timeout is safe and closes the matching headers/body pair.

Pre-PR review pipeline: 4 agents (docs-consistency, comment-quality-rot, comment-analyzer, infra-reviewer), 3 findings, 1 fix applied, 2 skipped with recorded reasons in _audit/pre-pr-review/triage.md.
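The reason for keeping the dot literal in `Client\.Timeout` can be seen in a minimal sketch. The variable names and the `matches` helper below are hypothetical, and each pattern is reduced to that single alternative rather than copied from the wrapper's actual `TRANSIENT_RE`.

```shell
#!/usr/bin/env bash
# Hypothetical one-alternative patterns, for illustration only.
escaped='Client\.Timeout exceeded'    # "\." matches a literal dot only
unescaped='Client.Timeout exceeded'   # "." matches any single character

# Hypothetical helper: $1 = extended regex, $2 = text.
# Succeeds when the regex matches anywhere in the text.
matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Failure string quoted in the commit message above.
real='Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
# Made-up lookalike that should not be treated as a Go client timeout.
lookalike='ClientXTimeout exceeded'

matches "$escaped" "$real" && echo "escaped pattern: real failure matches"
matches "$escaped" "$lookalike" || echo "escaped pattern: lookalike does not match"
matches "$unescaped" "$lookalike" && echo "unescaped pattern: lookalike matches (false positive)"
```

Only the unescaped form accepts the lookalike, which is the false match the escaped dot is meant to rule out.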
Contributor
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Code Review
This pull request updates the docker_push_with_retry.sh script by expanding the TRANSIENT_RE regular expression to include several new transient error signatures, such as "context deadline exceeded", "Client.Timeout exceeded", and various "timeout awaiting response" variants. These additions are intended to capture Go net/http client-side timeouts emitted by Docker or buildx when GHCR fails to respond within a deadline, ensuring that the script correctly identifies these as retryable transient failures. I have no feedback to provide as there were no review comments.
Aureliolo pushed a commit that referenced this pull request on May 12, 2026
<!-- HIGHLIGHTS_START -->
## Highlights

> _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub Models). Commit-based changelog below._

### What you'll notice

- Password and secret fields now include an eye-toggle for easier visibility control.
- Containers running without probes are shown as healthy in the doctor command.
- Unloaded and missing PR-review agents are restored and available again.

### What's new

- Gate baseline protection is enhanced to block em-dashes during writing.

### Under the hood

- Replaced Atlas with yoyo-migrations for persistence management.
- Refactored codebase extensively, including context-bound user authentication and registry pattern for enums.
- Improved linting by draining magic number usages and tightening mock and constant checks.
- Updated CI to retry Docker pushes on network timeout errors.
- Updated apko lockfiles for dependency management.
<!-- HIGHLIGHTS_END -->

:robot: I have created a release *beep* *boop*

---

## [0.8.3](v0.8.2...v0.8.3) (2026-05-12)

### Features

* harden gate baseline protection + block em-dashes at write time ([#1860](#1860)) ([b41f151](b41f151))
* **web:** eye-toggle on every password / secret field ([#1873](#1873)) ([9070387](9070387))

### Bug Fixes

* **ci:** retry Docker push on Go net/http deadline + cancellation errors ([#1877](#1877)) ([23a0bfa](23a0bfa))
* **cli:** render running-no-probe containers as healthy in doctor ([#1870](#1870)) ([6263795](6263795))
* restore unloaded and missing PR-review agents ([#1875](#1875)) ([db004fd](db004fd)), closes [#1871](#1871)

### Refactoring

* bind authenticated user via ContextVar ([#1858](#1858)) ([57ed0b4](57ed0b4))
* code-structure cleanup (sub-tasks D + F + G + H + I) ([#1859](#1859)) ([362e5c8](362e5c8))
* convert enum dispatch to registry pattern ([#1854](#1854)) ([e90550e](e90550e))
* drain no_magic_numbers baseline to zero via Final hoists ([#1856](#1856) phase 2) ([#1872](#1872)) ([ec8109e](ec8109e))
* drain pagination + loop-init + kill-switch baselines ([#1857](#1857)) ([#1868](#1868)) ([115c3c2](115c3c2))
* **persistence:** replace Atlas with yoyo-migrations ([#1876](#1876)) ([1b7e975](1b7e975)), closes [#1874](#1874)
* protocols audit follow-up (REVIEW + fold pass) ([#1869](#1869)) ([af33ddb](af33ddb))
* protocols audit follow-up REMOVE pass ([#1867](#1867)) ([dd1eebc](dd1eebc))
* tighten check_mock_spec gate, add mock_of[T], drain baseline ([#1862](#1862)) ([240a253](240a253))
* tighten check_no_magic_numbers for named module constants ([#1856](#1856)) ([#1866](#1866)) ([90c933b](90c933b))

### CI/CD

* update apko lockfiles ([#1863](#1863)) ([2bd32e6](2bd32e6))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
Closes a coverage gap in `.github/scripts/docker_push_with_retry.sh` that caused PR #1876's squash-merge push to fail on the `Publish Backend Base (apko)` arm64 leg. The retry wrapper saw a real transient GHCR HTTP timeout, did not match it against `TRANSIENT_RE`, classified it as non-transient, and bailed on attempt 1 of 4.

Failure that motivated the fix:

    Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

That string is Go's `net/http` client-side request-timeout shape, semantically identical to `i/o timeout` (already retried) but syntactically different. The wrapper's regex did not match it.

Five Go `net/http` deadline / cancellation signatures are now in `TRANSIENT_RE`:

- `context deadline exceeded`: Go context deadline elapsed before headers received
- `Client\.Timeout exceeded`: Go `http.Client.Timeout` field exceeded (literal dot kept to avoid `ClientXTimeout` false-match)
- `timeout awaiting response headers`: header read stalled
- `timeout awaiting response body`: body read stalled mid-stream (`docker push` is idempotent, so retrying after a partial-upload body timeout is safe)
- `request canceled`: context cancellation; in GitHub Actions there is no interactive `docker push` cancellation vector

The wrapper's `--print-transient-re` consumer at `.github/actions/publish-image-retag/action.yml:109` reuses the same regex (stripping `manifest unknown` and `blob upload invalid`, neither of which is affected here), so the new patterns automatically propagate to the retag-inspect retry loop.

Verification
Spot-tested the regex against the real PR-#1876 failure string and against a battery of patterns that must stay fail-fast:
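A spot test of this kind might look like the sketch below. The regex is an illustrative reconstruction that contains only `i/o timeout` and the five signatures named in this PR; the real `TRANSIENT_RE` in the wrapper has other alternatives that are not shown here, and the `classify` helper is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative reconstruction, NOT the wrapper's actual TRANSIENT_RE.
TRANSIENT_RE='i/o timeout|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response (headers|body)|request canceled'

# Hypothetical helper: prints "transient" or "fail-fast" for one error line.
classify() {
  if printf '%s\n' "$1" | grep -Eq -- "$TRANSIENT_RE"; then
    echo transient
  else
    echo fail-fast
  fi
}

# Must be retried: the failure string from PR #1876.
classify 'Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'

# Must stay fail-fast: auth, TLS, and naming errors.
classify 'denied: insufficient_scope'
classify 'tls: bad certificate'
classify 'denied: name is reserved'
```

Run as written, this prints `transient` once, followed by `fail-fast` three times.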
`bash -n .github/scripts/docker_push_with_retry.sh` clean.

Review coverage
Pre-PR review pipeline: 4 agents (`docs-consistency`, `comment-quality-rot`, `comment-analyzer`, `infra-reviewer`). 3 findings: 1 fix applied (`timeout awaiting response body`), 2 skipped with recorded reasons (`request canceled` substring-match risk is zero on GitHub Actions; regex line length cannot be reduced without re-quoting). Pattern correctness against the PR #1876 failure string ✓; no auth/permission masking ✓; retag-inspect propagation ✓; backoff math (~1m45s worst case) acceptable ✓.

Full triage at `_audit/pre-pr-review/triage.md` in the worktree (not committed).

Test plan
The patched wrapper will be exercised on the next Docker workflow run that hits a GHCR timeout. Until then, the regex changes are covered by the inline shell tests above; no Python / Go / Web test surface to add.
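For readers unfamiliar with the classify-then-retry pattern, the following sketch shows how a 4-attempt loop could reach the ~1m45s worst-case backoff cited under Review coverage: a doubling delay starting at 15 seconds gives 15 + 30 + 60 = 105 seconds. The 15-second base, the `retry_transient` name, the reduced regex, and the loop shape are all assumptions for illustration; the wrapper's real schedule is not shown in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: base delay, function name, and regex are assumptions.
TRANSIENT_RE='context deadline exceeded|i/o timeout'   # reduced, illustrative
MAX_ATTEMPTS=4
BASE_DELAY="${BASE_DELAY:-15}"   # seconds; doubled after each failed attempt

# Runs "$@" up to MAX_ATTEMPTS times, retrying only when the command's
# combined stdout/stderr matches TRANSIENT_RE.
retry_transient() {
  local attempt=1 delay=$BASE_DELAY output
  while :; do
    if output=$("$@" 2>&1); then
      printf '%s\n' "$output"
      return 0
    fi
    printf '%s\n' "$output" >&2
    if ! printf '%s\n' "$output" | grep -Eq -- "$TRANSIENT_RE"; then
      echo "non-transient failure on attempt $attempt; not retrying" >&2
      return 1
    fi
    if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
      echo "still failing after $MAX_ATTEMPTS attempts" >&2
      return 1
    fi
    echo "transient failure on attempt $attempt; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A call would take the form `retry_transient docker push <image-ref>`, where `<image-ref>` is a placeholder. Errors such as `denied: insufficient_scope` fall through the regex and fail on the first attempt, so a misconfigured credential is not masked by retries.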