fix(ci): retry Docker push on Go net/http deadline + cancellation errors #1877

Merged
Aureliolo merged 2 commits into main from fix/docker-push-retry-context-deadline
May 12, 2026

@Aureliolo
Owner

Summary

Closes a coverage gap in .github/scripts/docker_push_with_retry.sh that caused PR #1876's squash-merge push to fail on the Publish Backend Base (apko) arm64 leg. The retry wrapper saw a real transient GHCR HTTP timeout, did not match it against TRANSIENT_RE, classified it as non-transient, and bailed on attempt 1 of 4.

Failure that motivated the fix:

Get "https://ghcr.io/v2/": context deadline exceeded
(Client.Timeout exceeded while awaiting headers)

That string is Go's net/http client-side request-timeout shape, semantically identical to i/o timeout (already retried) but syntactically different. The wrapper's regex did not match it.

Five Go net/http deadline / cancellation signatures are now in TRANSIENT_RE:

  • context deadline exceeded — Go context deadline elapsed before headers received
  • Client\.Timeout exceeded — Go http.Client.Timeout field exceeded (literal dot kept to avoid ClientXTimeout false-match)
  • timeout awaiting response headers — header read stalled
  • timeout awaiting response body — body read stalled mid-stream (docker push is idempotent, so retrying after a partial-upload body timeout is safe)
  • request canceled — context cancellation; in GitHub Actions there is no interactive docker push cancellation vector
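
The effect of these five alternatives can be sketched in a few lines of shell. This is a minimal illustration, not the wrapper's code: the regex below is a trimmed-down stand-in (the real TRANSIENT_RE contains more alternatives), `classify` is a hypothetical helper, and matching with `grep -E` is an assumption about how the wrapper tests its output.

```shell
#!/usr/bin/env bash
# Hypothetical, reduced regex: the pre-existing "i/o timeout" plus the five
# signatures listed above. The real TRANSIENT_RE has further alternatives.
TRANSIENT_RE='i/o timeout|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response headers|timeout awaiting response body|request canceled'

# classify is an illustrative helper, not a function from the wrapper.
# It prints TRANSIENT when the message matches the regex, else FAIL-FAST.
classify() {
  if printf '%s\n' "$1" | grep -Eq -- "$TRANSIENT_RE"; then
    echo "TRANSIENT"
  else
    echo "FAIL-FAST"
  fi
}

# The failure string quoted in the Summary matches two alternatives.
classify 'Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
# The escaped dot keeps a look-alike token from matching.
classify 'ClientXTimeout exceeded'
# Auth failures stay non-transient.
classify 'denied: insufficient_scope'
```

With an unescaped dot, the second input would match, because `.` in an extended regular expression matches any single character.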

The wrapper's --print-transient-re consumer at .github/actions/publish-image-retag/action.yml:109 reuses the same regex (stripping manifest unknown and blob upload invalid, neither of which is affected here), so the new patterns automatically propagate to the retag-inspect retry loop.
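
One way a consumer could derive such a reduced regex is sketched below. Everything here is an assumption for illustration: `print_transient_re` stands in for the wrapper's `--print-transient-re` output with a made-up value, and the approach only works if the regex is a flat list of top-level alternatives separated by `|`. How the action actually strips the two patterns is not shown in this PR.

```shell
#!/usr/bin/env bash
# Stand-in for `docker_push_with_retry.sh --print-transient-re`.
# The printed value is hypothetical and much shorter than the real regex.
print_transient_re() {
  printf '%s\n' 'manifest unknown|blob upload invalid|i/o timeout|context deadline exceeded|request canceled'
}

# Split on "|", drop the two alternatives that must not be retried by the
# inspect loop, then join the remaining alternatives back together.
INSPECT_RE=$(print_transient_re \
  | tr '|' '\n' \
  | grep -Fxv -e 'manifest unknown' -e 'blob upload invalid' \
  | paste -sd '|' -)

printf '%s\n' "$INSPECT_RE"
```

Because the reduced regex is computed from the full one, any alternative added to the source regex reaches the consumer without a second edit, which is the propagation described above.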

Verification

Spot-tested the regex against the real PR #1876 failure string and against a battery of error strings that must stay fail-fast:

TRANSIENT:  Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
TRANSIENT:  net/http: timeout awaiting response body
TRANSIENT:  net/http: request canceled while waiting for connection
TRANSIENT:  Client.Timeout exceeded while reading body
TRANSIENT:  i/o timeout                          (existing pattern, unchanged)
FAIL-FAST:  denied: insufficient_scope
FAIL-FAST:  tls: bad certificate
FAIL-FAST:  denied: name is reserved
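
A spot check like the one above could be made repeatable with a small table-driven script. The sketch below is an assumption about how such a check might look, not the commands actually run for this PR; the regex is again a reduced stand-in and `check` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Reduced stand-in for the wrapper's TRANSIENT_RE (hypothetical value).
TRANSIENT_RE='i/o timeout|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response headers|timeout awaiting response body|request canceled'

failures=0

# check EXPECTED INPUT: classify INPUT and count it when the result differs.
check() {
  expected=$1
  input=$2
  if printf '%s\n' "$input" | grep -Eq -- "$TRANSIENT_RE"; then
    actual=TRANSIENT
  else
    actual=FAIL-FAST
  fi
  if [ "$actual" != "$expected" ]; then
    printf 'MISMATCH (got %s, expected %s): %s\n' "$actual" "$expected" "$input"
    failures=$((failures + 1))
  fi
}

check TRANSIENT 'Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
check TRANSIENT 'net/http: timeout awaiting response body'
check TRANSIENT 'net/http: request canceled while waiting for connection'
check TRANSIENT 'Client.Timeout exceeded while reading body'
check TRANSIENT 'i/o timeout'
check FAIL-FAST 'denied: insufficient_scope'
check FAIL-FAST 'tls: bad certificate'
check FAIL-FAST 'denied: name is reserved'

echo "mismatches: $failures"
```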

bash -n .github/scripts/docker_push_with_retry.sh reports no syntax errors.

Review coverage

Pre-PR review pipeline: 4 agents (docs-consistency, comment-quality-rot, comment-analyzer, infra-reviewer).

  • docs-consistency: no drift, no doc page describes this script.
  • comment-quality-rot: zero violations; CI-infra file exempted from origin-citation rules; new comments are functional documentation, not narrative.
  • comment-analyzer: comment is appropriate as-is, explains WHY, follows existing comment-block precedent.
  • infra-reviewer: 3 findings: 1 fixed (added timeout awaiting response body), 2 skipped with recorded reasons (request canceled substring-match risk is zero on GitHub Actions; regex line length cannot be reduced without re-quoting). Pattern correctness against the PR #1876 failure string ✓; no auth/permission masking ✓; retag-inspect propagation ✓; backoff math (~1m45s worst case) acceptable ✓.
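
The worst-case figure can be reproduced with a short sketch of a retry loop. The numbers are assumptions: the PR states 4 attempts and roughly 1m45s, but not the delays themselves. A 15-second base that doubles after each failure is one schedule consistent with that total (15 + 30 + 60 = 105 seconds). `retry` and `worst_case_sleep` are hypothetical helpers, and the regex is a reduced stand-in.

```shell
#!/usr/bin/env bash
MAX_ATTEMPTS=4                 # "attempt 1 of 4" in the PR text
BASE_DELAY=${BASE_DELAY:-15}   # hypothetical; the real delays are not stated
TRANSIENT_RE='i/o timeout|context deadline exceeded|Client\.Timeout exceeded'

# worst_case_sleep BASE ATTEMPTS: total seconds slept when every attempt
# fails transiently (a sleep happens between attempts, not after the last).
worst_case_sleep() {
  total=0
  delay=$1
  i=1
  while [ "$i" -lt "$2" ]; do
    total=$((total + delay))
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "$total"
}

# retry CMD [ARGS...]: rerun CMD on transient failures with doubling delay;
# stop immediately when the output does not match TRANSIENT_RE.
retry() {
  attempt=1
  delay=$BASE_DELAY
  while :; do
    if output=$("$@" 2>&1); then
      printf '%s\n' "$output"
      return 0
    fi
    printf '%s\n' "$output" >&2
    if ! printf '%s\n' "$output" | grep -Eq -- "$TRANSIENT_RE"; then
      echo "non-transient failure, not retrying" >&2
      return 1
    fi
    if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

worst_case_sleep 15 4   # prints 105, i.e. 1m45s
```

Under this schedule, widening the regex costs at most the 105 seconds of sleep on a push that fails transiently on every attempt.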

Full triage at _audit/pre-pr-review/triage.md in the worktree (not committed).

Test plan

The patched wrapper will be exercised on the next Docker workflow run that hits a GHCR timeout. Until then, the regex changes are covered by the shell spot tests above; there is no Python / Go / Web test surface to add.

Aureliolo added 2 commits May 12, 2026 12:25
The retry wrapper at .github/scripts/docker_push_with_retry.sh was
classifying GHCR HTTP-request timeouts as non-transient and bailing on
attempt 1. PR 1876 hit this on the arm64 leg of Publish Backend Base
(amd64 had pushed cleanly seconds earlier, same job, same credentials):

  Get "https://ghcr.io/v2/": context deadline exceeded
  (Client.Timeout exceeded while awaiting headers)

That is the canonical Go net/http client-side timeout string Docker /
buildx emit when the registry does not return headers within the
per-request deadline. It is the same class of transient as i/o timeout,
which the wrapper already retries; the wrapper just was not matching the
specific phrasing.

Add four Go-net/http deadline + cancellation signatures to TRANSIENT_RE:

- context deadline exceeded
- Client.Timeout exceeded   (Client\.Timeout to keep the dot literal)
- timeout awaiting response headers
- request canceled

Verified with the actual failure string from PR 1876 and with a battery
of fail-fast inputs (denied: insufficient_scope, tls: bad certificate,
denied: name is reserved) that still classify as non-transient.

The retag-inspect retry loop in
.github/actions/publish-image-retag/action.yml sources the regex via
--print-transient-re, so the same coverage now applies there too.

Adds ``timeout awaiting response body`` to TRANSIENT_RE alongside the
headers variant. Go ``net/http`` emits the body form when the upload
has already started streaming but the registry stops responding before
the body finishes. ``docker push`` is idempotent (the wrapper docstring
covers this), so re-pushing on a body-timeout is safe and closes the
matching headers/body pair.

Pre-PR review pipeline: 4 agents (docs-consistency, comment-quality-rot,
comment-analyzer, infra-reviewer), 3 findings, 1 fix applied, 2 skipped
with recorded reasons in _audit/pre-pr-review/triage.md.
@github-actions
Contributor

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 43a196ef-93a9-49c8-91f2-4ec580c7e914

📥 Commits

Reviewing files that changed from the base of the PR and between 1b7e975 and 656c6e6.

📒 Files selected for processing (1)
  • .github/scripts/docker_push_with_retry.sh
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
.github/scripts/docker_push_with_retry.sh (1)

37-47: Good expansion of transient timeout coverage.

The additions on Line 47 align with the stated failure mode and correctly broaden retryable Go net/http timeout/cancellation signatures without weakening fail-fast behavior for non-transient errors.


Walkthrough

This PR expands the TRANSIENT_RE regex pattern in the Docker push retry script to recognize additional transient error signatures from Go's net/http library. The pattern now detects context deadline exceeded, Client.Timeout exceeded, timeouts awaiting response headers or body, and request canceled errors. These errors will now be treated as retryable transient failures using the existing exponential backoff logic, improving reliability of Docker image pushes when encountering timeout-related registry failures.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: expanding the Docker push retry logic to handle Go net/http deadline and cancellation errors, which directly matches the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the motivation, the specific patterns added, verification steps, and pre-PR review coverage. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the docker_push_with_retry.sh script by expanding the TRANSIENT_RE regular expression to include several new transient error signatures, such as "context deadline exceeded", "Client.Timeout exceeded", and various "timeout awaiting response" variants. These additions are intended to capture Go net/http client-side timeouts emitted by Docker or buildx when GHCR fails to respond within a deadline, ensuring that the script correctly identifies these as retryable transient failures. I have no feedback to provide as there were no review comments.

@Aureliolo Aureliolo merged commit 23a0bfa into main May 12, 2026
41 checks passed
@Aureliolo Aureliolo deleted the fix/docker-push-retry-context-deadline branch May 12, 2026 10:53
Aureliolo pushed a commit that referenced this pull request May 12, 2026
<!-- HIGHLIGHTS_START -->
## Highlights

> _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub
Models). Commit-based changelog below._

### What you'll notice
- Password and secret fields now include an eye-toggle for easier
visibility control.
- Containers running without probes are shown as healthy in the doctor
command.
- Unloaded and missing PR-review agents are restored and available
again.

### What's new
- Gate baseline protection is enhanced to block em-dashes during
writing.

### Under the hood
- Replaced Atlas with yoyo-migrations for persistence management.
- Refactored codebase extensively, including context-bound user
authentication and registry pattern for enums.
- Improved linting by draining magic number usages and tightening mock
and constant checks.
- Updated CI to retry Docker pushes on network timeout errors.
- Updated apko lockfiles for dependency management.

<!-- HIGHLIGHTS_END -->

:robot: I have created a release *beep* *boop*
---


##
[0.8.3](v0.8.2...v0.8.3)
(2026-05-12)


### Features

* harden gate baseline protection + block em-dashes at write time
([#1860](#1860))
([b41f151](b41f151))
* **web:** eye-toggle on every password / secret field
([#1873](#1873))
([9070387](9070387))


### Bug Fixes

* **ci:** retry Docker push on Go net/http deadline + cancellation
errors ([#1877](#1877))
([23a0bfa](23a0bfa))
* **cli:** render running-no-probe containers as healthy in doctor
([#1870](#1870))
([6263795](6263795))
* restore unloaded and missing PR-review agents
([#1875](#1875))
([db004fd](db004fd)),
closes [#1871](#1871)


### Refactoring

* bind authenticated user via ContextVar
([#1858](#1858))
([57ed0b4](57ed0b4))
* code-structure cleanup (sub-tasks D + F + G + H + I)
([#1859](#1859))
([362e5c8](362e5c8))
* convert enum dispatch to registry pattern
([#1854](#1854))
([e90550e](e90550e))
* drain no_magic_numbers baseline to zero via Final hoists
([#1856](#1856) phase 2)
([#1872](#1872))
([ec8109e](ec8109e))
* drain pagination + loop-init + kill-switch baselines
([#1857](#1857))
([#1868](#1868))
([115c3c2](115c3c2))
* **persistence:** replace Atlas with yoyo-migrations
([#1876](#1876))
([1b7e975](1b7e975)),
closes [#1874](#1874)
* protocols audit follow-up (REVIEW + fold pass)
([#1869](#1869))
([af33ddb](af33ddb))
* protocols audit follow-up REMOVE pass
([#1867](#1867))
([dd1eebc](dd1eebc))
* tighten check_mock_spec gate, add mock_of[T], drain baseline
([#1862](#1862))
([240a253](240a253))
* tighten check_no_magic_numbers for named module constants
([#1856](#1856))
([#1866](#1866))
([90c933b](90c933b))


### CI/CD

* update apko lockfiles
([#1863](#1863))
([2bd32e6](2bd32e6))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>