From 71ccc044fd0f9f2153bea4ad229e1ae18364a477 Mon Sep 17 00:00:00 2001
From: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Date: Tue, 12 May 2026 12:25:53 +0200
Subject: [PATCH 1/2] fix(ci): retry Docker push on Go net/http deadline + cancellation errors

The retry wrapper at .github/scripts/docker_push_with_retry.sh was
classifying GHCR HTTP-request timeouts as non-transient and bailing on
attempt 1. PR 1876 hit this on the arm64 leg of Publish Backend Base
(amd64 had pushed cleanly seconds earlier, same job, same credentials):

    Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

That is the canonical Go net/http client-side timeout string Docker /
buildx emit when the registry does not return headers within the
per-request deadline. It is the same class of transient as i/o timeout,
which the wrapper already retries; the wrapper just was not matching the
specific phrasing.

Add four Go-net/http deadline + cancellation signatures to TRANSIENT_RE:

- context deadline exceeded
- Client.Timeout exceeded (Client\.Timeout to keep the dot literal)
- timeout awaiting response headers
- request canceled

Verified with the actual failure string from PR 1876 and with a battery
of fail-fast inputs (denied: insufficient_scope, tls: bad certificate,
denied: name is reserved) that still classify as non-transient.

The retag-inspect retry loop in
.github/actions/publish-image-retag/action.yml sources the regex via
--print-transient-re, so the same coverage now applies there too.
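The verification described above can be approximated with a small shell check. This is a sketch under stated assumptions, not the wrapper's own code: the `classify` helper is hypothetical, and matching with `grep -E` (case-sensitive) is an assumption about how the script applies TRANSIENT_RE. The regex itself is copied from this patch.

```shell
#!/bin/sh
set -eu

# Regex as it stands after this patch (copied from docker_push_with_retry.sh).
TRANSIENT_RE='page is taking too long|unknown blob|blob unknown|blob upload invalid|manifest unknown|received unexpected HTTP status: 5[0-9]{2}|HTTP/[0-9.]+ 5[0-9]{2}|HTTP 5[0-9]{2}|status: 5[0-9]{2}|429 Too Many Requests|temporarily unavailable|server is currently unable|service unavailable|bad gateway|gateway time-?out|i/o timeout|tls handshake|connection reset|connection refused|EOF|unexpected EOF|read: connection|net/http: TLS handshake|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response headers|request canceled'

# Hypothetical helper: prints "transient" or "non-transient" for one error line.
# Assumption: the wrapper matches with extended-regex semantics (grep -E).
classify() {
  if printf '%s\n' "$1" | grep -Eq -- "$TRANSIENT_RE"; then
    echo transient
  else
    echo non-transient
  fi
}

# Failure string quoted in this commit message: should now be retried.
classify 'Get "https://ghcr.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
# Fail-fast inputs named in this commit message: must stay non-transient.
classify 'denied: insufficient_scope'
classify 'tls: bad certificate'
classify 'denied: name is reserved'
```

Run as written, this prints `transient` once followed by `non-transient` three times. The fail-fast cases matter as much as the new match: a pattern that is too broad would turn permanent auth or TLS configuration errors into pointless retries.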
---
 .github/scripts/docker_push_with_retry.sh | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/.github/scripts/docker_push_with_retry.sh b/.github/scripts/docker_push_with_retry.sh
index 1518f882c6..10ebc87d99 100644
--- a/.github/scripts/docker_push_with_retry.sh
+++ b/.github/scripts/docker_push_with_retry.sh
@@ -34,7 +34,14 @@ set -euo pipefail
 # failures); a bare `tls: ` is intentionally NOT included because it would
 # also match non-transient configuration errors like
 # `tls: failed to verify certificate` or `tls: bad certificate`.
-TRANSIENT_RE='page is taking too long|unknown blob|blob unknown|blob upload invalid|manifest unknown|received unexpected HTTP status: 5[0-9]{2}|HTTP/[0-9.]+ 5[0-9]{2}|HTTP 5[0-9]{2}|status: 5[0-9]{2}|429 Too Many Requests|temporarily unavailable|server is currently unable|service unavailable|bad gateway|gateway time-?out|i/o timeout|tls handshake|connection reset|connection refused|EOF|unexpected EOF|read: connection|net/http: TLS handshake'
+#
+# `context deadline exceeded` / `Client.Timeout exceeded` / `timeout awaiting
+# response headers` / `request canceled` cover Go ``net/http`` client-side
+# timeout strings emitted by Docker / buildx when GHCR fails to respond to a
+# request within the per-request deadline. These are the canonical transient
+# signatures `i/o timeout` misses on the GHCR HTTP path: the underlying
+# socket may be healthy while the HTTP response just never arrives in time.
+TRANSIENT_RE='page is taking too long|unknown blob|blob unknown|blob upload invalid|manifest unknown|received unexpected HTTP status: 5[0-9]{2}|HTTP/[0-9.]+ 5[0-9]{2}|HTTP 5[0-9]{2}|status: 5[0-9]{2}|429 Too Many Requests|temporarily unavailable|server is currently unable|service unavailable|bad gateway|gateway time-?out|i/o timeout|tls handshake|connection reset|connection refused|EOF|unexpected EOF|read: connection|net/http: TLS handshake|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response headers|request canceled'
 
 # Discovery flag: callers that need to share the same regex (for example the
 # inline retag-inspect retry loop, which must drop a couple of patterns the

From 656c6e64fba09e0ca744d7936a2691aeab6266e9 Mon Sep 17 00:00:00 2001
From: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Date: Tue, 12 May 2026 12:48:22 +0200
Subject: [PATCH 2/2] fix(ci): also retry on Go net/http response-body timeout

Adds ``timeout awaiting response body`` to TRANSIENT_RE alongside the
headers variant. Go ``net/http`` emits the body form when the upload has
already started streaming but the registry stops responding before the
body finishes. ``docker push`` is idempotent (the wrapper docstring
covers this), so re-pushing on a body-timeout is safe and closes the
matching headers/body pair.

Pre-PR review pipeline: 4 agents (docs-consistency, comment-quality-rot,
comment-analyzer, infra-reviewer), 3 findings, 1 fix applied, 2 skipped
with recorded reasons in _audit/pre-pr-review/triage.md.
---
 .github/scripts/docker_push_with_retry.sh | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/.github/scripts/docker_push_with_retry.sh b/.github/scripts/docker_push_with_retry.sh
index 10ebc87d99..628d538f46 100644
--- a/.github/scripts/docker_push_with_retry.sh
+++ b/.github/scripts/docker_push_with_retry.sh
@@ -36,12 +36,15 @@ set -euo pipefail
 # `tls: failed to verify certificate` or `tls: bad certificate`.
 #
 # `context deadline exceeded` / `Client.Timeout exceeded` / `timeout awaiting
-# response headers` / `request canceled` cover Go ``net/http`` client-side
-# timeout strings emitted by Docker / buildx when GHCR fails to respond to a
-# request within the per-request deadline. These are the canonical transient
-# signatures `i/o timeout` misses on the GHCR HTTP path: the underlying
-# socket may be healthy while the HTTP response just never arrives in time.
-TRANSIENT_RE='page is taking too long|unknown blob|blob unknown|blob upload invalid|manifest unknown|received unexpected HTTP status: 5[0-9]{2}|HTTP/[0-9.]+ 5[0-9]{2}|HTTP 5[0-9]{2}|status: 5[0-9]{2}|429 Too Many Requests|temporarily unavailable|server is currently unable|service unavailable|bad gateway|gateway time-?out|i/o timeout|tls handshake|connection reset|connection refused|EOF|unexpected EOF|read: connection|net/http: TLS handshake|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response headers|request canceled'
+# response headers` / `timeout awaiting response body` / `request canceled`
+# cover Go ``net/http`` client-side timeout strings emitted by Docker / buildx
+# when GHCR fails to respond to a request within the per-request deadline.
+# These are the canonical transient signatures `i/o timeout` misses on the
+# GHCR HTTP path: the underlying socket may be healthy while the HTTP
+# response just never arrives in time. Both the headers and body variants
+# are kept so a stall after the upload starts streaming also retries --
+# `docker push` is idempotent, so re-pushing on a body-timeout is safe.
+TRANSIENT_RE='page is taking too long|unknown blob|blob unknown|blob upload invalid|manifest unknown|received unexpected HTTP status: 5[0-9]{2}|HTTP/[0-9.]+ 5[0-9]{2}|HTTP 5[0-9]{2}|status: 5[0-9]{2}|429 Too Many Requests|temporarily unavailable|server is currently unable|service unavailable|bad gateway|gateway time-?out|i/o timeout|tls handshake|connection reset|connection refused|EOF|unexpected EOF|read: connection|net/http: TLS handshake|context deadline exceeded|Client\.Timeout exceeded|timeout awaiting response headers|timeout awaiting response body|request canceled'
 
 # Discovery flag: callers that need to share the same regex (for example the
 # inline retag-inspect retry loop, which must drop a couple of patterns the