Skip to content

fix: retry reconcile on transient errors during reconcile #6299

Merged
arkodg merged 13 commits intoenvoyproxy:mainfrom
patrostkowski:fix/transient-errors-6284
Jul 3, 2025
Merged

fix: retry reconcile on transient errors during reconcile #6299
arkodg merged 13 commits intoenvoyproxy:mainfrom
patrostkowski:fix/transient-errors-6284

Conversation

@patrostkowski
Copy link
Contributor

@patrostkowski patrostkowski commented Jun 12, 2025

What type of PR is this?

fix: #6284

What this PR does / why we need it:

This PR:

  • Introduces isTransientError to detect transient Kubernetes errors and enable proper reconciliation retries.
  • Ignores non-transient errors from the provider layer and publishes them to the translation layer - the errors mostly come from missing referenced resources or misconfigurations that can be caught later in the Gateway API translator and surfaced in the resource status.

Description updated by @zhaohuabing on Jul 3 2025.

Which issue(s) this PR fixes:

Fixes #6284

Release Notes: Yes

@patrostkowski patrostkowski requested a review from a team as a code owner June 12, 2025 18:06
@patrostkowski patrostkowski force-pushed the fix/transient-errors-6284 branch from c45000a to a69693a Compare June 12, 2025 19:15
@codecov
Copy link

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 27.44186% with 156 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (a394c61) to head (80b953e).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/controller.go 27.44% 150 Missing and 6 partials ⚠️

❌ Your patch status has failed because the patch coverage (27.44%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6299      +/-   ##
==========================================
- Coverage   70.92%   70.73%   -0.19%     
==========================================
  Files         220      220              
  Lines       37260    37354      +94     
==========================================
- Hits        26426    26423       -3     
- Misses       9289     9383      +94     
- Partials     1545     1548       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@guydc
Copy link
Contributor

guydc commented Jun 13, 2025

@arkodg, @patrostkowski
My question here would be: In case of a non-transient error, we will still store inconsistent resources (partial? invalid?) that may lead to corruption of XDS. Is there a way to avoid this, e.g. by using previously-stored resources for the affected GWC if they exist? This would be more aligned and compatible with previous behavior, where ALL XDS is essentially not updated in case of ANY error in this area of provider?

@patrostkowski patrostkowski force-pushed the fix/transient-errors-6284 branch from a69693a to ba30677 Compare June 16, 2025 05:50
@guydc
Copy link
Contributor

guydc commented Jun 18, 2025

Notes from the community meeting:

  • Provider layer should publish an accurate state-of-the-world.
  • If transient errors make it impossible to build an accurate snapshot, the reconciliation should be retried by returning early with an error
  • If validation errors are encountered, state should be published, so that status can be updated and user notified about issues with resources.
  • If users want to implement alternative strategies for handling invalid configuration (e.g. pausing XDS updates), that should be implemented in other layers.

@arkodg
Copy link
Contributor

arkodg commented Jun 19, 2025

can we also log the transient error

@arkodg
Copy link
Contributor

arkodg commented Jun 24, 2025

@patrostkowski still working on this ? we'd like to get this into the patch release scheduled to be out next week

@zhaohuabing zhaohuabing marked this pull request as draft June 26, 2025 02:40
@zhaohuabing zhaohuabing marked this pull request as draft June 26, 2025 02:40
@zhaohuabing
Copy link
Member

It seems @patrostkowski is currently unavailable. I’ll take over from here since we need this in the upcoming patch.

@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 3 times, most recently from fcd7f69 to 40ec0c9 Compare June 26, 2025 03:31
@patrostkowski
Copy link
Contributor Author

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

@zhaohuabing
Copy link
Member

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

No worries. I'll take care of it.

@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 2 times, most recently from 5eeaad0 to 5313be2 Compare June 26, 2025 11:29
@zhaohuabing zhaohuabing self-assigned this Jun 26, 2025
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 2 times, most recently from 74541e2 to 27b84b4 Compare June 26, 2025 13:14
@zhaohuabing zhaohuabing changed the title fix: add isTransientError helper to classify retryable errors during reconcile fix: retry reconcile onvtransient errors during reconcile and use the last state to allow Envoy continue serving traffic Jun 26, 2025
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 3 times, most recently from 6f9a261 to 9b633af Compare June 27, 2025 01:11
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch from a3b57a9 to 63c2eee Compare June 27, 2025 02:33
service := new(corev1.Service)
err := r.client.Get(ctx, types.NamespacedName{Namespace: string(*backendRef.Namespace), Name: string(backendRef.Name)}, service)
if err != nil {
if isTransientError(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is going to hard to remember to add this for every client call

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can figure out a better approach to handle this later.

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch from 9926890 to 8704fd8 Compare July 3, 2025 01:08
@zhaohuabing zhaohuabing requested a review from arkodg July 3, 2025 01:12
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
@zhaohuabing zhaohuabing requested a review from arkodg July 3, 2025 01:28
arkodg
arkodg previously approved these changes Jul 3, 2025
Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks

@arkodg arkodg requested review from a team and shahar-h and removed request for shahar-h July 3, 2025 01:36
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
@arkodg arkodg merged commit 71ce56f into envoyproxy:main Jul 3, 2025
27 of 28 checks passed
zhaohuabing added a commit to shawnh2/gateway that referenced this pull request Jul 4, 2025
…#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* address comment

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
(cherry picked from commit 71ce56f)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
zhaohuabing added a commit that referenced this pull request Jul 4, 2025
* fix(translator): ext-proc full duplex streamed trailers and validation (#6323)
* fix ext proc validation and trailer management for full duplex streamed mode

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* feat: disable automountServiceAccountToken for proxy and ratelimit (#6364)

Signed-off-by: Jeff Davis <mr.jefedavis@gmail.com>

* bugfix: make EnvoyPatchPolicy able to replace telemetry cluster (#6367)

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* feat: add validation of section name for Gateway listener (#6343)

* add validation of section name

Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>

* update error status reason

Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>

* refactor: define as function of validate section name for gateway listener

Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix: add configMap indexers for EEP reconciler (#6369)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: use buildEndpointType for access and tracing (#6370)

Signed-off-by: zirain <zirain2009@gmail.com>

* fix: default accesslog not working (#6441)
* fix default accesslog

Signed-off-by: zirain <zirain2009@gmail.com>

* release notes

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* chore: fix cve (#6446)

* fix cve

Signed-off-by: zirain <zirain2009@gmail.com>

* lint

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>

* fix: Do not set backendRequestTimeout when Retries are set (#6421)

* fix: Do not set backendRequestTimeout when Retries are set

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

* fix: update comment

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

---------

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

* gatewayapi: don't append gwcResource if there's invalid GatewayClass (#6379)

* gatewayapi: don't process gloabal resources when acceptedGateways is 0

Signed-off-by: zirain <zirain2009@gmail.com>

* update

Signed-off-by: zirain <zirain2009@gmail.com>

* fix test

Signed-off-by: zirain <zirain2009@gmail.com>

* don't skip gateways

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix testdata

Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix k8s provider controller

Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix: retry reconcile on transient errors during reconcile  (#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* address comment

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
(cherry picked from commit 71ce56f)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix: fix bug in hostname overlap detection (#6332)

fix bug in hostname overlap detection

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
(cherry picked from commit e78e268)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix telemetry with host port not working (#6460)

Signed-off-by: zirain <zirain2009@gmail.com>
(cherry picked from commit c0a2ce7)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* bugfix: BackendTlsPolicy should not reference across namespace (#6309)

* bugfix: BackendTlsPolicy should not reference across namespace

Signed-off-by: zirain <zirain2009@gmail.com>
(cherry picked from commit 9925189)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>
Signed-off-by: Jeff Davis <mr.jefedavis@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>
Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Guy Daich <guy.daich@sap.com>
Co-authored-by: Jeff Davis <mr.jefedavis@gmail.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Co-authored-by: Kota Kimura <86363983+kkk777-7@users.noreply.github.com>
Co-authored-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Co-authored-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Co-authored-by: Patryk Rostkowski <48490105+patrostkowski@users.noreply.github.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
tjvdmolen pushed a commit to tjvdmolen/gateway that referenced this pull request Jul 11, 2025
…#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* address comment

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Tjeerd Jan van der Molen <34071+tjvdmolen@users.noreply.github.com>
shawnh2 added a commit to shawnh2/gateway that referenced this pull request Sep 15, 2025
* fix(translator): ext-proc full duplex streamed trailers and validation (envoyproxy#6323)
* fix ext proc validation and trailer management for full duplex streamed mode

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* feat: disable automountServiceAccountToken for proxy and ratelimit (envoyproxy#6364)

Signed-off-by: Jeff Davis <mr.jefedavis@gmail.com>

* bugfix: make EnvoyPatchPolicy able to replace telemetry cluster (envoyproxy#6367)

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* feat: add validation of section name for Gateway listener (envoyproxy#6343)

* add validation of section name

Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>

* update error status reason

Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>

* refactor: define as function of validate section name for gateway listener

Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix: add configMap indexers for EEP reconciler (envoyproxy#6369)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: use buildEndpointType for access and tracing (envoyproxy#6370)

Signed-off-by: zirain <zirain2009@gmail.com>

* fix: default accesslog not working (envoyproxy#6441)
* fix default accesslog

Signed-off-by: zirain <zirain2009@gmail.com>

* release notes

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* chore: fix cve (envoyproxy#6446)

* fix cve

Signed-off-by: zirain <zirain2009@gmail.com>

* lint

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>

* fix: Do not set backendRequestTimeout when Retries are set (envoyproxy#6421)

* fix: Do not set backendRequestTimeout when Retries are set

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

* fix: update comment

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

---------

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

* gatewayapi: don't append gwcResource if there's invalid GatewayClass (envoyproxy#6379)

* gatewayapi: don't process gloabal resources when acceptedGateways is 0

Signed-off-by: zirain <zirain2009@gmail.com>

* update

Signed-off-by: zirain <zirain2009@gmail.com>

* fix test

Signed-off-by: zirain <zirain2009@gmail.com>

* don't skip gateways

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix testdata

Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix k8s provider controller

Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix: retry reconcile on transient errors during reconcile  (envoyproxy#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* address comment

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
(cherry picked from commit 71ce56f)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix: fix bug in hostname overlap detection (envoyproxy#6332)

fix bug in hostname overlap detection

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
(cherry picked from commit e78e268)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix telemetry with host port not working (envoyproxy#6460)

Signed-off-by: zirain <zirain2009@gmail.com>
(cherry picked from commit c0a2ce7)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* bugfix: BackendTlsPolicy should not reference across namespace (envoyproxy#6309)

* bugfix: BackendTlsPolicy should not reference across namespace

Signed-off-by: zirain <zirain2009@gmail.com>
(cherry picked from commit 9925189)
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>
Signed-off-by: Jeff Davis <mr.jefedavis@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: kkk777-7 <kota.kimura0725@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>
Signed-off-by: Patryk Rostkowski <patrostkowski@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Guy Daich <guy.daich@sap.com>
Co-authored-by: Jeff Davis <mr.jefedavis@gmail.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Co-authored-by: Kota Kimura <86363983+kkk777-7@users.noreply.github.com>
Co-authored-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Co-authored-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Co-authored-by: Patryk Rostkowski <48490105+patrostkowski@users.noreply.github.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Errors during reconcile causes a creation of partial xDS configuraion

7 participants