Skip to content

feat(translator): remove absent healthy endpoints#6478

Merged
arkodg merged 2 commits intoenvoyproxy:mainfrom
guydc:feat-btp-ignore-health
Jul 10, 2025
Merged

feat(translator): remove absent healthy endpoints#6478
arkodg merged 2 commits intoenvoyproxy:mainfrom
guydc:feat-btp-ignore-health

Conversation

@guydc
Copy link
Contributor

@guydc guydc commented Jul 7, 2025

What type of PR is this?

What this PR does / why we need it:
Removes absent (no longer discovered) endpoints, without waiting for their health to fail as well.

Which issue(s) this PR fixes:

Fixes #6463

Release Notes: Yes

@guydc guydc requested a review from a team as a code owner July 7, 2025 17:59
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do without an API field ?
are there use cases where the API has removed the endpoint, but we want the data plane to keep using it ?

Copy link
Contributor Author

@guydc guydc Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The envoy rationale for this (default) behavior is outlined here:

Host absent / health check OK:
Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.

When envoy depends on successful DNS resolution rather than a control plane to discover endpoints, there's a greater risk, and some users may prefer to route to stale endpoints, as long as they are known to be healthy, as mitigation to discovery failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the discovery service is not providing endpoints but DNS is, how will this setting help ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discovery works differently in these cases. XDS-based discovery is inherently more resilient (control plane is typically a local component, xds has a bunch of built-in guardrails like caching). There are several examples of DNS-based resolution failures that lead to traffic drops:

I agree that the envoy default is not intuitive. We can enable this by default and see if users are interested in the other option as time goes on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested out a few behaviors:

  • NXDOMAIN: in both cases endpoints are removed (even when considering health)
  • Timeout to DNS: in both cases endpoints are retained (even when ignoring health)

I'm fine with changing the default behavior for STRICT_DNS as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for testing this out @guydc

+1 on setting this by default, and considering adding a field to opt if if users really need it

@guydc guydc force-pushed the feat-btp-ignore-health branch from 3eaabf5 to 70509f3 Compare July 7, 2025 20:29
@codecov
Copy link

codecov bot commented Jul 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.77%. Comparing base (4140f61) to head (6ff2651).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6478      +/-   ##
==========================================
+ Coverage   70.74%   70.77%   +0.02%     
==========================================
  Files         220      220              
  Lines       37594    37594              
==========================================
+ Hits        26596    26607      +11     
+ Misses       9431     9424       -7     
+ Partials     1567     1563       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@guydc guydc changed the title feat(translator): support removal of absent healthy endpoints feat(translator): remove absent healthy endpoints Jul 8, 2025
Signed-off-by: Guy Daich <guy.daich@sap.com>
@guydc guydc force-pushed the feat-btp-ignore-health branch from 9bf3cb7 to d835a1b Compare July 8, 2025 18:18
Signed-off-by: Guy Daich <guy.daich@sap.com>
@arkodg arkodg added this to the v1.5.0-rc.1 Release milestone Jul 8, 2025
Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks

@arkodg arkodg requested review from a team July 9, 2025 00:48
Copy link
Member

@zhaohuabing zhaohuabing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

@arkodg arkodg merged commit 6b699c8 into envoyproxy:main Jul 10, 2025
44 of 47 checks passed
tjvdmolen pushed a commit to tjvdmolen/gateway that referenced this pull request Jul 11, 2025
* feat(translator): remove absent healthy endpoints

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: Tjeerd Jan van der Molen <34071+tjvdmolen@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ignore Host Health for DNS clusters

3 participants