feat(translator): remove absent healthy endpoints#6478
arkodg merged 2 commits into envoyproxy:main
api/v1alpha1/healthcheck_types.go
Can we do without an API field?
Are there use cases where the API has removed the endpoint, but we want the data plane to keep using it?
The envoy rationale for this (default) behavior is outlined here:
Host absent / health check OK:
Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.
When Envoy relies on DNS resolution rather than a control plane to discover endpoints, the risk of a discovery failure is greater. Some users may prefer to keep routing to stale endpoints, as long as they are known to be healthy, to mitigate such failures.
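To make the trade-off concrete, here is a minimal sketch of the routing decision per host. It is a toy model, not Envoy or Envoy Gateway code: the `host` type, the `decide` function, and the `action` values are hypothetical names, and the `ignoreHealthOnRemoval` parameter is assumed to correspond to Envoy's `ignore_health_on_host_removal` cluster option.

```go
package main

import "fmt"

// host is a hypothetical view of one upstream endpoint.
type host struct {
	discovered bool // still present in the latest discovery (DNS/EDS) result
	healthy    bool // last active health check passed
}

// action is what the proxy does with a host in this toy model.
type action string

const (
	route   action = "route"
	noRoute action = "do not route"
	remove  action = "remove"
)

// decide models the behavior discussed above. ignoreHealthOnRemoval is
// assumed to mirror Envoy's ignore_health_on_host_removal option.
func decide(h host, ignoreHealthOnRemoval bool) action {
	switch {
	case h.discovered && h.healthy:
		return route
	case h.discovered && !h.healthy:
		return noRoute
	case !h.discovered && h.healthy && !ignoreHealthOnRemoval:
		// Envoy default: keep routing to an absent host that still passes health checks.
		return route
	default:
		// Absent and unhealthy, or absent with health ignored on removal.
		return remove
	}
}

func main() {
	for _, ignore := range []bool{false, true} {
		for _, h := range []host{{true, true}, {true, false}, {false, true}, {false, false}} {
			fmt.Printf("ignore=%v discovered=%v healthy=%v -> %s\n",
				ignore, h.discovered, h.healthy, decide(h, ignore))
		}
	}
}
```

The only case the flag changes is absent but healthy: it is routed to under the Envoy default and removed when health is ignored on removal.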
If the discovery service is not providing endpoints but DNS is, how will this setting help?
Discovery works differently in these cases. xDS-based discovery is inherently more resilient (the control plane is typically a local component, and xDS has built-in guardrails such as caching). There are several examples of DNS-based resolution failures that led to traffic drops:
- DNS Timeout Issue After Upgrading to v1.32.1 envoy#37676
- STRICT_DNS drops cluster members on lookup failure envoy#2691
I agree that the envoy default is not intuitive. We can enable this by default and see if users are interested in the other option as time goes on.
Tested out a few behaviors:
- NXDOMAIN: in both cases endpoints are removed (even when considering health)
- Timeout to DNS: in both cases endpoints are retained (even when ignoring health)
I'm fine with changing the default behavior for STRICT_DNS as well.
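The two failure modes tested above are distinct at the resolver level. As an illustration only, the sketch below uses Go's standard `net.DNSError` (its `IsNotFound` and `IsTimeout` fields) to tell them apart and maps each to the outcome reported in the tests. Envoy uses its own DNS resolver, so this is not how Envoy classifies errors; `classify`, `nxdomain`, `timeout`, and the host name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// outcome is what happened to the cluster's endpoints in the tests above.
type outcome string

const (
	endpointsRemoved  outcome = "endpoints removed"
	endpointsRetained outcome = "endpoints retained"
	notCovered        outcome = "not covered by the tests"
)

// classify maps a lookup error to the observed outcome. Per the tests,
// the outcome was the same with and without health being considered.
func classify(err error) outcome {
	var dnsErr *net.DNSError
	if !errors.As(err, &dnsErr) {
		return notCovered
	}
	switch {
	case dnsErr.IsNotFound: // NXDOMAIN: the name does not exist
		return endpointsRemoved
	case dnsErr.IsTimeout: // the DNS server did not answer in time
		return endpointsRetained
	default:
		return notCovered
	}
}

// nxdomain builds a sample "no such host" error (hypothetical host name).
func nxdomain() error {
	return &net.DNSError{Err: "no such host", Name: "backend.example.internal", IsNotFound: true}
}

// timeout builds a sample lookup timeout error (hypothetical host name).
func timeout() error {
	return &net.DNSError{Err: "i/o timeout", Name: "backend.example.internal", IsTimeout: true}
}

func main() {
	fmt.Println("NXDOMAIN:", classify(nxdomain()))
	fmt.Println("timeout: ", classify(timeout()))
}
```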
Thanks for testing this out @guydc.
+1 on setting this by default, and considering adding a field to opt out if users really need it.
Force-pushed from 3eaabf5 to 70509f3.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6478      +/-   ##
==========================================
+ Coverage   70.74%   70.77%   +0.02%
==========================================
  Files         220      220
  Lines       37594    37594
==========================================
+ Hits        26596    26607      +11
+ Misses       9431     9424       -7
+ Partials     1567     1563       -4
Signed-off-by: Guy Daich <guy.daich@sap.com>
Force-pushed from 9bf3cb7 to d835a1b.
* feat(translator): remove absent healthy endpoints
Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: Tjeerd Jan van der Molen <34071+tjvdmolen@users.noreply.github.com>
What type of PR is this?
What this PR does / why we need it:
Removes absent (no longer discovered) endpoints, without waiting for their health to fail as well.
Which issue(s) this PR fixes:
Fixes #6463
Release Notes: Yes