dns: reinitialize c-ares channel on ARES_ETIMEOUT#41718
dns: reinitialize c-ares channel on ARES_ETIMEOUT#41718agrawroh merged 2 commits intoenvoyproxy:mainfrom
Conversation
|
Hi @Najji, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already. |
668c080 to
b2128d2
Compare
|
AFAIK DNS timeouts might have nothing to do with channel state. DNS could timeout due to intermittent network and reinitializing the channel on every timeout could cause constant churning of the UDP sockets. Could you use |
Hey @agrawroh, thanks for the review! We've seen this pattern in production where c-ares DNS timeouts are followed by requests to stale IPs even hours later. Could you help me understand the rationale for not including
|
Wasn't I clear enough when I said that DNS could timeout due to intermittent network issues as well? How would you differentiate the timeouts happening due to sporadic network issues vs. the ones due to broken channel? |
Thanks for your response. Agreed it's not differentiating but the existing errors can also be due to intermittent network issues, yet we reinit on them. Is there an inconsistency here, or is there differentiation logic I'm missing? The cost trade-off I'm seeing is reinitializing on timeouts (after retries) vs. the production impact we observed (hours of stale IPs despite DNS working). |
|
If there is a concern that timeouts can cause channel churn, what do you think about adding a configuration option to re-init the channel on timeouts? In this way operators that face this problem in production can enable it without impacting other deployments. /wait-any |
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: najji <najjim7@gmail.com>
41af4a0 to
621d81d
Compare
Signed-off-by: najji <najjim7@gmail.com>
621d81d to
6289324
Compare
|
/retest |
Thanks @yanavlasov for the suggestion! I've implemented it as a configurable option (opt-in, defaults to false for backward compatibility). Let me know what you think. |
|
|
||
| // If true, reinitialize the c-ares channel when a DNS query fails with ``ARES_ETIMEOUT``. | ||
| // | ||
| // This can help recover from rare cases where the UDP sockets held by the c-ares |
There was a problem hiding this comment.
If the UDP sockets timeout, wouldn't c-ares try to use a new UDP socket next time it needs to send a DNS query? If you could explain a bit more in detail about the exact flow and socket state that leads to needing to re-initialize the channel, that would be helpful, thanks!
There was a problem hiding this comment.
Hi @abeyad,
This PR builds on work from #9899 to address #4543. We're observing a situation where after DNS timeouts occur, Envoy enters a bad state and continues using stale IP addresses (despite packet capture showing correct IPs are being returned). This is only fixed after we restart Envoy.
ares_reinit() will help achieve the same effect without needing a full restart. The other error codes (ARES_ECONNREFUSED, ARES_ESERVFAIL, etc.) already trigger reinit under the same assumption that the channel state may be broken. The c-ares project has acknowledged similar channel state issues: c-ares/c-ares#301.
Let me know what you think!
There was a problem hiding this comment.
Hi @Najji , I'm not opposed to the option, as it seems we do similar restarting of the c-ares channel for ARES_ECONNREFUSED, as long as it is false by default.
It seems like the issue is in the connection managed by c-ares though, so I'm guessing there is a bug in c-ares that doesn't deal with timed-out connections properly. But given we don't control c-ares, I think it's fine to add this option to get around the issue.
Thanks!
@Najji do you mean the DNS query times out, then a subsequent query to the DNS server sends the query to a stale IP address? |
@abeyad No, the DNS server responses are correct. However, after timeouts occur Envoy stops updating IP lists based on the responses as per our observation. |
|
/lgtm api thanks @Najji ! |
agrawroh
left a comment
There was a problem hiding this comment.
LGTM as well! Thanks for putting it behind a flag.
|
Merging it as it's already approved by a Senior Maintainer. |
|
Thanks for the review everyone! |
Commit Message: dns: reinitialize c-ares channel on ARES_ETIMEOUT Additional Description:: Add ARES_ETIMEOUT to the list of error conditions that trigger c-ares channel reinitialization. When DNS queries timeout, the c-ares channel can enter a broken state where UDP sockets become unusable and subsequent queries continue to fail. This is similar to ARES_ECONNREFUSED and other connection errors that already trigger reinitialization. Risk Level: Low Testing: Updated existing `DnsImplZeroTimeoutTest::Timeout` test to verify channel reinitialization after timeout. Docs Changes: N/A Release Notes: Added to changelogs/current.yaml Platform Specific Features: N/A --------- Signed-off-by: najji <najjim7@gmail.com> Signed-off-by: Gustavo <grnmeira@gmail.com>
Commit Message: dns: reinitialize c-ares channel on ARES_ETIMEOUT
Additional Description::
Add ARES_ETIMEOUT to the list of error conditions that trigger c-ares channel reinitialization. When DNS queries timeout, the c-ares channel can enter a broken state where UDP sockets become unusable and subsequent queries continue to fail. This is similar to ARES_ECONNREFUSED and other connection errors that already trigger reinitialization.
Risk Level: Low
Testing: Updated existing
DnsImplZeroTimeoutTest::Timeouttest to verify channel reinitialization after timeout.Docs Changes: N/A
Release Notes: Added to changelogs/current.yaml
Platform Specific Features: N/A