dns: reinitialize c-ares channel on ARES_ETIMEOUT by Najji · Pull Request #41718 · envoyproxy/envoy

Najji · 2025-10-27T08:05:53Z

Commit Message: dns: reinitialize c-ares channel on ARES_ETIMEOUT

Additional Description:
Add ARES_ETIMEOUT to the list of error conditions that trigger c-ares channel reinitialization. When DNS queries time out, the c-ares channel can enter a broken state in which its UDP sockets become unusable and subsequent queries keep failing. This is similar to ARES_ECONNREFUSED and other connection errors that already trigger reinitialization.

Risk Level: Low

Testing: Updated existing DnsImplZeroTimeoutTest::Timeout test to verify channel reinitialization after timeout.

Docs Changes: N/A

Release Notes: Added to changelogs/current.yaml

Platform Specific Features: N/A

repokitteh-read-only · 2025-10-27T08:05:59Z

Hi @Najji, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #41718 was opened by Najji.


agrawroh · 2025-10-27T10:25:09Z

AFAIK DNS timeouts might have nothing to do with channel state. DNS queries can time out because of intermittent network issues, and reinitializing the channel on every timeout could cause constant churn of the UDP sockets.

Could you use max_udp_channel_duration to periodically reinit the channel?
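
For context, a time-based refresh of the kind that max_udp_channel_duration implies can be sketched as below. This is a minimal illustration, not Envoy's code: the `ChannelAge` class, its method names, and the treatment of a zero duration as "never refresh" are all assumptions made for the example.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical tracker for the age of a resolver channel.
class ChannelAge {
public:
  ChannelAge(Clock::time_point created, std::chrono::milliseconds max_duration)
      : created_(created), max_duration_(max_duration) {}

  // True once the channel has lived for at least max_duration.
  // Assumption: a zero duration disables the periodic refresh.
  bool dueForReinit(Clock::time_point now) const {
    if (max_duration_ == std::chrono::milliseconds::zero()) {
      return false;
    }
    return now - created_ >= max_duration_;
  }

  // Record that the channel was just reinitialized.
  void markReinitialized(Clock::time_point now) { created_ = now; }

private:
  Clock::time_point created_;
  std::chrono::milliseconds max_duration_;
};
```

Because the check depends only on elapsed time, a broken channel is replaced at the next deadline regardless of why queries failed, but nothing happens before that deadline.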

Najji · 2025-10-28T01:27:33Z

Hey @agrawroh, thanks for the review! We've seen this pattern in production where c-ares DNS timeouts are followed by requests to stale IPs even hours later. tcpdump showed DNS resolution from the pod was working fine (returning correct IPs), but Envoy never updated. Only a restart fixed it.

Could you help me understand the rationale for not including ARES_ETIMEOUT, considering it is returned after retries? It also looks semantically similar to the other errors in the list, such as ARES_ESERVFAIL and ARES_ECONNREFUSED.

max_udp_channel_duration may help with automatic recovery, but it is still reactive: the channel is reinitialized only after stale IP addresses have been served for some time.

agrawroh · 2025-10-28T02:32:37Z

Wasn't I clear enough when I said that DNS could time out due to intermittent network issues as well? How would you differentiate timeouts caused by sporadic network issues from those caused by a broken channel?

Najji · 2025-10-28T07:35:08Z

Thanks for your response. Agreed, this change does not differentiate between the two, but the existing errors can also be caused by intermittent network issues, yet we reinit on them. Is there an inconsistency here, or is there differentiation logic I'm missing?

The cost trade-off I'm seeing is reinitializing on timeouts (after retries) vs. the production impact we observed (hours of stale IPs despite DNS working).

yanavlasov · 2025-10-29T14:10:03Z

If there is a concern that timeouts can cause channel churn, what do you think about adding a configuration option to re-init the channel on timeouts? In this way operators that face this problem in production can enable it without impacting other deployments.

/wait-any

repokitteh-read-only · 2025-10-30T02:28:11Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @abeyad
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #41718 was synchronize by Najji.


Signed-off-by: najji <najjim7@gmail.com>

Najji · 2025-10-30T05:17:00Z

/retest

Najji · 2025-10-30T05:29:32Z

Thanks @yanavlasov for the suggestion! I've implemented it as a configurable option (opt-in, defaults to false for backward compatibility). Let me know what you think.
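
A minimal sketch of the opt-in behavior described above, not the code in this PR: `ResolverOptions`, `reinit_on_timeout`, `shouldReinit`, and `countReinits` are hypothetical names, and the `Status` enum stands in for the c-ares codes from `<ares.h>` (the real list of reinit-triggering errors may be longer). It shows that timeouts cause a reinitialization only when the flag is set.

```cpp
#include <vector>

// Stand-in for the c-ares status codes named in this thread.
enum class Status { Success, ConnRefused, ServFail, Timeout };

// Hypothetical option holder; the actual proto field name is not shown here.
struct ResolverOptions {
  bool reinit_on_timeout = false; // opt-in, off by default
};

// True when a query result should trigger a channel reinitialization.
inline bool shouldReinit(Status s, const ResolverOptions& opts) {
  if (s == Status::ConnRefused || s == Status::ServFail) {
    return true; // errors that already trigger reinit
  }
  return s == Status::Timeout && opts.reinit_on_timeout;
}

// Counts the reinitializations a sequence of query results would cause.
inline int countReinits(const std::vector<Status>& results,
                        const ResolverOptions& opts) {
  int count = 0;
  for (Status s : results) {
    if (shouldReinit(s, opts)) {
      ++count;
    }
  }
  return count;
}
```

With the flag left at its default, a burst of timeouts on a flaky network causes no extra socket churn; deployments that hit the stale-IP problem can enable it and accept one reinitialization per timed-out query.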

abeyad · 2025-10-30T16:10:26Z

api/envoy/extensions/network/dns_resolver/cares/v3/cares_dns_resolver.proto

+
+  // If true, reinitialize the c-ares channel when a DNS query fails with ``ARES_ETIMEOUT``.
+  //
+  // This can help recover from rare cases where the UDP sockets held by the c-ares


If the UDP sockets time out, wouldn't c-ares use a new UDP socket the next time it needs to send a DNS query? If you could explain in more detail the exact flow and socket state that make reinitializing the channel necessary, that would be helpful, thanks!

Najji · 2025-10-31

Hi @abeyad,

This PR builds on work from #9899 to address #4543. We're observing a situation where after DNS timeouts occur, Envoy enters a bad state and continues using stale IP addresses (despite packet capture showing correct IPs are being returned). This is only fixed after we restart Envoy.

ares_reinit() will help achieve the same effect without needing a full restart. The other error codes (ARES_ECONNREFUSED, ARES_ESERVFAIL, etc.) already trigger reinit under the same assumption that the channel state may be broken. The c-ares project has acknowledged similar channel state issues: c-ares/c-ares#301.

Let me know what you think!

abeyad · 2025-10-31

Hi @Najji, I'm not opposed to the option, since we already restart the c-ares channel in a similar way for ARES_ECONNREFUSED, as long as it is false by default.

It seems like the issue is in the connection managed by c-ares though, so I'm guessing there is a bug in c-ares that doesn't deal with timed-out connections properly. But given we don't control c-ares, I think it's fine to add this option to get around the issue.

Thanks!

abeyad · 2025-10-30T16:15:29Z

> We've seen this pattern in production where c-ares DNS timeouts are followed by requests to stale IPs even hours later.

@Najji do you mean the DNS query times out, then a subsequent query to the DNS server sends the query to a stale IP address?

Najji · 2025-10-31T01:36:54Z

@abeyad No, the DNS server responses are correct. However, from what we observed, once timeouts occur Envoy stops updating its IP lists from those responses.

yanavlasov · 2025-10-31T20:15:47Z

@Najji thanks for making this change. I will wait for approvals from @agrawroh and @abeyad and will submit.

/wait-any

abeyad · 2025-10-31T22:19:04Z

/lgtm api

thanks @Najji !

agrawroh · 2025-10-31

LGTM as well! Thanks for putting it behind a flag.

agrawroh · 2025-10-31T22:51:22Z

Merging it as it's already approved by a Senior Maintainer.

Najji · 2025-11-01T12:38:27Z

Thanks for the review everyone!

Commit Message: dns: reinitialize c-ares channel on ARES_ETIMEOUT

Additional Description: Add ARES_ETIMEOUT to the list of error conditions that trigger c-ares channel reinitialization. When DNS queries timeout, the c-ares channel can enter a broken state where UDP sockets become unusable and subsequent queries continue to fail. This is similar to ARES_ECONNREFUSED and other connection errors that already trigger reinitialization.

Risk Level: Low

Testing: Updated existing `DnsImplZeroTimeoutTest::Timeout` test to verify channel reinitialization after timeout.

Docs Changes: N/A

Release Notes: Added to changelogs/current.yaml

Platform Specific Features: N/A

---------

Signed-off-by: najji <najjim7@gmail.com>
Signed-off-by: Gustavo <grnmeira@gmail.com>

Najji requested review from mattklein123 and yanavlasov as code owners October 27, 2025 08:05

Najji force-pushed the fix-cares-timeout-reinit branch from 668c080 to b2128d2 Compare October 27, 2025 08:36

agrawroh added the waiting:any label Oct 27, 2025

repokitteh-read-only bot removed the waiting:any label Oct 28, 2025

ggreenway assigned yanavlasov Oct 28, 2025

repokitteh-read-only bot added waiting:any api and removed waiting:any labels Oct 29, 2025

repokitteh-read-only bot assigned abeyad Oct 30, 2025

dns: reinitialize c-ares channel on ARES_ETIMEOUT

84a23b8

Signed-off-by: najji <najjim7@gmail.com>

Najji force-pushed the fix-cares-timeout-reinit branch 4 times, most recently from 41af4a0 to 621d81d Compare October 30, 2025 04:03

api/dns: add c-ares reinit-on-timeout

6289324

Signed-off-by: najji <najjim7@gmail.com>

Najji force-pushed the fix-cares-timeout-reinit branch from 621d81d to 6289324 Compare October 30, 2025 04:22

abeyad reviewed Oct 30, 2025

yanavlasov approved these changes Oct 31, 2025

repokitteh-read-only bot added the waiting:any label Oct 31, 2025

repokitteh-read-only bot removed api waiting:any labels Oct 31, 2025

agrawroh approved these changes Oct 31, 2025

agrawroh merged commit 8eed84f into envoyproxy:main Oct 31, 2025
26 checks passed
