dns: reinitialize c-ares channel on ARES_ETIMEOUT #41718

Merged
agrawroh merged 2 commits into envoyproxy:main from Najji:fix-cares-timeout-reinit
Oct 31, 2025

Conversation

@Najji
Contributor

@Najji Najji commented Oct 27, 2025

Commit Message: dns: reinitialize c-ares channel on ARES_ETIMEOUT

Additional Description:
Add ARES_ETIMEOUT to the list of error conditions that trigger c-ares channel reinitialization. When DNS queries time out, the c-ares channel can enter a broken state where UDP sockets become unusable and subsequent queries continue to fail. This is similar to ARES_ECONNREFUSED and other connection errors that already trigger reinitialization.

Risk Level: Low

Testing: Updated existing DnsImplZeroTimeoutTest::Timeout test to verify channel reinitialization after timeout.

Docs Changes: N/A

Release Notes: Added to changelogs/current.yaml

Platform Specific Features: N/A

@Najji Najji force-pushed the fix-cares-timeout-reinit branch from 668c080 to b2128d2 Compare October 27, 2025 08:36
@agrawroh
Member

AFAIK DNS timeouts might have nothing to do with channel state. DNS queries could time out due to an intermittent network, and reinitializing the channel on every timeout could cause constant churning of the UDP sockets.

Could you use max_udp_channel_duration to periodically reinit the channel?

@Najji
Contributor Author

Najji commented Oct 28, 2025

Hey @agrawroh, thanks for the review! We've seen this pattern in production where c-ares DNS timeouts are followed by requests to stale IPs even hours later. tcpdump showed DNS resolution from the pod was working fine (returning correct IPs), but Envoy never updated. Only a restart fixed it.

Could you help me understand the rationale for not including ARES_ETIMEOUT, considering it is only raised after retries? It also looks semantically similar to the other errors in the list, like ARES_ESERVFAIL and ARES_ECONNREFUSED.

max_udp_channel_duration may help with automatic recovery, but it is still reactive: the channel is only refreshed after stale IP addresses have been served for some time.
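
To make the trade-off in the comment above concrete, here is a minimal sketch of an age-based refresh check in the spirit of max_udp_channel_duration. It is an illustration under stated assumptions, not Envoy's implementation: `channelExpired`, its parameters, and the `Clock` alias are hypothetical names introduced only for this example.

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Hypothetical helper (not Envoy's code): returns true when a channel created
// at `created_at` has lived for at least `max_duration` and is due for a
// periodic refresh. An empty `max_duration` means periodic refresh is off.
bool channelExpired(Clock::time_point created_at, Clock::time_point now,
                    std::optional<std::chrono::milliseconds> max_duration) {
  if (!max_duration.has_value()) {
    return false; // periodic refresh disabled
  }
  // The decision depends only on channel age, never on observed failures.
  return now - created_at >= *max_duration;
}
```

Because a check like this looks only at the age of the channel, a channel that breaks shortly after creation would stay in use until the configured duration elapses, which is the window of stale results the comment is concerned about.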

@agrawroh
Member

Wasn't I clear enough when I said that DNS queries could time out due to intermittent network issues as well? How would you differentiate the timeouts caused by sporadic network issues from the ones caused by a broken channel?

@Najji
Contributor Author

Najji commented Oct 28, 2025

Thanks for your response. Agreed, the change does not differentiate between the two cases, but the existing errors can also be caused by intermittent network issues, yet we reinit on them. Is there an inconsistency here, or is there differentiation logic I'm missing?

The cost trade-off I'm seeing is reinitializing on timeouts (after retries) vs. the production impact we observed (hours of stale IPs despite DNS working).

@yanavlasov
Contributor

If there is a concern that timeouts can cause channel churn, what do you think about adding a configuration option to re-init the channel on timeouts? In this way operators that face this problem in production can enable it without impacting other deployments.

/wait-any

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @abeyad
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

Signed-off-by: najji <najjim7@gmail.com>
@Najji Najji force-pushed the fix-cares-timeout-reinit branch 4 times, most recently from 41af4a0 to 621d81d Compare October 30, 2025 04:03
Signed-off-by: najji <najjim7@gmail.com>
@Najji Najji force-pushed the fix-cares-timeout-reinit branch from 621d81d to 6289324 Compare October 30, 2025 04:22
@Najji
Contributor Author

Najji commented Oct 30, 2025

/retest

@Najji
Contributor Author

Najji commented Oct 30, 2025

Thanks @yanavlasov for the suggestion! I've implemented it as a configurable option (opt-in, defaults to false for backward compatibility). Let me know what you think.
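
As a rough illustration of the opt-in behavior described in this comment, the sketch below gates timeout-triggered reinitialization behind a flag that defaults to false. It is a sketch under assumptions, not the code merged in this PR: `DnsStatus`, `ResolverOptions`, `reinit_channel_on_timeout`, and `shouldReinitChannel` are hypothetical names, and the enum stands in for the real `ARES_*` status constants from c-ares. The set of always-reinit errors shown is limited to the two named in this thread.

```cpp
// Stand-in status codes for illustration only; real code would use the
// ARES_* constants from c-ares instead of this enum.
enum class DnsStatus { Success, ServFail, ConnRefused, Timeout, NotFound };

// Hypothetical options struct; the field name is an assumption for this
// sketch and may not match the option added by the PR.
struct ResolverOptions {
  bool reinit_channel_on_timeout = false; // opt-in, off by default
};

// Returns true when the resolver should rebuild its c-ares channel after a
// query finishes with `status`.
bool shouldReinitChannel(DnsStatus status, const ResolverOptions& options) {
  switch (status) {
  case DnsStatus::ServFail:
  case DnsStatus::ConnRefused:
    // Errors that the thread says already trigger reinitialization.
    return true;
  case DnsStatus::Timeout:
    // Timeouts trigger reinitialization only when the operator opts in.
    return options.reinit_channel_on_timeout;
  default:
    return false;
  }
}
```

With default options a timeout leaves the channel untouched, so existing deployments keep their current behavior; operators who hit the stale-IP problem can set the flag to get a reinit after a timed-out query.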


// If true, reinitialize the c-ares channel when a DNS query fails with ``ARES_ETIMEOUT``.
//
// This can help recover from rare cases where the UDP sockets held by the c-ares
Contributor

If the UDP sockets time out, wouldn't c-ares try to use a new UDP socket the next time it needs to send a DNS query? If you could explain in a bit more detail the exact flow and socket state that leads to needing to re-initialize the channel, that would be helpful, thanks!

Contributor Author

Hi @abeyad,

This PR builds on work from #9899 to address #4543. We're observing a situation where after DNS timeouts occur, Envoy enters a bad state and continues using stale IP addresses (despite packet capture showing correct IPs are being returned). This is only fixed after we restart Envoy.

ares_reinit() will help achieve the same effect without needing a full restart. The other error codes (ARES_ECONNREFUSED, ARES_ESERVFAIL, etc.) already trigger reinit under the same assumption that the channel state may be broken. The c-ares project has acknowledged similar channel state issues: c-ares/c-ares#301.

Let me know what you think!

Contributor

Hi @Najji , I'm not opposed to the option, as it seems we do similar restarting of the c-ares channel for ARES_ECONNREFUSED, as long as it is false by default.

It seems like the issue is in the connection managed by c-ares though, so I'm guessing there is a bug in c-ares that doesn't deal with timed-out connections properly. But given we don't control c-ares, I think it's fine to add this option to get around the issue.

Thanks!

@abeyad
Contributor

abeyad commented Oct 30, 2025

We've seen this pattern in production where c-ares DNS timeouts are followed by requests to stale IPs even hours later.

@Najji do you mean the DNS query times out, then a subsequent query to the DNS server sends the query to a stale IP address?

@Najji
Contributor Author

Najji commented Oct 31, 2025

@abeyad No, the DNS server responses are correct. However, from what we observed, once timeouts occur Envoy stops updating its IP lists based on those responses.

@yanavlasov
Contributor

@Najji thanks for making this change. I will wait for approvals from @agrawroh and @abeyad and will submit.

/wait-any

@abeyad
Contributor

abeyad commented Oct 31, 2025

/lgtm api

thanks @Najji !

Member

@agrawroh agrawroh left a comment

LGTM as well! Thanks for putting it behind a flag.

@agrawroh
Member

Merging it as it's already approved by a Senior Maintainer.

@agrawroh agrawroh merged commit 8eed84f into envoyproxy:main Oct 31, 2025
26 checks passed
@Najji
Contributor Author

Najji commented Nov 1, 2025

Thanks for the review everyone!

grnmeira pushed a commit to grnmeira/envoy that referenced this pull request Mar 20, 2026
Signed-off-by: najji <najjim7@gmail.com>
Signed-off-by: Gustavo <grnmeira@gmail.com>