Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ option (udpa.annotations.file_status).package_version_status = ACTIVE;
// [#extension: envoy.network.dns_resolver.cares]

// Configuration for c-ares DNS resolver.
// [#next-free-field: 11]
// [#next-free-field: 12]
message CaresDnsResolverConfig {
// A list of DNS resolver addresses.
// :ref:`use_resolvers_as_fallback <envoy_v3_api_field_extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig.use_resolvers_as_fallback>`
Expand Down Expand Up @@ -99,4 +99,18 @@ message CaresDnsResolverConfig {
//
// If not specified, no periodic refresh will be performed.
google.protobuf.Duration max_udp_channel_duration = 10 [(validate.rules).duration = {gte {}}];

// If true, reinitialize the c-ares channel when a DNS query fails with ``ARES_ETIMEOUT``.
//
// This can help recover from rare cases where the UDP sockets held by the c-ares
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the UDP sockets timeout, wouldn't c-ares try to use a new UDP socket next time it needs to send a DNS query? If you could explain a bit more in detail about the exact flow and socket state that leads to needing to re-initialize the channel, that would be helpful, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @abeyad,

This PR builds on work from #9899 to address #4543. We're observing a situation where after DNS timeouts occur, Envoy enters a bad state and continues using stale IP addresses (despite packet capture showing correct IPs are being returned). This is only fixed after we restart Envoy.

ares_reinit() will help achieve the same effect without needing a full restart. The other error codes (ARES_ECONNREFUSED, ARES_ESERVFAIL, etc.) already trigger reinit under the same assumption that the channel state may be broken. The c-ares project has acknowledged similar channel state issues: c-ares/c-ares#301.

Let me know what you think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Najji , I'm not opposed to the option, as it seems we do similar restarting of the c-ares channel for ARES_ECONNREFUSED, as long as it is false by default.

It seems like the issue is in the connection managed by c-ares though, so I'm guessing there is a bug in c-ares that doesn't deal with timed-out connections properly. But given we don't control c-ares, I think it's fine to add this option to get around the issue.

Thanks!

// channel become unusable after timeouts, causing subsequent queries to fail or
// Envoy to keep serving stale DNS results. When enabled, a timeout-triggered
// reinitialization attempts to restore healthy state quickly. In environments
// where timeouts are caused by intermittent network issues, enabling this may
// increase channel churn; consider using
// :ref:`max_udp_channel_duration <envoy_v3_api_field_extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig.max_udp_channel_duration>`
// for periodic refresh instead.
//
// Default is false.
bool reinit_channel_on_timeout = 11;
}
4 changes: 4 additions & 0 deletions changelogs/current.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,10 @@ bug_fixes:
Fixed per-route configuration for composite filter to support matching on response headers and trailers.
Previously, per-route matchers would silently fail when attempting to match on ``HttpResponseHeaderMatchInput``
or ``HttpResponseTrailerMatchInput``, causing the delegated filter to be skipped without error.
- area: dns
change: |
c-ares resolver: add optional ``reinit_channel_on_timeout`` to reinitialize
the resolver after DNS timeouts.

removed_config_or_runtime:
# *Normally occurs at the end of the* :ref:`deprecation period <deprecated>`
Expand Down
4 changes: 2 additions & 2 deletions source/extensions/network/dns_resolver/cares/dns_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ DnsResolverImpl::DnsResolverImpl(
? std::chrono::milliseconds(Protobuf::util::TimeUtil::DurationToMilliseconds(
config.max_udp_channel_duration()))
: std::chrono::milliseconds::zero()),
resolvers_csv_(resolvers_csv),
reinit_channel_on_timeout_(config.reinit_channel_on_timeout()), resolvers_csv_(resolvers_csv),
filter_unroutable_families_(config.filter_unroutable_families()),
scope_(root_scope.createScope("dns.cares.")), stats_(generateCaresDnsResolverStats(*scope_)) {
AresOptions options = defaultAresOptions();
Expand Down Expand Up @@ -252,7 +252,7 @@ void DnsResolverImpl::AddrInfoPendingResolution::onAresGetAddrInfoCallback(
// If c-ares returns ARES_ECONNREFUSED and there is no fallback we assume that the channel_ is
// broken and hence we reinitialize it here.
if (status == ARES_ECONNREFUSED || status == ARES_EREFUSED || status == ARES_ESERVFAIL ||
status == ARES_ENOTIMP) {
status == ARES_ENOTIMP || (status == ARES_ETIMEOUT && parent_.reinit_channel_on_timeout_)) {
parent_.reinitializeChannel();
}
}
Expand Down
1 change: 1 addition & 0 deletions source/extensions/network/dns_resolver/cares/dns_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -206,6 +206,7 @@ class DnsResolverImpl : public DnsResolver, protected Logger::Loggable<Logger::I
const bool rotate_nameservers_;
const uint32_t edns0_max_payload_size_;
const std::chrono::milliseconds max_udp_channel_duration_;
const bool reinit_channel_on_timeout_;
const absl::optional<std::string> resolvers_csv_;
const bool filter_unroutable_families_;
Stats::ScopeSharedPtr scope_;
Expand Down
11 changes: 9 additions & 2 deletions test/extensions/network/dns_resolver/cares/dns_impl_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -733,6 +733,9 @@ class DnsImplTest : public testing::TestWithParam<Address::IpVersion> {
cares.mutable_edns0_max_payload_size()->set_value(getEdns0MaxPayloadSize());
}

// Enable `reinit_channel_on_timeout` if requested by the test case.
cares.set_reinit_channel_on_timeout(reinitOnTimeout());

// Copy over the dns_resolver_options_.
cares.mutable_dns_resolver_options()->MergeFrom(dns_resolver_options);
// setup the typed config
Expand All @@ -741,6 +744,8 @@ class DnsImplTest : public testing::TestWithParam<Address::IpVersion> {

return typed_dns_resolver_config;
}
// Whether to enable `reinit_channel_on_timeout` in the resolver config for this test.
virtual bool reinitOnTimeout() const { return false; }

void SetUp() override {
// Instantiate TestDnsServer and listen on a random port on the loopback address.
Expand Down Expand Up @@ -1958,23 +1963,25 @@ TEST_P(DnsImplFilterUnroutableFamiliesDontFilterTest, DontFilterAllV6) {
class DnsImplZeroTimeoutTest : public DnsImplTest {
protected:
bool queryTimeout() const override { return true; }
bool reinitOnTimeout() const override { return true; }
};

// Parameterize the DNS test server socket address.
INSTANTIATE_TEST_SUITE_P(IpVersions, DnsImplZeroTimeoutTest,
testing::ValuesIn(TestEnvironment::getIpVersionsForTest()),
TestUtility::ipTestParamsToString);

// Validate that timeouts result in an empty callback.
// Validate that timeouts result in an empty callback and trigger channel reinitialization.
TEST_P(DnsImplZeroTimeoutTest, Timeout) {
server_->addHosts("some.good.domain", {"201.134.56.7"}, RecordType::A);

EXPECT_NE(nullptr,
resolveWithExpectations("some.good.domain", DnsLookupFamily::V4Only,
DnsResolver::ResolutionStatus::Failure, {}, {}, absl::nullopt));
dispatcher_->run(Event::Dispatcher::RunType::Block);
// After `ARES_ETIMEOUT`, the channel should reinitialize.
checkStats(1 /*resolve_total*/, 0 /*pending_resolutions*/, 0 /*not_found*/,
0 /*get_addr_failure*/, 3 /*timeouts*/, 0 /*reinitializations*/);
0 /*get_addr_failure*/, 3 /*timeouts*/, 1 /*reinitializations*/);
}

// Validate that c-ares query cache is disabled by default.
Expand Down
Loading