dns resolvers: add All lookup mode#18464
Conversation
Signed-off-by: Jose Nino <jnino@lyft.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
| @@ -104,16 +104,24 @@ void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, i | |||
| // ARES_ECONNREFUSED. If the PendingResolution has not been cancelled that means that the | |||
There was a problem hiding this comment.
The c-ares based resolver changed a bit more because it was not originally written to accumulate responses from multiple c-ares callbacks.
There was a problem hiding this comment.
I refactored for ease of understanding but open to comments to simplify further!
source/common/network/dns_impl.cc
Outdated
| dns_name_); | ||
|
|
||
| if (!pending_response_.address_list_.empty()) { | ||
| ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_); |
There was a problem hiding this comment.
Here and below I considered a failure in the second resolution an overall success if the first resolution was successful, and hence fired the callback with the existing addresses and a success state. Up to discussing if this should be the behavior.
There was a problem hiding this comment.
yeah IMO a resolve success of either v4 or v6 should be considered success.
cc @DavidSchinazi @RyanTheOptimist for informed opinions :-)
There was a problem hiding this comment.
+1, success of v4 OR v6 should mean overall success since we have usable addresses
Signed-off-by: Jose Nino <jnino@lyft.com>
2344225 to
386c2cb
Compare
source/common/network/dns_impl.cc
Outdated
| dns_name_); | ||
|
|
||
| if (!pending_response_.address_list_.empty()) { | ||
| ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_); |
There was a problem hiding this comment.
yeah IMO a resolve success of either v4 or v6 should be considered success.
cc @DavidSchinazi @RyanTheOptimist for informed opinions :-)
| } | ||
| if (!fallback_if_failed_) { | ||
|
|
||
| if (!dual_resolution_) { |
There was a problem hiding this comment.
hm, are we holding all forward progress on the dns resolver returning v4 and v6 addreses? That might work for a first pass but I believe that the chrome folks have told me it's a bad idea in general for latency / Quality of Experience reasons. If they confirm, I think we need a TODO to handle ipv4 and ipv6 asynchronously.
There was a problem hiding this comment.
Yeah I was thinking about this as well from two different angles:
- Should both c-ares function calls happen concurrently?
- This applies both for c-ares and for apple: should we stream results as they come in to the callback target.
(1) is easy and I could do that in this PR. (2) is a bit harder, and I wouldn't want to do it in this PR.
There was a problem hiding this comment.
Personally I would do (1) and not (2) for now.
There was a problem hiding this comment.
@alyssawilk @mattklein123 re:(1) do we have a preference for all three modes (All,Auto,V4Preferred) to do the calls concurrently, vs only All doing it concurrently and Auto/V4Preferred only do the second call if the first one fails/returns no addresses?
There was a problem hiding this comment.
Hm, less sure on that - I'll defer to Matt and David.
Totally fine with not doing 2 in this PR but I think we should either comment or TODO that it has latency implications and should probably be fixed at some point.
There was a problem hiding this comment.
@alyssawilk @mattklein123 I settled on doing both lookups concurrrently for All and leaving Auto and V4Preferred with sequential lookups. I did not want to block this PR on making the necessary changes to enable concurrent lookups for Auto/V4Preferred given that the existing behavior is already sequential, and concurrent lookup for All did not need bigger changes.
I am already part of the way on the TODO, so will put that up as a separate PR.
There was a problem hiding this comment.
Agree that waiting for both means users are waiting unnecessarily. But tackling in a separate PR with TODO here sounds good
There was a problem hiding this comment.
SG. Do we have a TODO on not waiting for both?
There was a problem hiding this comment.
This definitely seems like a good way to handle this for now. While suboptimal because it block progres until both return, Chrome does exactly this today. Solving #18572 (later) is a great chance to make envoy mobile better than Chrome!
|
oh,I had one other question which is if we're doing the scoping to 5 as disussed on slack, if we should do it here, or where we get the resolution? I have a TODO in my code for where we scope it, but no strong opinion I'll send mine out in draft form today so you can TAL |
My preference is for the callback target to decide what it does with all the addresses that the resolver gives it. That was static dns cluster takes everything, logical takes one, dns cache with happy eyeballs does it thing, etc. |
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
|
@alyssawilk @mattklein123 updated! |
| return pending_response_.v4_responses_; | ||
| case DnsLookupFamily::All: | ||
| ASSERT(pending_response_.all_responses_.empty()); | ||
| pending_response_.all_responses_.insert(pending_response_.all_responses_.end(), |
There was a problem hiding this comment.
If I'm understanding this correctly, it's adding v4 addresses before v6 addresses. In general I'd suggest v6 first but you can get fancier as per the happy eyeballs spec
There was a problem hiding this comment.
yep, that's the case. But the callback receiver can do what they want with the list, and is where I would expect the ordering complexity to happen; so I am not too opinionated with order here.
There was a problem hiding this comment.
I think we should follow the RFC 8305 (the happy eyeballs spec) as David suggest, and I agree with Jose that we should probably do that in HappyEyeballsConnectionImpl. I'd be happy to take a whack at that.
source/common/network/dns_impl.cc
Outdated
| dns_name_); | ||
|
|
||
| if (!pending_response_.address_list_.empty()) { | ||
| ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_); |
There was a problem hiding this comment.
+1, success of v4 OR v6 should mean overall success since we have usable addresses
| } | ||
| if (!fallback_if_failed_) { | ||
|
|
||
| if (!dual_resolution_) { |
There was a problem hiding this comment.
Agree that waiting for both means users are waiting unnecessarily. But tackling in a separate PR with TODO here sounds good
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
|
I see the coverage failure. Writing a test for the 1 uncovered line |
Signed-off-by: Jose Nino <jnino@lyft.com>
|
(needs main merge) |
mattklein123
left a comment
There was a problem hiding this comment.
Awesome, thanks for working on this. A few comments from a read through, thanks.
/wait
| return pending_response_.v4_responses_; | ||
| case DnsLookupFamily::All: | ||
| ASSERT(pending_response_.all_responses_.empty()); | ||
| pending_response_.all_responses_.insert(pending_response_.all_responses_.end(), |
source/common/network/dns_impl.h
Outdated
| // or Auto, perform a second resolution if the first one fails. If dns_lookup_family_ is All | ||
| // perform the second resolution whether the first one fails or succeeds. |
There was a problem hiding this comment.
Is this comment out of date? It conflicts with the one above about concurrent? The behavior is that both are concurrent, but we don't raise the final callback until both are done? Or is that not correct? (From further reading I'm pretty sure it's correct but I would do a 2nd comment pass as the logic is actually pretty tricky and I would be concerned we might be missing some corner cases.
source/common/network/dns_impl.cc
Outdated
|
|
||
| void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, int timeouts, | ||
| ares_addrinfo* addrinfo) { | ||
| // Purely for checking for invalid state. This will change is Auto and V4Preferred are changed to |
There was a problem hiding this comment.
| // Purely for checking for invalid state. This will change is Auto and V4Preferred are changed to | |
| // Purely for checking for invalid state. This will change if Auto and V4Preferred are changed to |
?
source/common/network/dns_impl.cc
Outdated
| // The only way the resolver would have addresses here is if it is configured to resolve All | ||
| // families, and the first resolution was successful. In that case, we might as well return | ||
| // the addresses that resolved. | ||
| ASSERT(dns_lookup_family_ == DnsLookupFamily::All && !dual_resolution_); |
There was a problem hiding this comment.
nit: this seems a bit fragile. Should you change status to ARES_SUCCESS in this case? Do we have test coverage for this case?
source/common/network/dns_impl.cc
Outdated
| if (pending_response_.address_list_.empty()) { | ||
| pending_response_.status_ = ResolutionStatus::Failure; |
There was a problem hiding this comment.
Is this true? Couldn't the list actually be empty? Or is this due to the new concurrent behavior and the 2nd one failing? Maybe this is related to my comment above about switching status? Can you add more comments?
52d1a88
|
@mattklein123 Thanks for the comments! I needed to rearrange a piece of code because there was not way to get coverage i.e., I couldn't force c-ares to behave in a way to get us on that path. At first glance I think I covered the new arrangement, but will make sure to double check. At the very least I will add more comments in order to leave everything a bit more clear. I will update when I am back 10/26. cc @alyssawilk |
|
/wait |
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
|
Alright, @mattklein123 when I came back to this with fresh eyes I realized that constructing the pending response as a failure and only changing to success when the c-ares callback gave us an ARES_SUCCESS made the code simpler. All the cases I can think about are covered with the existing and added test cases. Let me know if this is better! |
mattklein123
left a comment
There was a problem hiding this comment.
The DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback is a lot more complicated than I would like and pretty hard to follow, so if you see any further opportunities to make that easier to understand that would be a nice follow up. Otherwise LGTM after staring at this for a while. Thanks for adding!
Yeah it has never been the easiest to follow. I think this PR is a good first step by taking out |
* main: (221 commits) deps: Bump `protobuf` -> 3.19.0 (envoyproxy#18471) tooling: auto-assign dependency shephards (envoyproxy#18794) clang-tidy: Return from diff fun if empty diff (envoyproxy#18815) repokitteh: Block PRs pending deps approval (envoyproxy#18814) deps: Bump `org_llvm_llvm` -> 12.0.1, `com_github_wavm_wavm` -> 9ffd3e2 (envoyproxy#18747) dns resolvers: add All lookup mode (envoyproxy#18464) doc: fix link formatting for TLS session_timeout (envoyproxy#18790) ext_authz: Set response flag and code details to UAEX when denied (envoyproxy#18740) socket options: add support for directly creating ipv4/ipv6 pairs (envoyproxy#18769) ecds: make onConfigUpdate generic over filter type (envoyproxy#18061) bazel: update CMake instructions in EXTERNAL_DEPS.md (envoyproxy#18799) upstream: fix typo in comment (envoyproxy#18798) runtime: removing envoy.reloadable_features.grpc_json_transcoder_adhere_to_buffer_limits (envoyproxy#18696) bazel: Add CC=clang to clang configuration (envoyproxy#18732) fix error request id in the dubbbo local reply (envoyproxy#18741) event: assert the case of both read and closed event registered (envoyproxy#18265) tcp proxy connect tunneling: improved testing (envoyproxy#18784) deps: Bump `protoc-gen-validate` -> 0.6.2 (envoyproxy#18742) deps: Bump `rules_pkg` -> ad57589 (envoyproxy#18746) bazel: copy .bazelversion for envoy filter examples (envoyproxy#18730) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Commit Message: dns resolvers - add All lookup mode
Additional Description: this change will enable callback targets to access both address families for a given hostname.
Risk Level: low - new behavior guarded by config option
Testing: added UT
Docs Changes: updated
Release Notes: updated