dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED#9899
dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED#9899junr03 merged 8 commits intoenvoyproxy:masterfrom junr03:cares-channel
Conversation
|
opening as draft until I write tests monday. cc @htuch as the original writer. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for investigating and fixing this issue. Very excited to see this finally understood and resolved.
|
Whoops did not mean for you to review my half baked code. Thanks for the comments though, I agree with all of them @mattklein123. Will update and write tests Monday. |
Signed-off-by: Jose Nino <jnino@lyft.com>
junr03
left a comment
There was a problem hiding this comment.
@mattklein123 updated with your comments
| // The channel cannot be destroyed and reinitialized here because that leads to a c-ares | ||
| // segfault. | ||
| if (status == ARES_ECONNREFUSED) { | ||
| parent_.dirty_channel_ = true; |
There was a problem hiding this comment.
I thought about this for a bit and debated between returning here, vs. allowing execution to continue and eventually call the callback_.
I ended up going for allowing the callback for a couple reasons:
- Ease of testing, as having the callback allows for nice event loop triggers and a way to test assertions on the address_list.
- It preserves previous behavior where every time resolution completed, whether successfully or not (with the exception of destruction), the callback_ was called. Obviously after fallbacks were exhausted, and if the query was not cancelled.
wdyt?
There was a problem hiding this comment.
Yeah I agree we need to call the callback. As you already pointed out this is related to the other issue, so let's just fix the callbacks to expose errors in a future change.
mattklein123
left a comment
There was a problem hiding this comment.
Awesome work investigating this and fixing! Very excited to see this finally understood.
| // The channel cannot be destroyed and reinitialized here because that leads to a c-ares | ||
| // segfault. | ||
| if (status == ARES_ECONNREFUSED) { | ||
| parent_.dirty_channel_ = true; |
There was a problem hiding this comment.
Yeah I agree we need to call the callback. As you already pointed out this is related to the other issue, so let's just fix the callbacks to expose errors in a future change.
| // If that flag needs to be set, or c-ares changes its handling this test will need to be updated | ||
| // to create another condition where c-ares invokes onAresGetAddrInfoCallback with status == | ||
| // ARES_ECONNREFUSED. |
There was a problem hiding this comment.
Agreed this is a little fragile but I think it's OK for now. Nice work figuring out a way to test this!
Bumping to include: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: envoyproxy/envoy#9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS - `gzip: add force load factory declaration`: envoyproxy/envoy#9942. This will allow us to use the gzip filter with Envoy Mobile Signed-off-by: Michael Rebello <me@michaelrebello.com>
Bumping to include the following fixes: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: envoyproxy/envoy#9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes #672, though more improvements to DNS resolution will be investigated in #673 - `gzip: add force load factory declaration`: envoyproxy/envoy#9942. This will allow us to use the gzip filter with Envoy Mobile - `api listener: add shutdown method and call during server termination`: envoyproxy/envoy#9959. Fixes #667. Fixes #674 Includes the following PRs to fix iOS liveliness tests as well: - `metrics service: force link v2 config`: envoyproxy/envoy#9875 - `config: remove ApiTypeOracle assert`: envoyproxy/envoy#9973 Also contains necessary updates to accommodate [this change](envoyproxy/envoy@dea4eb0). Signed-off-by: Michael Rebello <me@michaelrebello.com>
Description:PendingResolutions get destroyed when complete or when c-ares sent ARES_EDESTRUCTION. Prior to #9899 ARES_EDESTRUCTION only happened when the resolver was destroyed. However, after #9899 the channel, and thus PendingResolutions can be destroyed, without the callback target being aware. This leads to potential use after free issues. This PR calls the resolve callback when the status received is ARES_EDESTRUCTION. Risk Level: med Testing: added test that reproduced segfault and passed after fix. Signed-off-by: Jose Nino <jnino@lyft.com>
Bumping to include the following fixes: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: #9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes envoyproxy/envoy-mobile#672, though more improvements to DNS resolution will be investigated in envoyproxy/envoy-mobile#673 - `gzip: add force load factory declaration`: #9942. This will allow us to use the gzip filter with Envoy Mobile - `api listener: add shutdown method and call during server termination`: #9959. Fixes envoyproxy/envoy-mobile#667. Fixes envoyproxy/envoy-mobile#674 Includes the following PRs to fix iOS liveliness tests as well: - `metrics service: force link v2 config`: #9875 - `config: remove ApiTypeOracle assert`: #9973 Also contains necessary updates to accommodate [this change](dea4eb0). Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
Bumping to include the following fixes: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: #9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes envoyproxy/envoy-mobile#672, though more improvements to DNS resolution will be investigated in envoyproxy/envoy-mobile#673 - `gzip: add force load factory declaration`: #9942. This will allow us to use the gzip filter with Envoy Mobile - `api listener: add shutdown method and call during server termination`: #9959. Fixes envoyproxy/envoy-mobile#667. Fixes envoyproxy/envoy-mobile#674 Includes the following PRs to fix iOS liveliness tests as well: - `metrics service: force link v2 config`: #9875 - `config: remove ApiTypeOracle assert`: #9973 Also contains necessary updates to accommodate [this change](dea4eb0). Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description: this PR adds logic to the DnsResolverImpl to destroy and re-initialize its c-ares channel under certain circumstances. A better option would require work in c-ares c-ares/c-ares#301.
Risk Level: med changes in low-level DNS resolution.
Testing: unit tests
Fixes #4543
Signed-off-by: Jose Nino jnino@lyft.com