grpc stream: reduce log level depending on remote close status #17300
snowp merged 11 commits into envoyproxy:main
Conversation
Signed-off-by: Taylor Barrella <tabarr@google.com>
source/common/config/grpc_stream.h
Outdated
bool onlyWarnOnRepeatedFailure(Grpc::Status::GrpcStatus status) {
  return Grpc::Status::WellKnownGrpcStatus::Unavailable == status ||
         Grpc::Status::WellKnownGrpcStatus::DeadlineExceeded == status ||
         Grpc::Status::WellKnownGrpcStatus::Internal == status;
This is included due to the example in #14591 (error 13), but gRPC docs classify this as a serious error:
Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.
I'm questioning whether the server should return this in the first place if it doesn't seem to be serious. @howardjohn @kyessenov @mandarjog thoughts?
I am pretty sure "StreamAggregatedResources gRPC config stream closed: 13" comes from keepalive closing the connection in the bug. I am a bit surprised it shows up as 13 and not Unavailable. It may be worth exploring a bit more.
FWIW, if you want to test, you can set --keepaliveMaxServerConnectionAge=5s on Istiod to get an XDS server that closes connections this way. If you use out-of-the-box Istio, we have an XDS proxy that translates the error to OK anyway, though.
Thank you. That makes sense to me since it's happening every 30 minutes in that example. It looks like in recent Istio (1.10.2) the recurring error is now 14 (Unavailable)
2021-07-12T23:58:48.196332Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, transport is closing
After building an Envoy with this change and not including this line (i.e. not treating Internal as retriable), I no longer got warnings for the above error. So I think we should not special-case Internal for now, and only DeadlineExceeded/Unavailable.
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
snowp
left a comment
Thanks, some comments to get you started.
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
/retest
Retrying Azure Pipelines:
mattklein123
left a comment
Thanks, generally makes sense to me, with some small comments.
/wait
source/common/config/grpc_stream.h
Outdated
  stream_ = async_client_->start(service_method_, *this, Http::AsyncClient::StreamOptions());
  if (stream_ == nullptr) {
-   ENVOY_LOG(warn, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new stream");
Isn't this one pretty important though if the config server is completely busted? (No healthy hosts, etc.)
IMO we should make this error message easier to understand:
-   ENVOY_LOG(debug, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new stream to configuration server");
But also potentially rate limit the output? WDYT?
#14616 (comment) asserted that this message is always accompanied by the remote close message. Tracing the code (see e.g. the code around here), this does seem to be the case, at least for AsyncStreamImpl. If you prefer to be conservative, I may just keep this as a warning for now rather than adding more logic for it.
If you are sure it always prints together I think it's fine to downgrade, but I would confirm with manual testing. The issue here is I don't think every case that this would get printed in would result in a remote close, for example no healthy host.
  return Grpc::Status::WellKnownGrpcStatus::DeadlineExceeded == status ||
         Grpc::Status::WellKnownGrpcStatus::Unavailable == status;
Can you add some comments on how you decided these?
Looking at https://grpc.github.io/grpc/core/md_doc_statuscodes.html I would naively assume RESOURCE_EXHAUSTED should also be included. Maybe others?
Sure, will add a comment. I could see why RESOURCE_EXHAUSTED would be included, so I'll add that. At first I was thinking the resource is ambiguous and wasn't sure how likely it would be that retrying would help, but either way it seems suppressing the log for the first 30s makes sense. I have a hard time seeing how any of the others would be included though. The only other ones that seemed potentially worth retrying immediately to me were ABORTED and FAILED_PRECONDITION, for which it says
(a) Use UNAVAILABLE if the client can retry just the failing call. (b) Use ABORTED if the client should retry at a higher level (e.g., when a client-specified test-and-set fails, indicating the client should restart a read-modify-write sequence). (c) Use FAILED_PRECONDITION if the client should not retry until the system state has been explicitly fixed.
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
mattklein123
left a comment
Thanks, this looks great and is an awesome improvement. Just one question remaining. @snowp any further comments?
/wait-any
  stream_ = async_client_->start(service_method_, *this, Http::AsyncClient::StreamOptions());
  if (stream_ == nullptr) {
-   ENVOY_LOG(warn, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new stream to configuration server");
Where did we land on debug vs. warn on this? I'm concerned we may be losing information here but I'm not positive.
Sorry, testing this manually
In manual testing with Istio I'm finding I can't trigger this branch (even while setting the discovery address to a dead end), though I do get the error
2021-07-20T20:11:11.828138Z debug envoy config StreamAggregatedResources gRPC config stream closed: 14, connection error: desc = "transport: Error while dialing dial tcp: lookup istiod-fail.istio-system.svc on 10.20.0.10:53: no such host"
In my manual testing Istio has an xDS proxy and I wonder if the proxying behavior is preventing this branch from being reached. This causes the last close status to be reset a few lines below and the log level is never escalated. Does this sound expected @howardjohn?
I'll continue with trying to test in a way that causes this branch to be triggered, i.e. without Istio or without the xDS proxy
I think one way to trigger this branch for sure is to have the xDS cluster have zero hosts (zero DNS records, etc.). I think it should definitely trigger then.
(If time is limited the other option is to just leave it at warn for now.)
Potentially clearCloseStatus should be moved to here...
Sorry I posted my last message before seeing yours. That makes sense, although currently this doesn't work with Istio due to the status always being cleared, so I'd like to fix that before merging
Ok, updates:
- Even when there's no host, onRemoteClose is triggered, so it currently seems fine to keep this log at debug. Example messages:
  2021-07-21T00:06:41.418077Z debug envoy config StreamAggregatedResources gRPC config stream closed: 14, Cluster not available
  ...
  2021-07-21T00:06:41.938635Z debug envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
  ...
  2021-07-21T00:07:12.423970Z warning envoy config StreamAggregatedResources gRPC config stream closed since 31005ms ago: 14, Cluster not available
- To also handle setups where there's a proxy between Envoy and the xDS server, it seems like it makes sense to move clearCloseStatus from establishNewStream to onReceiveMessage, which already contains backoff_strategy_->reset(). This is because the xDS proxy needs to accept the stream from Envoy before forwarding it to the xDS server. This also gives some robustness in case there are other situations in which Envoy repeatedly establishes a config stream only for it to immediately be closed with Unavailable.
@mattklein123 WDYT?
Sure that sounds fine, thanks.
Signed-off-by: Taylor Barrella <tabarr@google.com>
/retest
Retrying Azure Pipelines:
/retest
Retrying Azure Pipelines:
@snowp any additional comments on this one?
…proxy#17300) Signed-off-by: Taylor Barrella <tabarr@google.com>
Commit Message:
grpc stream: reduce log level depending on remote close status
Signed-off-by: Taylor Barrella tabarr@google.com
Additional Description:
Risk Level: Low
Testing: Unit
Docs Changes: N/A
Release Notes: Noted log level reduction
Fixes #14591