grpc stream: reduce log level depending on remote close status #17300
snowp merged 11 commits into envoyproxy:main
Conversation
Signed-off-by: Taylor Barrella <tabarr@google.com>
source/common/config/grpc_stream.h
Outdated
bool onlyWarnOnRepeatedFailure(Grpc::Status::GrpcStatus status) {
  return Grpc::Status::WellKnownGrpcStatus::Unavailable == status ||
         Grpc::Status::WellKnownGrpcStatus::DeadlineExceeded == status ||
         Grpc::Status::WellKnownGrpcStatus::Internal == status;
This is included due to the example in #14591 (error 13), but gRPC docs classify this as a serious error:
Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.
I'm questioning whether the server should return this in the first place if it doesn't seem to be serious. @howardjohn @kyessenov @mandarjog thoughts?
I am pretty sure "StreamAggregatedResources gRPC config stream closed: 13" comes from keepalive closing the connection in the bug. I am a bit surprised it shows up as 13 and not Unavailable. It may be worth exploring a bit more.
FWIW, if you want to test, you can set --keepaliveMaxServerConnectionAge=5s on Istiod to get an XDS server that closes connections this way. If you use out-of-the-box Istio, we have an XDS proxy that translates the error to OK anyway, though.
Thank you. That makes sense to me since it's happening every 30 minutes in that example. It looks like in recent Istio (1.10.2) the recurring error is now 14 (Unavailable)
2021-07-12T23:58:48.196332Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, transport is closing
After building an Envoy with this change and not including this line (i.e. not treating Internal as retriable), I no longer got warnings for the above error. So I think we should not special-case Internal for now, and only DeadlineExceeded/Unavailable.
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
snowp
left a comment
Thanks, some comments to get you started.
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
/retest
Retrying Azure Pipelines:
mattklein123
left a comment
Thanks, generally makes sense to me, with some small comments.
/wait
source/common/config/grpc_stream.h
Outdated
  stream_ = async_client_->start(service_method_, *this, Http::AsyncClient::StreamOptions());
  if (stream_ == nullptr) {
-   ENVOY_LOG(warn, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new stream");
Isn't this one pretty important though if the config server is completely busted? (No healthy hosts, etc.)
IMO we should make this error message easier to understand:
-   ENVOY_LOG(debug, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new stream to configuration server");
But also potentially rate limit the output? WDYT?
#14616 (comment) asserted that this message is always accompanied by the remote close message. Tracing the code (see e.g. the code around here), this does seem to be the case, at least for AsyncStreamImpl. If you prefer to be conservative, I may just keep this as a warning for now rather than adding more logic for it.
If you are sure it always prints together I think it's fine to downgrade, but I would confirm with manual testing. The issue here is I don't think every case that this would get printed in would result in a remote close, for example no healthy host.
  return Grpc::Status::WellKnownGrpcStatus::DeadlineExceeded == status ||
         Grpc::Status::WellKnownGrpcStatus::Unavailable == status;
Can you add some comments on how you decided these?
Looking at https://grpc.github.io/grpc/core/md_doc_statuscodes.html I would naively assume RESOURCE_EXHAUSTED should also be included. Maybe others?
Sure, will add a comment. I could see why RESOURCE_EXHAUSTED would be included, so I'll add that. At first I was thinking the resource is ambiguous and wasn't sure how likely it would be that retrying would help, but either way it seems suppressing the log for the first 30s makes sense. I have a hard time seeing how any of the others would be included though. The only other ones that seemed potentially worth retrying immediately to me were ABORTED and FAILED_PRECONDITION, for which it says
(a) Use UNAVAILABLE if the client can retry just the failing call. (b) Use ABORTED if the client should retry at a higher level (e.g., when a client-specified test-and-set fails, indicating the client should restart a read-modify-write sequence). (c) Use FAILED_PRECONDITION if the client should not retry until the system state has been explicitly fixed.
Signed-off-by: Taylor Barrella <tabarr@google.com>
Signed-off-by: Taylor Barrella <tabarr@google.com>
mattklein123
left a comment
Thanks, this looks great and is an awesome improvement. Just one question remaining. @snowp any further comments?
/wait-any
  stream_ = async_client_->start(service_method_, *this, Http::AsyncClient::StreamOptions());
  if (stream_ == nullptr) {
-   ENVOY_LOG(warn, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new stream to configuration server");
Where did we land on debug vs. warn on this? I'm concerned we may be losing information here but I'm not positive.
Sorry, testing this manually
In manual testing with Istio I'm finding I can't trigger this branch (even while setting the discovery address to a dead end), though I do get the error
2021-07-20T20:11:11.828138Z debug envoy config StreamAggregatedResources gRPC config stream closed: 14, connection error: desc = "transport: Error while dialing dial tcp: lookup istiod-fail.istio-system.svc on 10.20.0.10:53: no such host"
In my manual testing Istio has an xDS proxy and I wonder if the proxying behavior is preventing this branch from being reached. This causes the last close status to be reset a few lines below and the log level is never escalated. Does this sound expected @howardjohn?
I'll continue with trying to test in a way that causes this branch to be triggered, i.e. without Istio or without the xDS proxy
I think one way to trigger this branch for sure is to have the xDS cluster have zero hosts (zero DNS records, etc.). I think it should definitely trigger then.
(If time is limited the other option is to just leave it at warn for now.)
Potentially clearCloseStatus should be moved to here...
Sorry I posted my last message before seeing yours. That makes sense, although currently this doesn't work with Istio due to the status always being cleared, so I'd like to fix that before merging
Ok, updates:
- Even when there's no host, onRemoteClose is triggered, so it currently seems fine to keep this log at debug. Example messages:
  2021-07-21T00:06:41.418077Z debug envoy config StreamAggregatedResources gRPC config stream closed: 14, Cluster not available
  ...
  2021-07-21T00:06:41.938635Z debug envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
  ...
  2021-07-21T00:07:12.423970Z warning envoy config StreamAggregatedResources gRPC config stream closed since 31005ms ago: 14, Cluster not available
- To also handle setups where there's a proxy between Envoy and the xDS server, it seems like it makes sense to move clearCloseStatus from establishNewStream to onReceiveMessage, which already contains backoff_strategy_->reset(). This is because the xDS proxy needs to accept the stream from Envoy before forwarding it to the xDS server. This also gives some robustness in case there are other situations in which Envoy repeatedly establishes a config stream only for it to immediately be closed with Unavailable.
@mattklein123 WDYT?
Sure that sounds fine, thanks.
Signed-off-by: Taylor Barrella <tabarr@google.com>
/retest
Retrying Azure Pipelines:
/retest
Retrying Azure Pipelines:
@snowp any additional comments on this one?
…proxy#17300) Signed-off-by: Taylor Barrella <tabarr@google.com>
Commit Message:
grpc stream: reduce log level depending on remote close status
Signed-off-by: Taylor Barrella tabarr@google.com
Additional Description:
Risk Level: Low
Testing: Unit
Docs Changes: N/A
Release Notes: Noted log level reduction
Fixes #14591