
XDS Connection closure smart logging #14616

Closed

mandarjog wants to merge 3 commits into envoyproxy:main from mandarjog:config_noise

Conversation

@mandarjog
Contributor

XDS connection closures are logged as warnings only after repeated failures with certain types of status codes.

Fixes #14591

Signed-off-by: Mandar U Jog mjog@google.com

Will follow up with tests if this approach is ok.
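The approach described above could be sketched as a small tracker that decides when a closure is warning-worthy. This is an illustrative sketch only, not the PR's actual code: the class name `FailureTracker`, the status-code set, and the threshold are all assumptions.

```cpp
#include <set>

// Hypothetical sketch: decide whether a gRPC stream closure should be
// logged at warn level. Closures with "noisy" status codes only warn
// once the failure has repeated; other codes always warn.
class FailureTracker {
public:
  // Illustrative set of codes that indicate expected churn, e.g.
  // 0 (OK, graceful close) and 14 (UNAVAILABLE, control-plane restart).
  bool isNoisyStatus(int grpc_status) const {
    static const std::set<int> noisy = {0, 14};
    return noisy.count(grpc_status) > 0;
  }

  // Returns true if this closure should be logged at warn level.
  bool shouldWarn(int grpc_status) {
    ++consecutive_failures_;
    if (!isNoisyStatus(grpc_status)) {
      return true; // unexpected codes always warn
    }
    return consecutive_failures_ > kWarnThreshold;
  }

  // Reset the counter once the stream is healthy again.
  void onConnected() { consecutive_failures_ = 0; }

private:
  static constexpr int kWarnThreshold = 2; // illustrative threshold
  int consecutive_failures_{0};
};
```

The key property is that a single transient closure with a noisy code stays quiet, while a persistent or unexpected failure still surfaces at warn level.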

@mandarjog mandarjog force-pushed the config_noise branch 2 times, most recently from bd2d53d to 49c88ac Compare January 9, 2021 05:26
XDS connection closures are logged as warnings only for
repeated failures for certain type of status codes.

Signed-off-by: Mandar U Jog <mjog@google.com>
Member

@htuch htuch left a comment


Thanks; @mandarjog can you add tests for all these new behaviors? There's a lot of different branch conditions, so we should have some coverage of the logic here.
/wait

```diff
  stream_ = async_client_->start(service_method_, *this, Http::AsyncClient::StreamOptions());
  if (stream_ == nullptr) {
-   ENVOY_LOG(warn, "Unable to establish new stream");
+   ENVOY_LOG(debug, "Unable to establish new grpc config stream");
```
Member


Shouldn't this be a warning? There's a real failure occurring.

Contributor Author


The real failure, with its error message, is recorded in the callback. I don't think there is a case where something is recorded here but not in the callback. This log is also a source of noise, since it does not state the reason.

```cpp
void onReceiveMessage(ResponseProtoPtr<ResponseProto>&& message) override {
  // Reset here so that it starts with fresh backoff interval on next disconnect.
  backoff_strategy_->reset();
  unsetFailure();
```
Member


Why is this needed? Isn't it sufficient to unset on successful connect?

Contributor Author


It was the comment on the next line that led me to believe I needed a reset here, too.
But I think it makes sense to remove it. Ack.

@htuch htuch self-assigned this Jan 11, 2021
@htuch htuch requested a review from ramaraochavali January 11, 2021 16:28
@howardjohn
Contributor

A release note may be nice, as this will impact everyone's logs

Contributor

@howardjohn howardjohn left a comment


I am not sure masking errors to debug level is the right approach.

Moving status code 0 to debug level makes sense to me, but I think we want to see errors?

Signed-off-by: Mandar U Jog <mjog@google.com>
@howardjohn
Contributor

I wasn't sure from the code, do we log the first error?

XDS Failed 0
XDS Failed 1
XDS Failed 2
XDS Failed 3
XDS connected

For example, in the above ^, do we log the failure 3 or failure 0?

In my opinion, logging the first failure is ideal: if we log the first failure and successful connections, then we know that everything between failure 0 and "connected" is more failures, even if they are not explicitly logged. If we log only later failures, we may miss when the issues first start.

@mandarjog
Contributor Author

@htuch can you point me to an example of faking time?

@mandarjog
Contributor Author

If xDS reconnects within the allotted time, then we do not log anything (there is no connection-successful message yet).

At debug level, we log every event.

If the connection does not recover within the allotted time, we print the original error and state how long the condition has persisted.
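That grace-period behavior could look roughly like the sketch below. The class name, the 30-second grace period, and the message format are illustrative assumptions, not the PR's actual values; time is passed in explicitly so the logic is testable with a fake clock.

```cpp
#include <chrono>
#include <optional>
#include <string>

// Hypothetical sketch: stay quiet while an outage is within a grace
// period, then surface the original error with the outage duration.
class ClosureLogger {
public:
  using Clock = std::chrono::steady_clock;

  // Called on each stream closure. Returns a warn-level message only
  // once the outage has outlasted the grace period; otherwise the
  // caller would log at debug level only.
  std::optional<std::string> onClosed(const std::string& error, Clock::time_point now) {
    if (!first_failure_) {
      first_failure_ = now;
      first_error_ = error; // remember the original error
      return std::nullopt;
    }
    const auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(now - *first_failure_);
    if (elapsed < kGracePeriod) {
      return std::nullopt; // still within the allotted time
    }
    return first_error_ + " (failing for " + std::to_string(elapsed.count()) + "s)";
  }

  // A successful reconnect clears the outage.
  void onConnected() { first_failure_.reset(); }

private:
  static constexpr std::chrono::seconds kGracePeriod{30}; // illustrative
  std::optional<Clock::time_point> first_failure_;
  std::string first_error_;
};
```

Passing `now` as a parameter (rather than calling the clock internally) is the same dependency-injection pattern that makes the fake-time testing discussed below straightforward.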

@htuch
Member

htuch commented Jan 13, 2021

@mandarjog you want a fake time source for the dispatcher, see the MockDispatcher GlobalTimeSystem. You probably want to return a SimulatedTimeSystem or MockTimeSystem instead.

Signed-off-by: Mandar U Jog <mjog@google.com>
@mandarjog
Contributor Author

The precheck deps step succeeds locally for me: bazel query 'deps(//source/exe:envoy_main_common_with_core_extensions_lib)'.

@mandarjog
Contributor Author

I am thinking of adding an INFO message on successful connection that also states how long the stream was disconnected.
onReceiveMessage seems like a stronger signal that we are connected to the control plane than
async_client_->start(service_method_, *this, Http::AsyncClient::StreamOptions()) != nullptr. @PiotrSikora @htuch ?
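Treating the first received message as the real "connected" signal, and reporting the outage duration, could look roughly like this. The names here (`ConnectionTracker`, `onFirstMessage`) are hypothetical, not Envoy's actual API.

```cpp
#include <chrono>
#include <optional>
#include <string>

// Hypothetical sketch: emit an INFO-level message on the first message
// received after an outage, stating how long the stream was down.
class ConnectionTracker {
public:
  using Clock = std::chrono::steady_clock;

  // Repeated closures keep the timestamp of the original disconnect.
  void onStreamClosed(Clock::time_point now) {
    if (!disconnected_since_) {
      disconnected_since_ = now;
    }
  }

  // The first received message is a stronger "connected" signal than a
  // non-null stream handle from start(). Returns a message only if we
  // are recovering from an outage.
  std::optional<std::string> onFirstMessage(Clock::time_point now) {
    if (!disconnected_since_) {
      return std::nullopt; // never disconnected, nothing to report
    }
    const auto down = std::chrono::duration_cast<std::chrono::seconds>(now - *disconnected_since_);
    disconnected_since_.reset();
    return "xDS stream reconnected after " + std::to_string(down.count()) + "s";
  }

private:
  std::optional<Clock::time_point> disconnected_since_;
};
```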

@htuch
Member

htuch commented Jan 14, 2021

Sounds reasonable, this is a pretty important diagnostic that applies to many use cases. We can always drop log level if folks find it too spammy.

Base automatically changed from master to main January 15, 2021 23:02
@htuch htuch added the waiting label Feb 7, 2021
@github-actions

github-actions bot commented Mar 9, 2021

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 9, 2021
@github-actions

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!


Labels

stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce warning - grpc Config Stream closed: 14 noise

3 participants