update tracing error tag for grpc status codes #20090

Merged
lizan merged 14 commits into envoyproxy:main from bryanwux:main
Mar 10, 2022

bryanwux (Contributor) commented Feb 23, 2022

Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>

Commit Message: Change tracing error tag behaviour for grpc codes #18877
Additional Description:

Below is the proposed change to the tracing error tag behaviour, with a brief note on whether each gRPC status code should be flagged as an error.

| Status | Code | Error tag | Rationale |
|---|---|---|---|
| OK | 0 | error=false | |
| CANCELLED | 1 | error=false | Not a server error. The operation was cancelled by the client. |
| UNKNOWN | 2 | error=true | Server error. For example, a status value received from another address space belongs to an error space that is not known in this one. |
| INVALID_ARGUMENT | 3 | error=false | Not a server error. The client specified an invalid argument. |
| DEADLINE_EXCEEDED | 4 | error=true | Server error. The deadline expired before the response arrived, even though the operation may have succeeded. |
| NOT_FOUND | 5 | error=false | Not a server error. The requested entity (e.g., a file or directory) was not found. |
| ALREADY_EXISTS | 6 | error=false | Not a server error. The entity (e.g., a file or directory) the client attempted to create already exists. |
| PERMISSION_DENIED | 7 | error=false | Not a server error. The client doesn't have permission to execute the operation. |
| RESOURCE_EXHAUSTED | 8 | error=false | Not a server error. Some resource has been exhausted, e.g., the file system is out of space. |
| FAILED_PRECONDITION | 9 | error=false | Not a server error. The requested operation was rejected because its preconditions were not met. |
| ABORTED | 10 | error=false | Not a server error. The operation was aborted, typically due to concurrency issues. |
| OUT_OF_RANGE | 11 | error=false | Not a server error. The operation was attempted past the valid range, e.g., reading past the end of a file. |
| UNIMPLEMENTED | 12 | error=true | Server error. The operation is not implemented or not enabled in this service. |
| INTERNAL | 13 | error=true | Server error. An invariant expected by the underlying system has been broken; reserved for serious errors. |
| UNAVAILABLE | 14 | error=true | Server error. The service is currently unavailable. |
| DATA_LOSS | 15 | error=true | Server error. Unrecoverable data loss or corruption. |
| UNAUTHENTICATED | 16 | error=false | Not a server error. The request does not have valid credentials for the operation. |
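
For illustration only, the proposed mapping can be written down as a small lookup table. This is a sketch of the proposal as described above, not the code in `http_tracer_impl.cc`: the names `kIsServerError` and `tagsError` are hypothetical, and treating codes outside 0..16 as errors is an assumption, since the description does not say how such values should be handled.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical encoding of the proposal: index = gRPC status code (0..16),
// value = whether the span would be tagged error=true.
constexpr std::array<bool, 17> kIsServerError = {{
    false, // 0  OK
    false, // 1  CANCELLED
    true,  // 2  UNKNOWN
    false, // 3  INVALID_ARGUMENT
    true,  // 4  DEADLINE_EXCEEDED
    false, // 5  NOT_FOUND
    false, // 6  ALREADY_EXISTS
    false, // 7  PERMISSION_DENIED
    false, // 8  RESOURCE_EXHAUSTED
    false, // 9  FAILED_PRECONDITION
    false, // 10 ABORTED
    false, // 11 OUT_OF_RANGE
    true,  // 12 UNIMPLEMENTED
    true,  // 13 INTERNAL
    true,  // 14 UNAVAILABLE
    true,  // 15 DATA_LOSS
    false, // 16 UNAUTHENTICATED
}};

// Returns true when a response with this status code would get error=true.
// Assumption: codes outside the known range are treated as errors.
constexpr bool tagsError(std::uint32_t code) {
  return code >= kIsServerError.size() ? true : kIsServerError[code];
}

static_assert(!tagsError(0), "OK is not an error");
static_assert(tagsError(14), "UNAVAILABLE is a server error");
static_assert(!tagsError(16), "UNAUTHENTICATED is a client-side failure");
```

Under this sketch, `tagsError(7)` (PERMISSION_DENIED) is false, whereas the behaviour before this change tagged every non-OK status as an error.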

Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]: #18877
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
@repokitteh-read-only

Hi @bryanwux, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

span.setTag(Tracing::Tags::get().Error, Tracing::Tags::get().True);
if (grpc_status_code.has_value()) {
  const auto& status = grpc_status_code.value();
  if (status != Grpc::Status::WellKnownGrpcStatus::InvalidCode &&

Do these cover all the error statuses?

Contributor Author

Yes, a detailed description of each status code can be found here: https://grpc.github.io/grpc/core/md_doc_statuscodes.html

Contributor

What do you think of converting to a switch, so if gRPC adds new error codes and we pick up on import, we compile fail?

Contributor

Also can we link https://grpc.github.io/grpc/core/md_doc_statuscodes.html here in the comment as well?

Contributor Author

> What do you think of converting to a switch, so if gRPC adds new error codes and we pick up on import, we compile fail?

Yes, I agree a switch makes more sense in this case.

> Also can we link https://grpc.github.io/grpc/core/md_doc_statuscodes.html here in the comment as well?

Done

Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
rojkov (Member) commented Feb 28, 2022

Looks like this is a continuation of #19603, which was mistakenly closed.

/assign @alyssawilk

alyssawilk (Contributor) left a comment

Sorry for the delay in review - I was out and it got misplaced when reopened
/wait

absl::optional<Grpc::Status::GrpcStatus> grpc_status_code = Grpc::Common::getGrpcStatus(headers);
if (grpc_status_code && grpc_status_code.value() != Grpc::Status::WellKnownGrpcStatus::Ok) {
  span.setTag(Tracing::Tags::get().Error, Tracing::Tags::get().True);
  if (grpc_status_code.has_value()) {

Contributor

I think we're going to want to runtime guard this change, as described in CONTRIBUTING.md.

Let's also add a comment where the tracing error code is defined to make it clearer what error means in this case (an upstream or Envoy error, not a client error).

bryanwux (Contributor Author) commented Mar 1, 2022

Done.
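
As a rough sketch of what guarding this change behind a flag could look like: the legacy branch mirrors the pre-change check quoted above (any non-OK status is tagged as an error), and the guarded branch applies the narrower mapping from the PR description. `Flags`, `narrow_grpc_error_tag`, and the function names are hypothetical placeholders; Envoy's actual runtime feature mechanism is the one described in CONTRIBUTING.md and is not reproduced here. Handling of codes above 16 is also an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for a runtime feature flag lookup.
struct Flags {
  bool narrow_grpc_error_tag;
};

// Legacy behaviour: every non-OK gRPC status is tagged as an error.
inline bool legacyTagsError(std::uint32_t code) { return code != 0; }

// Proposed behaviour: only server-side failures are tagged as errors.
inline bool narrowedTagsError(std::uint32_t code) {
  switch (code) {
  case 2:  // UNKNOWN
  case 4:  // DEADLINE_EXCEEDED
  case 12: // UNIMPLEMENTED
  case 13: // INTERNAL
  case 14: // UNAVAILABLE
  case 15: // DATA_LOSS
    return true;
  default:
    // Assumption: values beyond the 17 known codes count as errors.
    return code > 16;
  }
}

// Selects the behaviour according to the (hypothetical) guard.
inline bool tagsError(std::uint32_t code, const Flags& flags) {
  return flags.narrow_grpc_error_tag ? narrowedTagsError(code) : legacyTagsError(code);
}
```

With the guard off, `tagsError(7, Flags{false})` keeps the old result (error); with it on, PERMISSION_DENIED is no longer tagged. This sketch switches on raw integers and therefore needs a `default`; it only illustrates the guard, not the exhaustive-switch concern raised elsewhere in the review.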

bryanwux (Contributor Author) commented Mar 1, 2022

/retest

alyssawilk (Contributor) left a comment

Looking better and better!

case Grpc::Status::WellKnownGrpcStatus::OutOfRange:
case Grpc::Status::WellKnownGrpcStatus::Unauthenticated:
  break;
default:

Contributor

If you have a default, we still won't catch new error codes when they are added. Can we avoid this?
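
The concern here can be shown with a minimal, self-contained sketch. The `Code` enum and `isServerError` are hypothetical stand-ins, not Envoy's `WellKnownGrpcStatus` or the code in this PR. When a `switch` over an enum has no `default` label, GCC and Clang emit a `-Wswitch` warning for any enumerator that lacks a `case`, and with `-Werror` that becomes a build failure; adding a `default` label suppresses the warning.

```cpp
// Hypothetical enum mirroring the 17 gRPC status codes.
enum class Code {
  Ok, Canceled, Unknown, InvalidArgument, DeadlineExceeded, NotFound,
  AlreadyExists, PermissionDenied, ResourceExhausted, FailedPrecondition,
  Aborted, OutOfRange, Unimplemented, Internal, Unavailable, DataLoss,
  Unauthenticated
};

// No `default:` label on purpose: if a new enumerator is added to Code,
// -Wswitch flags this switch as missing a case.
inline bool isServerError(Code code) {
  switch (code) {
  case Code::Ok:
  case Code::Canceled:
  case Code::InvalidArgument:
  case Code::NotFound:
  case Code::AlreadyExists:
  case Code::PermissionDenied:
  case Code::ResourceExhausted:
  case Code::FailedPrecondition:
  case Code::Aborted:
  case Code::OutOfRange:
  case Code::Unauthenticated:
    return false;
  case Code::Unknown:
  case Code::DeadlineExceeded:
  case Code::Unimplemented:
  case Code::Internal:
  case Code::Unavailable:
  case Code::DataLoss:
    return true;
  }
  // Reached only for values that are not named enumerators; treating them
  // as errors is an assumption made for this sketch.
  return true;
}
```

The trailing `return` after the switch keeps the function well-defined for out-of-range values without reintroducing a `default` label.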

Comment thread source/common/tracing/http_tracer_impl.cc
bryanwux added 5 commits March 2, 2022 16:00
Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
Signed-off-by: bryanwux <jiayu1.wu@intel.com>
@bryanwux bryanwux requested review from lizan and snowp as code owners March 4, 2022 08:30
@bryanwux bryanwux closed this Mar 4, 2022
@bryanwux bryanwux reopened this Mar 4, 2022
bryanwux added 2 commits March 4, 2022 16:54
Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
Signed-off-by: bryanwux <jiayu1.wu@intel.com>
bryanwux (Contributor Author) commented Mar 4, 2022

Sorry @lizan @snowp, the review request was made by mistake; please ignore.

alyssawilk (Contributor) left a comment

Looking good! Sorry for the review delay; I was out Friday.
Just two more questions, so I'll add in a second reviewer as well
/wait

Comment thread source/common/tracing/http_tracer_impl.cc
case Grpc::Status::WellKnownGrpcStatus::AlreadyExists:
case Grpc::Status::WellKnownGrpcStatus::PermissionDenied:
case Grpc::Status::WellKnownGrpcStatus::FailedPrecondition:
case Grpc::Status::WellKnownGrpcStatus::Aborted:

Contributor

Hm, CANCELLED says "usually by the caller" but ABORTED says it's a concurrency issue. Would that be a client-side or server-side error, do you think?

Contributor Author

ABORTED is used for concurrency issues, which means the client should restart a read-modify-write sequence, so I think this would be a client-side error.

alyssawilk (Contributor) commented

@lizan I think this is ready for your pass (only 2 nits left on my end)

@lizan lizan merged commit cc6b501 into envoyproxy:main Mar 10, 2022
JuniorHsu pushed a commit to JuniorHsu/envoy that referenced this pull request Mar 17, 2022
Fixes envoyproxy#18877

Signed-off-by: bryanwux <jiayu1.wu@intel.com>
Signed-off-by: kuochunghsu <kuochunghsu@pinterest.com>