Skip to content

grpc: Update tracing error tags for grpc status code#19603

Closed
bryanwux wants to merge 5 commits intoenvoyproxy:mainfrom
bryanwux:main
Closed

grpc: Update tracing error tags for grpc status code#19603
bryanwux wants to merge 5 commits intoenvoyproxy:mainfrom
bryanwux:main

Conversation

@bryanwux
Copy link
Copy Markdown
Contributor

@bryanwux bryanwux commented Jan 19, 2022

Signed-off-by: Jiayu Wu jiayu1.wu@intel.com

Commit Message: Change tracing error tag behaviour for grpc codes #18877
Additional Description:

Below is the proposed change for tracing error tag behaviour. A brief description about whether it is an error for each response code has also been listed.

OK | 0 | error=false 
CANCELLED | 1 |  error=false Not a server error. The operation is cancelled by the client.
UNKNOWN | 2 |  error=true   Server error. Status value received from unknown address space. 
INVALID_ARGUMENT | 3 | error=false  Not a server error. The client enters an invalid argument. 
DEADLINE_EXCEEDED | 4 | error=true  Server error. Response from the server has been delayed long, even if it may be successful.
NOT_FOUND | 5 | error=false   Not a server error. The requested file or directory is not found.
ALREADY_EXISTS | 6 | error=false   Not a server error. The file or directory the client attempts to create already exists.
PERMISSION_DENIED | 7 | error=false   Not a server error. The client doesn't have permission to execute the operation. 
RESOURCE_EXHAUSTED | 8 | error=false   Not a server error. Some resource has been exhausted like the file system is out of space.
FAILED_PRECONDITION | 9 | error=false   Not a server error. The requested operation is rejected because it doesn't meet preconditions. 
ABORTED | 10 | error=false   Not a server error. The operation is aborted, typically due to concurrency issues.
OUT_OF_RANGE | 11 | error=false   Not a server error. The operation was attempted to read past the valid range.
UNIMPLEMENTED | 12 | error=true   Server error. The service is not implemented or not enabled.
INTERNAL | 13 | error=true   Server error. The system is broken, typically reserved for serious cases. 
UNAVAILABLE | 14 | error=true   Server error. The service is currently not available.
DATA_LOSS | 15 | error=true   Server error. Unrecoverable data loss or corruption.
UNAUTHENTICATED | 16 | error=false   Not a server error. The request does not have valid credentials for the operation.

Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]: #18877
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

@repokitteh-read-only
Copy link
Copy Markdown

Hi @bryanwux, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #19603 was opened by bryanwux.

see: more, trace.

@rojkov
Copy link
Copy Markdown
Member

rojkov commented Jan 19, 2022

Could you please fix DCO and check formatting?

CONTRIBUTING.md contains some useful info about it.

@rojkov rojkov self-assigned this Jan 19, 2022
Copy link
Copy Markdown
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on it!

Copy link
Copy Markdown
Member

@tyxia tyxia Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it is better that we have a local variable stores the value upfront instead of calling value() repeatedly, both from performance and readability perspective. e.g., :
const auto& status = grpc_status_code.value();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though MaximumKnown = Unauthenticated in terms of enum value, MaximumKnown is used to represent the boundary of the grpc status code. for example you can do for (uint32_t i = 0; i <= Status::WellKnownGrpcStatus::MaximumKnown; ++i)

I think we should use Unauthenticated here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should include this as error code. InvalidCode means status code in gRPC headers was invalid that we can not use grpc_status_code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check grpc_status_code.value() != Grpc::Status::WellKnownGrpcStatus::InvalidCode) here rather than !ok

NOT_OK is redundant once you added those new checks in this PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Copy Markdown
Member

@tyxia tyxia Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your code, we need to be careful about the implicit conversion to bool especially when we are dealing with absl::optional. Detailed reading of this open source version of Google coding guide can be found here https://abseil.io/tips/141

I would suggest to change it to grpc_status_code.has_value().

Copy link
Copy Markdown
Contributor Author

@bryanwux bryanwux Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@tyxia
Copy link
Copy Markdown
Member

tyxia commented Jan 19, 2022

/assign @tyxia

@rojkov
Copy link
Copy Markdown
Member

rojkov commented Jan 24, 2022

/wait

Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
@bryanwux
Copy link
Copy Markdown
Contributor Author

Thank you for your @tyxia review. I have updated the code accordingly.

@bryanwux
Copy link
Copy Markdown
Contributor Author

Could you please fix DCO and check formatting?

CONTRIBUTING.md contains some useful info about it.

Done.

Copy link
Copy Markdown
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo one nit. Thanks!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, for consistency you may want to move stauts assignment up

if (grpc_status_code.has_value()) {
    const auto& status = grpc_status_code.value();
    if (status != Grpc::Status::WellKnownGrpcStatus::InvalidCode && 
        ( ... == unkown || ...== DeadlineExceed ||... == Unimplemented 
             ......)) {
        span.setTag(Tracing::Tags::get().Error, Tracing::Tags::get().True);
    }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line 91 seems to be redundant: status is checked for equality with Grpc::Status::WellKnownGrpcStatus::InvalidCode below. Can we drop the line?

Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
Copy link
Copy Markdown
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added one comment.

Please don't use git force-push. It makes reviewing new changes more difficult.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line 91 seems to be redundant: status is checked for equality with Grpc::Status::WellKnownGrpcStatus::InvalidCode below. Can we drop the line?

@rojkov
Copy link
Copy Markdown
Member

rojkov commented Feb 1, 2022

/wait

Signed-off-by: Jiayu Wu <jiayu1.wu@intel.com>
@bryanwux
Copy link
Copy Markdown
Contributor Author

bryanwux commented Feb 2, 2022

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #19603 (comment) was created by @bryanwux.

see: more, trace.

rojkov
rojkov previously approved these changes Feb 3, 2022
Copy link
Copy Markdown
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good to me.

@rojkov
Copy link
Copy Markdown
Member

rojkov commented Feb 3, 2022

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/senior-maintainers assignee is @alyssawilk

🐱

Caused by: a #19603 (comment) was created by @rojkov.

see: more, trace.

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good but two requests.
One can you edit your original table in the linked bug with a bit more detail as to why each error is considered a server error and non-error is a user error (or other) and two, can you add an entry to the docs for the current release (docs/root/version_history/current.rst) to cover the change in behavior? Thanks!
/wait

@alyssawilk
Copy link
Copy Markdown
Contributor

still waiting on the issue update I think
/wait

@bryanwux bryanwux closed this Feb 23, 2022
@bryanwux
Copy link
Copy Markdown
Contributor Author

bryanwux commented Feb 23, 2022

Misclicked...

@bryanwux bryanwux reopened this Feb 23, 2022
@bryanwux
Copy link
Copy Markdown
Contributor Author

bryanwux commented Feb 23, 2022

Code looks good but two requests. One can you edit your original table in the linked bug with a bit more detail as to why each error is considered a server error and non-error is a user error (or other) and two, can you add an entry to the docs for the current release (docs/root/version_history/current.rst) to cover the change in behavior? Thanks! /wait

Done. Apologies for the delay.

@alyssawilk
Copy link
Copy Markdown
Contributor

Pr is still closed - I think you have to reopen

@alyssawilk
Copy link
Copy Markdown
Contributor

ah, nm, I see you opened as a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants