fault injection: add support for setting gRPC status#10841
fault injection: add support for setting gRPC status#10841mattklein123 merged 10 commits intoenvoyproxy:masterfrom
Conversation
Bringing fork up to date
Sync to base
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
|
/lgtm api |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks generally looks great! Flushing out a few comments to get started. Thank you!
/wait
| uint32 http_status = 2 [(validate.rules).uint32 = {lt: 600 gte: 200}]; | ||
|
|
||
| // gRPC status code to use to abort the gRPC request. | ||
| uint32 grpc_status = 5 [(validate.rules).uint32 = {lte: 16}]; |
There was a problem hiding this comment.
I think we should drop the "lte" restriction here so that we can test unknown gRPC codes above those well defined. This is a really common source of bugs (including in envoy itself) so it will be nice to be able to test this. I might clarify in the docs that we allow testing any numeric code, even those that are not well defined. (docs need updating also)
| In order for the header to work, :ref:`header_abort | ||
| <envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set. | ||
|
|
||
| x-envoy-fault-abort-grpc-request |
There was a problem hiding this comment.
Should we have a % header here also? cc @Augustyniak
There was a problem hiding this comment.
Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage HTTP header is defined).
Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.
There was a problem hiding this comment.
Sorry I missed one thing when I posted my initial comment here.
Initially, I thought that GRPC aborts would be applied on top of http aborts meaning something like:
- Requests comes in
- Check whether to abort the request with http fault - look at
x-envoy-fault-abort-request-percentageandx-envoy-fault-abort-requestHTTP headers. - If no http fault was applied (because percentage was <100% and we decided not to apply HTTP fault) check whether grpc faults should be applied - look at
x-envoy-fault-abort-grpc-request-percentageandx-envoy-fault-abort-grpc-requestHTTP headers.
and for this reason I thought that we needed the addition of x-envoy-fault-abort-grpc-request-percentage HTTP header.
Now that I know that the implementation looks more like apply "either HTTP or GRPC abort" (only one of them) I don't see any benefits of us having x-envoy-fault-abort-grpc-request-percentage HTTP header. Don't get me wrong, as far as I can tell current implementation works correctly and it unlocks GRPC abort tests for mobile people at Lyft but the addition of x-envoy-fault-abort-grpc-request-percentage HTTP header itself doesn't add any value in here - we could continue to use x-envoy-fault-abort-request-percentage HTTP request header and functionality-wise we wouldn't lose anything. We could keep it but I think that removal makes a little more sense since it would keep the implementation simpler.
I really feel bad asking you to do this @vijayrajput1 but could we get rid of x-envoy-fault-abort-grpc-request-percentage HTTP header and use x-envoy-fault-abort-request-percentage for both HTTP and GRPC aborts? My deep apologizes here - I know that the addition of this header costed you a significant amount of work in the first place so it really feels bad to ask you to revert some of your changes.
There was a problem hiding this comment.
Thank you @Augustyniak , I actually had this (using one header) approach in mind earlier but then I kinda convinced myself having a matching header would make things more clear. Meaning use x-envoy-fault-abort-grpc-request-percentage for x-envoy-fault-abort-grpc-request vs use x-envoy-fault-abort-request-percentage for x-envoy-fault-abort-grpc-request. I should have raised this point earlier. I'm okay, with either of those approaches - my take is if having a separate header x-envoy-fault-abort-grpc-request-percentage isn't making things anymore clear than we should use only one header i.e x-envoy-fault-abort-request-percentage - @mattklein123 any preference ?
There was a problem hiding this comment.
Yeah agreed on using 1 header. Sorry, I didn't look very deeply at this before punting review over to @Augustyniak. Apologies also!
| // HTTP status code depending on the provider implementation. | ||
| virtual absl::optional<Http::Code> | ||
| statusCode(const Http::RequestHeaderMap* request_headers) const PURE; | ||
| virtual absl::optional<Http::Code> httpStatusCode(const Http::HeaderEntry* header) const PURE; |
There was a problem hiding this comment.
Why not continue to pass the full header map here and below? It seems like it would be more clear if the header was referenced closer to where it is used?
| In order for the header to work, :ref:`header_abort | ||
| <envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set. | ||
|
|
||
| x-envoy-fault-abort-grpc-request |
There was a problem hiding this comment.
Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage HTTP header is defined).
Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.
|
|
||
| // If gRPC status code is set, then HTTP will be set to 200 | ||
| return grpc_status.has_value() ? std::make_pair(Http::Code::OK, grpc_status) | ||
| : std::make_pair(abortHttpStatus(request_headers), grpc_status); |
There was a problem hiding this comment.
Can we pass absl::nullptr instead of grpc_status here to make it clearer that we don't return GRPC status code when we return an HTTP status code from abortHttpStatus(request_headers) ?
| auto grpc_status = fault_settings_->requestAbort()->grpcStatusCode( | ||
| request_headers.get(Filters::Common::Fault::HeaderNames::get().AbortGrpcRequest)); | ||
|
|
||
| auto default_grpc_status_code = grpc_status.has_value() |
There was a problem hiding this comment.
shouldn't we return an absl::nullptr from here if grpc_status doesn't have a value? We do a similar thing in the implementation of abortStatus method.
There was a problem hiding this comment.
intention here is to support setting gRPC status even through just runtime config. For example, even if abort fault filter configuration and header doesn't have grpc_status but if the abort_grpc_status_runtime is set we still want to set the grpc_status. Does this make sense ?
There was a problem hiding this comment.
The use case you want to support definitely makes sense. 👍
I'm just a little bit worried that GRPC aborts will end up using proto config in a way that's different from the way other faults use their protos. "Fixed" fault providers - including HTTP status code, response rate limit and delay request - providers, all require proto config values to be present. For the user of the Envoy, it may be a little bit confusing if for GRPC aborts we follow a different convention.
That being said, if runtime-only config is the thing you want to support I'm going to leave it up to more Envoy experienced folks to decide whether it's ok or no to have this fault provider not require full proto config (no GRPC status code required in proto config).
//EDIT
It may fill a little bit off to have a GRPC abort percentage defined but no GRPC abort status code defined in proto config even if fixed GRPC aborts are used.
There was a problem hiding this comment.
I agree with @Augustyniak that we should be consistent here. I'm going to defer to @Augustyniak for initial review of this PR and I will do a pass after. :)
and address review feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Augustyniak
left a comment
There was a problem hiding this comment.
Thank you for addressing previous comments! It looks good but I left a few questions regarding cases for when we have http and grpc abort faults defined for a given request.
| envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage( | ||
| const Http::RequestHeaderMap* request_headers) const { | ||
| // if the abort fault is for grpc status, then use grpc fault request percentage. | ||
| if (request_headers != nullptr && |
There was a problem hiding this comment.
we don't need to check for whether request_headers are nullptr here - it should never happen. Per my other discussion with @mattklein123 here I believe that we don't want to do this kind of defensive programming if we know that a given condition (existing request_headers in this case) is always fulfilled.
|
|
||
| envoy::type::v3::FractionalPercent | ||
| HeaderPercentageProvider::percentage(const Http::RequestHeaderMap* request_headers) const { | ||
| if (request_headers == nullptr) { |
There was a problem hiding this comment.
Similarly to my other comment - request_headers should be always available here (!= nullptr) and for this reason we don't need to have this check here. I had a discussion about this with Matt here #10724 (comment).
| const Http::RequestHeaderMap* request_headers) const { | ||
| absl::optional<Http::Code> ret; | ||
| const auto header = request_headers->get(HeaderNames::get().AbortRequest); | ||
| if (request_headers == nullptr) { |
There was a problem hiding this comment.
Same as above - no need to check for nullptr here.
|
|
||
| absl::optional<Grpc::Status::GrpcStatus> FaultAbortConfig::HeaderAbortProvider::grpcStatusCode( | ||
| const Http::RequestHeaderMap* request_headers) const { | ||
| if (request_headers == nullptr) { |
There was a problem hiding this comment.
No need to have this check here - the same reasoning as in one of the above comments.
| return static_cast<Grpc::Status::GrpcStatus>(code); | ||
| } | ||
|
|
||
| envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage( |
There was a problem hiding this comment.
I'm not sure whether with this implementation abort faults always work the way we want them to work. Let me give you an example that I was able to come up with and you may let me know whether my concerns are valid (in this case we should probably fix the issue) or whether I'm missing something here.
Example:
Let's say that I send a request with the following HTTP headers:
x-envoy-fault-abort-grpc-request-percentage:20x-envoy-fault-abort-request:500
and a proto config telling Envoy to look for faults in HTTP headers.
Behavior:
There is 20% change that we apply 500 HTTP status code abort fault.
Expected behavior:
Per our documentation, in this case we should ingest HTTP status code abort fault in % of requests that's controlled by percentage field from our config file.
There was a problem hiding this comment.
I'm not sure whether you will need to do this but you may need to create separate fault providers for HTTP and GRPC aborts.
There was a problem hiding this comment.
In the current implementation, x-envoy-fault-abort-grpc-request-percentage value will be used only when x-envoy-fault-abort-grpc-request header is passed in. For the above example, we will end up using the percentage value from config file as mentioned in the documentation.
Here is the control logic for it, let me know if the comment and code around it is not clear, I can reword it. I'll add a unit test to capture this scenario.
// if the abort fault is for grpc status, then use grpc fault request percentage.
There was a problem hiding this comment.
I see 👍 Thanks for pointing this out - your are right, it works just fine as it is now. The only thing that I could be a little bit picky about here is the fact that we read the value x-envoy-fault-abort-grpc-request HTTP header twice and apparently (see this convo for more details) reading HTTP headers is not cheap performance wise.
Maybe we could read x-envoy-fault-abort-request and x-envoy-fault-grpc-request headers first and pass it to percentage method for it to determine which percentage should be used? Better yet, if we read x-envoy-fault-abort-request first and it's present we don't even need to check the value of x-envoy-fault-abort-grpc-request
There was a problem hiding this comment.
In this particular case it's not a blocker tho - just something to think about (maybe we can address it if it's easily fixable, if not I think that we can keep it the way it is)
There was a problem hiding this comment.
Thanks @Augustyniak, I couldn’t find a clean and easy way to avoid this check, so leaving it as is, but I’ve changed the precedence of http_status and grpc_status per recommendation.
| FaultAbortConfig config(proto_config); | ||
|
|
||
| // No header. | ||
| EXPECT_EQ(absl::nullopt, config.httpStatusCode(nullptr)); |
There was a problem hiding this comment.
This cannot happen in prod. See my previous comments related to this case.
| auto grpc_status = abortGrpcStatus(request_headers); | ||
|
|
||
| // If gRPC status code is set, then HTTP will be set to Http::Code::OK (200). | ||
| return grpc_status.has_value() |
There was a problem hiding this comment.
The way it's implemented now we prioritize grpc faults over http faults. Not sure whether it's a requirement for your implementation but to me http faults should take a precedence over grpc faults since grpc faults are kind of build on top of http requests -> so say it differently, my intuition tell me that grpc faults can happen only if http request succeeds.
Let's say that I send
x-envoy-fault-abort-request: 500 and x-envoy-fault-abort-grpc-request: 14 HTTP headers as part of my request. In this case, we end up with a request that finishes with a 200 HTTP status code and 14 grpc status code. I wouldn't not expect to receive 200 HTTP status code if I send x-envoy-fault-abort-request: 500 HTTP header.
What would you say to the idea of changing the order here? Check for the HTTP status code abort header first and look into grpc status code abort header only if the http status code abort header is not present.
Going back to my example, if we change the order and send x-envoy-fault-abort-request: 500 and x-envoy-fault-abort-grpc-request: 14 HTTP headers. We end up with a request that fails with 500 HTTP status code, it feels OK that we didn't receive 14 grpc status code since the underlying HTTP request failed.
There was a problem hiding this comment.
I've changed the precedence order favoring http_status over grpc_status headers. And more importantly, updated the documentation on what to expect during such conflicts.
| Http::TestRequestHeaderMapImpl good_headers{{"x-envoy-fault-abort-grpc-request", "5"}}; | ||
| EXPECT_EQ(Grpc::Status::NotFound, config.grpcStatusCode(&good_headers)); | ||
|
|
||
| // Valid header - unknown gRPC code (>0). |
There was a problem hiding this comment.
this comment doesn't seem right
Thank you @Augustyniak for the good feedback - appreciated! 🙏 Regarding, http_status vs grpc_status precedence when both headers are present - we can surely reverse the order to use grpc_status value only when http_status is not present and/or we can explicitly call out in the documentation to use only one of those header to make it similar to how its enforced in proto's via |
1. If both x-envoy-fault-abort-grpc-request and x-envoy-fault-abort-request headers are set, prefer x-envoy-fault-abort-request. 2. Remove nullptr checks per feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Augustyniak
left a comment
There was a problem hiding this comment.
Left one comment, other than this I don't have any other comments in this PR. It looks great! Thank you for being so easy to work with. I'm going to approve once you respond to/address the comment I left.
Signed-off-by: Vijay Rajput <vrajput@lyft.com>
Augustyniak
left a comment
There was a problem hiding this comment.
Looks good to me now! One tiny style related comment.
Thank you for your active collaboration here - I think that it helped to make this PR better.
| const Http::RequestHeaderMap* request_headers) const { | ||
| absl::optional<Http::Code> ret; | ||
| const auto header = request_headers->get(HeaderNames::get().AbortRequest); | ||
| absl::optional<Http::Code> ret = absl::nullopt; |
There was a problem hiding this comment.
I don't think that we need = absl::nullopt part here - I think that we should get rid of it if it's not needed.
There was a problem hiding this comment.
Yep its not required, I put that in there to make it apparent. I had to read absl::optional default details to find out this, so I thought making it apparent might help readability. I can remove it if its something basic.
There was a problem hiding this comment.
It's fine either way. I wouldn't worry about it unless we need another push.
Signed-off-by: Spencer Lewis <slewis@squareup.com> * master: fault injection: add support for setting gRPC status (envoyproxy#10841) tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940) Fix typo on Postgres Proxy documentation. (envoyproxy#10930) fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931) gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508) http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941) stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735) hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929) prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833) status: Fix ASAN error in Status payload handling (envoyproxy#10906) path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922) compressor filter: add benchmark (envoyproxy#10464) xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934) ci: update before purge in cleanup (envoyproxy#10938) tracer: Improve test coverage for x-ray (envoyproxy#10890) Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)
Description: Adds an option to specify gRPC status via config and header, to be used in fault injection response. More details #10323
Risk Level: Low (new feature)
Testing: Added unit and integration tests
Docs Changes: Updated
Release Notes: Updated
Fixes #10323
cc @wgallagher @mattklein123