fault injection: add support for setting gRPC status by vijayrajput1 · Pull Request #10841 · envoyproxy/envoy

vijayrajput1 · 2020-04-19T06:14:30Z

Description: Adds an option to specify gRPC status via config and header, to be used in fault injection response. More details #10323

Risk Level: Low (new feature)
Testing: Added unit and integration tests
Docs Changes: Updated
Release Notes: Updated
Fixes #10323

cc @wgallagher @mattklein123

Bringing fork up to date

Sync to base

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

repokitteh-read-only · 2020-04-19T06:14:35Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10841 was opened by vijayrajput1.

see: more, trace.

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

htuch · 2020-04-20T01:29:29Z

/lgtm api

mattklein123

Thanks generally looks great! Flushing out a few comments to get started. Thank you!

/wait

mattklein123 · 2020-04-20T22:25:27Z

api/envoy/extensions/filters/http/fault/v3/fault.proto

    uint32 http_status = 2 [(validate.rules).uint32 = {lt: 600 gte: 200}];

+    // gRPC status code to use to abort the gRPC request.
+    uint32 grpc_status = 5 [(validate.rules).uint32 = {lte: 16}];


I think we should drop the "lte" restriction here so that we can test unknown gRPC codes above those well defined. This is a really common source of bugs (including in envoy itself) so it will be nice to be able to test this. I might clarify in the docs that we allow testing any numeric code, even those that are not well defined. (docs need updating also)

mattklein123 · 2020-04-20T22:26:23Z

docs/root/configuration/http/http_filters/fault_filter.rst

  In order for the header to work, :ref:`header_abort
  <envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set.

+x-envoy-fault-abort-grpc-request


Should we have a % header here also? cc @Augustyniak

Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage HTTP header is defined).

Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.

Sorry I missed one thing when I posted my initial comment here.

Initially, I thought that GRPC aborts would be applied on top of http aborts meaning something like:

Requests comes in

Check whether to abort the request with http fault - look at x-envoy-fault-abort-request-percentage and x-envoy-fault-abort-request HTTP headers.

If no http fault was applied (because percentage was <100% and we decided not to apply HTTP fault) check whether grpc faults should be applied - look at x-envoy-fault-abort-grpc-request-percentage and x-envoy-fault-abort-grpc-request HTTP headers.

and for this reason I thought that we needed the addition of x-envoy-fault-abort-grpc-request-percentage HTTP header.

Now that I know that the implementation looks more like apply "either HTTP or GRPC abort" (only one of them) I don't see any benefits of us having x-envoy-fault-abort-grpc-request-percentage HTTP header. Don't get me wrong, as far as I can tell current implementation works correctly and it unlocks GRPC abort tests for mobile people at Lyft but the addition of x-envoy-fault-abort-grpc-request-percentage HTTP header itself doesn't add any value in here - we could continue to use x-envoy-fault-abort-request-percentage HTTP request header and functionality-wise we wouldn't lose anything. We could keep it but I think that removal makes a little more sense since it would keep the implementation simpler.

I really feel bad asking you to do this @vijayrajput1 but could we get rid of x-envoy-fault-abort-grpc-request-percentage HTTP header and use x-envoy-fault-abort-request-percentage for both HTTP and GRPC aborts? My deep apologizes here - I know that the addition of this header costed you a significant amount of work in the first place so it really feels bad to ask you to revert some of your changes.

Thank you @Augustyniak , I actually had this (using one header) approach in mind earlier but then I kinda convinced myself having a matching header would make things more clear. Meaning use x-envoy-fault-abort-grpc-request-percentage for x-envoy-fault-abort-grpc-request vs use x-envoy-fault-abort-request-percentage for x-envoy-fault-abort-grpc-request. I should have raised this point earlier. I'm okay, with either of those approaches - my take is if having a separate header x-envoy-fault-abort-grpc-request-percentage isn't making things anymore clear than we should use only one header i.e x-envoy-fault-abort-request-percentage - @mattklein123 any preference ?

Yeah agreed on using 1 header. Sorry, I didn't look very deeply at this before punting review over to @Augustyniak. Apologies also!

mattklein123 · 2020-04-20T23:01:46Z

source/extensions/filters/common/fault/fault_config.h

    // HTTP status code depending on the provider implementation.
-    virtual absl::optional<Http::Code>
-    statusCode(const Http::RequestHeaderMap* request_headers) const PURE;
+    virtual absl::optional<Http::Code> httpStatusCode(const Http::HeaderEntry* header) const PURE;


Why not continue to pass the full header map here and below? It seems like it would be more clear if the header was referenced closer to where it is used?

Augustyniak · 2020-04-21T19:11:20Z

docs/root/configuration/http/http_filters/fault_filter.rst

  In order for the header to work, :ref:`header_abort
  <envoy_api_field_config.filter.http.fault.v2.FaultAbort.header_abort>` needs to be set.

+x-envoy-fault-abort-grpc-request


Yes, I think that we should add a support for x-envoy-fault-abort-grpc-request-percentage HTTP header (whose implementation would be based of how x-envoy-fault-abort-request-percentage HTTP header is defined).

Together with @buildbreaker @murki we discussed potential scenarios where having x-envoy-fault-abort-grpc-request-percentage HTTP header would be helpful to test client-side implementation of some of the GRPC requests that Lyft mobile applications perform. This lets me believe that the addition of such header will be useful for engineers from other companies too.

Augustyniak · 2020-04-21T19:16:18Z

source/extensions/filters/http/fault/fault_filter.cc

+
+  // If gRPC status code is set, then HTTP will be set to 200
+  return grpc_status.has_value() ? std::make_pair(Http::Code::OK, grpc_status)
+                                 : std::make_pair(abortHttpStatus(request_headers), grpc_status);


Can we pass absl::nullptr instead of grpc_status here to make it clearer that we don't return GRPC status code when we return an HTTP status code from abortHttpStatus(request_headers) ?

Augustyniak · 2020-04-21T19:21:59Z

source/extensions/filters/http/fault/fault_filter.cc

+  auto grpc_status = fault_settings_->requestAbort()->grpcStatusCode(
+      request_headers.get(Filters::Common::Fault::HeaderNames::get().AbortGrpcRequest));
+
+  auto default_grpc_status_code = grpc_status.has_value()


shouldn't we return an absl::nullptr from here if grpc_status doesn't have a value? We do a similar thing in the implementation of abortStatus method.

intention here is to support setting gRPC status even through just runtime config. For example, even if abort fault filter configuration and header doesn't have grpc_status but if the abort_grpc_status_runtime is set we still want to set the grpc_status. Does this make sense ?

The use case you want to support definitely makes sense. 👍

I'm just a little bit worried that GRPC aborts will end up using proto config in a way that's different from the way other faults use their protos. "Fixed" fault providers - including HTTP status code, response rate limit and delay request - providers, all require proto config values to be present. For the user of the Envoy, it may be a little bit confusing if for GRPC aborts we follow a different convention.

That being said, if runtime-only config is the thing you want to support I'm going to leave it up to more Envoy experienced folks to decide whether it's ok or no to have this fault provider not require full proto config (no GRPC status code required in proto config).

//EDIT
It may fill a little bit off to have a GRPC abort percentage defined but no GRPC abort status code defined in proto config even if fixed GRPC aborts are used.

I agree with @Augustyniak that we should be consistent here. I'm going to defer to @Augustyniak for initial review of this PR and I will do a pass after. :)

and address review feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

Augustyniak

Thank you for addressing previous comments! It looks good but I left a few questions regarding cases for when we have http and grpc abort faults defined for a given request.

Augustyniak · 2020-04-23T20:56:33Z

source/extensions/filters/common/fault/fault_config.cc

+envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage(
+    const Http::RequestHeaderMap* request_headers) const {
+  // if the abort fault is for grpc status, then use grpc fault request percentage.
+  if (request_headers != nullptr &&


we don't need to check for whether request_headers are nullptr here - it should never happen. Per my other discussion with @mattklein123 here I believe that we don't want to do this kind of defensive programming if we know that a given condition (existing request_headers in this case) is always fulfilled.

Augustyniak · 2020-04-23T20:58:33Z

source/extensions/filters/common/fault/fault_config.cc


 envoy::type::v3::FractionalPercent
 HeaderPercentageProvider::percentage(const Http::RequestHeaderMap* request_headers) const {
+  if (request_headers == nullptr) {


Similarly to my other comment - request_headers should be always available here (!= nullptr) and for this reason we don't need to have this check here. I had a discussion about this with Matt here #10724 (comment).

Augustyniak · 2020-04-23T20:58:57Z

source/extensions/filters/common/fault/fault_config.cc

    const Http::RequestHeaderMap* request_headers) const {
-  absl::optional<Http::Code> ret;
-  const auto header = request_headers->get(HeaderNames::get().AbortRequest);
+  if (request_headers == nullptr) {


Same as above - no need to check for nullptr here.

Augustyniak · 2020-04-23T20:59:24Z

source/extensions/filters/common/fault/fault_config.cc


+absl::optional<Grpc::Status::GrpcStatus> FaultAbortConfig::HeaderAbortProvider::grpcStatusCode(
+    const Http::RequestHeaderMap* request_headers) const {
+  if (request_headers == nullptr) {


No need to have this check here - the same reasoning as in one of the above comments.

Augustyniak · 2020-04-23T21:08:34Z

source/extensions/filters/common/fault/fault_config.cc

+  return static_cast<Grpc::Status::GrpcStatus>(code);
+}
+
+envoy::type::v3::FractionalPercent FaultAbortConfig::HeaderAbortProvider::percentage(


I'm not sure whether with this implementation abort faults always work the way we want them to work. Let me give you an example that I was able to come up with and you may let me know whether my concerns are valid (in this case we should probably fix the issue) or whether I'm missing something here.

Example:

Let's say that I send a request with the following HTTP headers:

x-envoy-fault-abort-grpc-request-percentage: 20

x-envoy-fault-abort-request: 500
and a proto config telling Envoy to look for faults in HTTP headers.

Behavior:
There is 20% change that we apply 500 HTTP status code abort fault.

Expected behavior:
Per our documentation, in this case we should ingest HTTP status code abort fault in % of requests that's controlled by percentage field from our config file.

I'm not sure whether you will need to do this but you may need to create separate fault providers for HTTP and GRPC aborts.

In the current implementation, x-envoy-fault-abort-grpc-request-percentage value will be used only when x-envoy-fault-abort-grpc-request header is passed in. For the above example, we will end up using the percentage value from config file as mentioned in the documentation.

Here is the control logic for it, let me know if the comment and code around it is not clear, I can reword it. I'll add a unit test to capture this scenario.

// if the abort fault is for grpc status, then use grpc fault request percentage.

envoy/source/extensions/filters/common/fault/fault_config.cc

Line 103 in 4576e19

// if the abort fault is for grpc status, then use grpc fault request percentage.

I see 👍 Thanks for pointing this out - your are right, it works just fine as it is now. The only thing that I could be a little bit picky about here is the fact that we read the value x-envoy-fault-abort-grpc-request HTTP header twice and apparently (see this convo for more details) reading HTTP headers is not cheap performance wise.

Maybe we could read x-envoy-fault-abort-request and x-envoy-fault-grpc-request headers first and pass it to percentage method for it to determine which percentage should be used? Better yet, if we read x-envoy-fault-abort-request first and it's present we don't even need to check the value of x-envoy-fault-abort-grpc-request

In this particular case it's not a blocker tho - just something to think about (maybe we can address it if it's easily fixable, if not I think that we can keep it the way it is)

Thanks @Augustyniak, I couldn’t find a clean and easy way to avoid this check, so leaving it as is, but I’ve changed the precedence of http_status and grpc_status per recommendation.

Augustyniak · 2020-04-23T21:12:20Z

test/extensions/filters/common/fault/fault_config_test.cc

  FaultAbortConfig config(proto_config);

+  // No header.
+  EXPECT_EQ(absl::nullopt, config.httpStatusCode(nullptr));


This cannot happen in prod. See my previous comments related to this case.

Augustyniak · 2020-04-23T21:20:37Z

source/extensions/filters/http/fault/fault_filter.cc

+  auto grpc_status = abortGrpcStatus(request_headers);
+
+  // If gRPC status code is set, then HTTP will be set to Http::Code::OK (200).
+  return grpc_status.has_value()


The way it's implemented now we prioritize grpc faults over http faults. Not sure whether it's a requirement for your implementation but to me http faults should take a precedence over grpc faults since grpc faults are kind of build on top of http requests -> so say it differently, my intuition tell me that grpc faults can happen only if http request succeeds.

Let's say that I send
x-envoy-fault-abort-request: 500 and x-envoy-fault-abort-grpc-request: 14 HTTP headers as part of my request. In this case, we end up with a request that finishes with a 200 HTTP status code and 14 grpc status code. I wouldn't not expect to receive 200 HTTP status code if I send x-envoy-fault-abort-request: 500 HTTP header.

What would you say to the idea of changing the order here? Check for the HTTP status code abort header first and look into grpc status code abort header only if the http status code abort header is not present.

Going back to my example, if we change the order and send x-envoy-fault-abort-request: 500 and x-envoy-fault-abort-grpc-request: 14 HTTP headers. We end up with a request that fails with 500 HTTP status code, it feels OK that we didn't receive 14 grpc status code since the underlying HTTP request failed.

I've changed the precedence order favoring http_status over grpc_status headers. And more importantly, updated the documentation on what to expect during such conflicts.

Augustyniak · 2020-04-23T21:23:35Z

test/extensions/filters/common/fault/fault_config_test.cc

+  Http::TestRequestHeaderMapImpl good_headers{{"x-envoy-fault-abort-grpc-request", "5"}};
+  EXPECT_EQ(Grpc::Status::NotFound, config.grpcStatusCode(&good_headers));
+
+  // Valid header - unknown gRPC code (>0).


this comment doesn't seem right

vijayrajput1 · 2020-04-23T23:10:44Z

Thank you for addressing previous comments! It looks good but I left a few questions regarding cases for when we have http and grpc abort faults defined for a given request.

Thank you @Augustyniak for the good feedback - appreciated! 🙏

Regarding, http_status vs grpc_status precedence when both headers are present - we can surely reverse the order to use grpc_status value only when http_status is not present and/or we can explicitly call out in the documentation to use only one of those header to make it similar to how its enforced in proto's via oneof clause.

1. If both x-envoy-fault-abort-grpc-request and x-envoy-fault-abort-request headers are set, prefer x-envoy-fault-abort-request. 2. Remove nullptr checks per feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>

Augustyniak

Left one comment, other than this I don't have any other comments in this PR. It looks great! Thank you for being so easy to work with. I'm going to approve once you respond to/address the comment I left.

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

Augustyniak

Looks good to me now! One tiny style related comment.

Thank you for your active collaboration here - I think that it helped to make this PR better.

Augustyniak · 2020-04-27T16:30:51Z

source/extensions/filters/common/fault/fault_config.cc

    const Http::RequestHeaderMap* request_headers) const {
-  absl::optional<Http::Code> ret;
-  const auto header = request_headers->get(HeaderNames::get().AbortRequest);
+  absl::optional<Http::Code> ret = absl::nullopt;


I don't think that we need = absl::nullopt part here - I think that we should get rid of it if it's not needed.

Yep its not required, I put that in there to make it apparent. I had to read absl::optional default details to find out this, so I thought making it apparent might help readability. I can remove it if its something basic.

It's fine either way. I wouldn't worry about it unless we need another push.

mattklein123

Nice work, thanks!

Signed-off-by: Spencer Lewis <slewis@squareup.com> * master: fault injection: add support for setting gRPC status (envoyproxy#10841) tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940) Fix typo on Postgres Proxy documentation. (envoyproxy#10930) fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931) gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508) http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941) stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735) hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929) prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833) status: Fix ASAN error in Status payload handling (envoyproxy#10906) path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922) compressor filter: add benchmark (envoyproxy#10464) xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934) ci: update before purge in cleanup (envoyproxy#10938) tracer: Improve test coverage for x-ray (envoyproxy#10890) Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)

vijayrajput1 and others added 3 commits April 9, 2020 09:22

Merge pull request #1 from envoyproxy/master

3990560

Bringing fork up to date

Merge pull request #3 from envoyproxy/master

5a77205

Sync to base

Add support for setting grpc status in FI response

7db1fb3

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

repokitteh-read-only bot added the api label Apr 19, 2020

Add header info

0b369ea

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

vijayrajput1 marked this pull request as ready for review April 19, 2020 13:53

vijayrajput1 requested a review from alyssawilk as a code owner April 19, 2020 13:53

repokitteh-read-only bot removed the api label Apr 20, 2020

mattklein123 self-assigned this Apr 20, 2020

mattklein123 requested changes Apr 20, 2020

View reviewed changes

repokitteh-read-only bot added the waiting label Apr 20, 2020

mattklein123 mentioned this pull request Apr 20, 2020

local rate limit HTTP filter: add config interface #10820

Closed

Augustyniak reviewed Apr 21, 2020

View reviewed changes

vijayrajput1 added 2 commits April 22, 2020 09:23

Allow setting not well-defined gRPC codes

160a6fa

and address review feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>

support grpc fault requests percentage via header

535fa00

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

repokitteh-read-only bot added api and removed waiting labels Apr 23, 2020

vijayrajput1 added 2 commits April 22, 2020 22:42

Fix variables name

e6ff272

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

Attempt to make "Linux-x64 clang_tidy" happy

4576e19

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

mattklein123 assigned Augustyniak Apr 23, 2020

Augustyniak reviewed Apr 23, 2020

View reviewed changes

mattklein123 added the waiting label Apr 23, 2020

Change the precedence order of grpc_status vs http_status

8d19697

1. If both x-envoy-fault-abort-grpc-request and x-envoy-fault-abort-request headers are set, prefer x-envoy-fault-abort-request. 2. Remove nullptr checks per feedback. Signed-off-by: Vijay Rajput <vrajput@lyft.com>

repokitteh-read-only bot removed the waiting label Apr 24, 2020

vijayrajput1 requested a review from Augustyniak April 24, 2020 21:37

Augustyniak reviewed Apr 24, 2020

View reviewed changes

mattklein123 added the waiting label Apr 24, 2020

Switch back to using single header for abort fault percentage

0ff968f

Signed-off-by: Vijay Rajput <vrajput@lyft.com>

repokitteh-read-only bot removed the waiting label Apr 25, 2020

vijayrajput1 requested a review from Augustyniak April 25, 2020 01:55

Augustyniak approved these changes Apr 27, 2020

View reviewed changes

mattklein123 approved these changes Apr 27, 2020

View reviewed changes

repokitteh-read-only bot removed the api label Apr 27, 2020

mattklein123 merged commit fd36665 into envoyproxy:master Apr 27, 2020

ckcd mentioned this pull request Jun 16, 2020

Change grpc_status from string to int32 in fault abort istio/istio#24698

Closed

Conversation

vijayrajput1 commented Apr 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

repokitteh-read-only bot commented Apr 19, 2020

Uh oh!

htuch commented Apr 20, 2020

Uh oh!

mattklein123 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vijayrajput1 Apr 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Augustyniak Apr 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Augustyniak left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Augustyniak Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Augustyniak Apr 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vijayrajput1 commented Apr 23, 2020

Uh oh!

Augustyniak left a comment

Choose a reason for hiding this comment

vijayrajput1 commented Apr 19, 2020 •

edited

Loading

vijayrajput1 Apr 21, 2020 •

edited

Loading

Augustyniak Apr 22, 2020 •

edited

Loading

Augustyniak Apr 24, 2020 •

edited

Loading

Augustyniak Apr 23, 2020 •

edited

Loading

vijayrajput1 Apr 27, 2020 •

edited

Loading