Health checks: Add retriable http health check statuses. by wez470 · Pull Request #17948 · envoyproxy/envoy

wez470 · 2021-09-01T15:03:49Z

Sorry about the confusion on whether I was working on this or not. Ended up picking it up.

Adds a new API field for HTTP health checks that allows specifying ranges of status codes that are considered retriable. If one of these status codes is received, the failure counts toward the configured unhealthy threshold rather than immediately marking the cluster member unhealthy, as is the case today.
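To illustrate the behavior described above, here is a minimal C++ sketch of how a retriable failure could differ from any other failure when tracking host health. `HostHealthTracker`, `CheckResult`, and `onResult` are hypothetical names chosen for this example and are not taken from Envoy's source. The sketch also assumes a single success restores health, which ignores any healthy threshold.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not Envoy's actual implementation.
enum class CheckResult { Success, RetriableFailure, HardFailure };

class HostHealthTracker {
public:
  explicit HostHealthTracker(uint32_t unhealthy_threshold)
      : unhealthy_threshold_(unhealthy_threshold) {}

  // Records one health check result and returns whether the host is
  // still considered healthy afterwards.
  bool onResult(CheckResult result) {
    switch (result) {
    case CheckResult::Success:
      // Simplification: one success resets the failure count and restores health.
      consecutive_failures_ = 0;
      healthy_ = true;
      break;
    case CheckResult::RetriableFailure:
      // Retriable failures only flip the host once the threshold is reached.
      if (++consecutive_failures_ >= unhealthy_threshold_) {
        healthy_ = false;
      }
      break;
    case CheckResult::HardFailure:
      // Any other failure marks the host unhealthy immediately.
      healthy_ = false;
      break;
    }
    return healthy_;
  }

private:
  uint32_t unhealthy_threshold_;
  uint32_t consecutive_failures_{0};
  bool healthy_{true};
};
```

With a threshold of 3, two retriable failures in a row leave the host healthy and the third marks it unhealthy, while a single hard failure marks it unhealthy right away.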

cc: @mattklein123 since you were commenting on the issue

Commit Message: Add retriable http health check status codes.
Additional Description:
Risk Level: Small
Testing: Unit, Integration
Docs Changes: Fixed proto docs around HTTP health checks as well as arch overview HTTP health check docs
Release Notes: Added line for new api field.
Platform Specific Features: None
[Optional Fixes #Issue] #7171
[Optional API Considerations:]

Signed-off-by: Weston Carlson <wez470@gmail.com>

repokitteh-read-only · 2021-09-01T15:03:52Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #17948 was opened by wez470.

see: more, trace.

repokitteh-read-only · 2021-09-01T15:03:57Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #17948 was opened by wez470.

see: more, trace.

wez470 · 2021-09-07T14:14:34Z

@adisuissa this should be good for review now :)

adisuissa

Thanks for working on this!
Left a few API comments.

adisuissa · 2021-09-07T13:57:35Z

api/envoy/config/core/v3/health_check.proto

+    // By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
+    // will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
+    // :ref:`Int64Range <envoy_v3_api_msg_type.v3.Int64Range>`. The start and end of each range are required.
+    // Only statuses in the range [100, 600) are allowed.


What happens if this includes the 2XX responses? How can the proxy count positive healths then?
Maybe this should be restricted to the range [400, 600)

I'm open to documenting this differently or restricting the range, but currently expected statuses supersede retriable statuses; i.e., if 200 is both expected and retriable, receiving it simply counts as a successful health check. I had initially thought this would be simpler than validating against overlaps. WDYT?

It is indeed simpler. I would suggest at least clarifying this as part of the comments

Updated docs.
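The precedence discussed in this thread (expected statuses win over retriable ones when ranges overlap) can be sketched as follows. This is an illustrative example only: `Outcome`, `StatusRange`, `inAnyRange`, and `classify` are hypothetical names, and ranges are modeled as half-open `[start, end)` pairs to match the `Int64Range` semantics mentioned in the proto comment.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper types; a range is half-open: [first, second).
using StatusRange = std::pair<uint64_t, uint64_t>;

enum class Outcome { Healthy, Retriable, Unhealthy };

// Returns true if status falls inside any of the half-open ranges.
inline bool inAnyRange(const std::vector<StatusRange>& ranges, uint64_t status) {
  for (const auto& range : ranges) {
    if (status >= range.first && status < range.second) {
      return true;
    }
  }
  return false;
}

// Expected ranges are checked first, so a status listed in both sets is
// treated as a success rather than a retriable failure.
inline Outcome classify(uint64_t status, const std::vector<StatusRange>& expected,
                        const std::vector<StatusRange>& retriable) {
  if (inAnyRange(expected, status)) {
    return Outcome::Healthy;
  }
  if (inAnyRange(retriable, status)) {
    return Outcome::Retriable;
  }
  return Outcome::Unhealthy;
}
```

Checking the expected ranges first means no overlap validation is needed at configuration time, which is the simplification the author argues for.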

adisuissa · 2021-09-07T13:58:12Z

api/envoy/config/core/v3/health_check.proto

+    // will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
+    // :ref:`Int64Range <envoy_v3_api_msg_type.v3.Int64Range>`. The start and end of each range are required.
+    // Only statuses in the range [100, 600) are allowed.
+    repeated type.v3.Int64Range retriable_statuses = 12;


The type can be Int32Range, but I see that you kept compatibility with expected_statuses, so I guess it's ok.

adisuissa · 2021-09-07T14:05:53Z

api/envoy/config/core/v3/health_check.proto

+    // Specifies a list of HTTP response statuses considered retriable. If provided, responses in this range
+    // will count towards the configured :ref:`unhealthy_threshold <envoy_v3_api_field_config.core.v3.HealthCheck.unhealthy_threshold>`.
+    // By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
+    // will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of


IIUC the main difference is what's counted as immediately unhealthy. Can you please update the comment to describe what is immediately unhealthy.

Sorry, could you add a bit more on what's unclear here? It mentions these ranges counting towards unhealthy threshold and then says everything not in expected statuses by default is considered immediately unhealthy.

I think the following emphasizes that the field is related to hosts not considered immediately unhealthy (the name retriable implies it, but I think this is more explicit):

Suggested change

-    // will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
+    // Specifies a list of HTTP response statuses considered retriable. If provided, responses in this range
+    // will count towards the configured :ref:`unhealthy_threshold <envoy_v3_api_field_config.core.v3.HealthCheck.unhealthy_threshold>`, and will not result in the host being considered immediately unhealthy
+    // (By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
+    // will result in the host being considered immediately unhealthy). Ranges follow half-open semantics of

adisuissa · 2021-09-07T14:18:12Z

source/common/upstream/health_checker_impl.cc

+          "Invalid http retriable status range: expecting end <= 600, but found end={}", end));
+    }
+
+    retriable_ranges_.emplace_back(


Consider adding a range intersection verification against the expected_ranges_

Left this for now since I'm initially leaning towards allowing overlap for simplicity. This is now reflected in the API docs.
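The diff context in this thread shows the tail of a bounds check on each retriable range. A rough sketch of what such a check might look like is below, assuming half-open ranges restricted to `[100, 600)` as the proto comment states. `StatusRange` and `validateRetriableRange` are hypothetical names, `std::invalid_argument` stands in for whatever exception type the real code uses, and only the `end <= 600` message is taken from the diff; the other two messages are invented for the example.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a half-open status range: [start, end).
struct StatusRange {
  int64_t start;
  int64_t end;
};

// Throws std::invalid_argument if the range is empty or falls outside [100, 600).
inline void validateRetriableRange(const StatusRange& range) {
  if (range.start >= range.end) {
    // Assumed message wording; not taken from the PR.
    throw std::invalid_argument("Invalid http retriable status range: expecting start < end");
  }
  if (range.start < 100) {
    // Assumed message wording; not taken from the PR.
    throw std::invalid_argument(
        "Invalid http retriable status range: expecting start >= 100, but found start=" +
        std::to_string(range.start));
  }
  if (range.end > 600) {
    // Message wording mirrors the line visible in the diff above.
    throw std::invalid_argument(
        "Invalid http retriable status range: expecting end <= 600, but found end=" +
        std::to_string(range.end));
  }
}
```

Because the end is exclusive, a range of `{500, 600}` is accepted and covers statuses 500 through 599, while `{500, 601}` is rejected.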

adisuissa

Thanks for working on this!
/lgtm api
Left a minor comment.

adisuissa · 2021-09-08T20:39:04Z

source/common/upstream/health_checker_impl.cc

+  return false;
+}
+
+bool HttpHealthCheckerImpl::HttpStatusChecker::inRetriableRange(uint64_t http_status) const {


nit: avoid code duplication by refactoring inExpectedRange and inRetriableRange into a single function inRange that receives the http_status and the ranges (either expected_ranges_ or retriable_ranges_)

I agree with this, but I think the two range fields should remain private and callers should still use the lightly wrapped functions inExpectedRange and inRetriableRange. The inner private function would hold the common, factored-out code. What do you think?

This is in response to c7e875e

Yes, that is better, thanks. Will update
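The shape agreed on in this thread (private range fields, thin public wrappers, one shared private helper) might look roughly like the sketch below. This is not the code from the PR: the class here is a simplified stand-in, the helper name `inRanges` and the `Range` alias are assumptions, and making the helper `static` follows the later "Upgrade const methods to static" commit title only loosely.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-in for the status checker discussed above.
class HttpStatusChecker {
public:
  using Range = std::pair<uint64_t, uint64_t>; // half-open: [first, second)

  HttpStatusChecker(std::vector<Range> expected, std::vector<Range> retriable)
      : expected_ranges_(std::move(expected)), retriable_ranges_(std::move(retriable)) {}

  // Thin public wrappers keep the range fields private.
  bool inExpectedRange(uint64_t http_status) const {
    return inRanges(http_status, expected_ranges_);
  }
  bool inRetriableRange(uint64_t http_status) const {
    return inRanges(http_status, retriable_ranges_);
  }

private:
  // Shared lookup; static because it does not touch any member state.
  static bool inRanges(uint64_t http_status, const std::vector<Range>& ranges) {
    for (const auto& range : ranges) {
      if (http_status >= range.first && http_status < range.second) {
        return true;
      }
    }
    return false;
  }

  std::vector<Range> expected_ranges_;
  std::vector<Range> retriable_ranges_;
};
```

Callers only ever see `inExpectedRange` and `inRetriableRange`, so the duplicated loop is removed without exposing either range list.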

adisuissa · 2021-09-09T03:12:11Z

source/common/upstream/health_checker_impl.cc

+          "Invalid http retriable status range: expecting end <= 600, but found end={}", end));
+    }
+
+    retriable_ranges_.emplace_back(


zuercher · 2021-09-09T17:28:37Z

/assign-from @envoyproxy/first-pass-reviewers

repokitteh-read-only · 2021-09-09T17:28:40Z

@envoyproxy/first-pass-reviewers assignee is @adisuissa

🐱

Caused by: a #17948 (comment) was created by @zuercher.

see: more, trace.

junr03

lgtm

wez470 · 2021-09-22T17:30:49Z

@adisuissa Could you take another quick look? I updated the doc wording a little bit based on feedback.

adisuissa

/lgtm api

rojkov

Awesome! Thanks!

/retest

repokitteh-read-only · 2021-09-27T07:58:06Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17948 (review) was submitted by @rojkov.

see: more, trace.

wez470 · 2021-09-29T01:24:49Z

Can this be merged now? Or is there something else I need to do here?

mattklein123 · 2021-10-06T16:26:39Z

Sorry for the delay. Can you merge main and we can get this in?

/wait

mattklein123

Thanks!

rojkov

Thanks!

wez470 added 2 commits September 1, 2021 08:49

Add retryable http health check statuses.

1565ef0

Signed-off-by: Weston Carlson <wez470@gmail.com>

Replace retryable with retriable.

f631c7f

Signed-off-by: Weston Carlson <wez470@gmail.com>

repokitteh-read-only bot added the api label Sep 1, 2021

repokitteh-read-only bot assigned adisuissa Sep 1, 2021

Fix typo.

981b5f2

Signed-off-by: Weston Carlson <wez470@gmail.com>

wez470 changed the title ~~Add retryable http health check statuses.~~ Health checks: Add retryable http health check statuses. Sep 1, 2021

wez470 added 3 commits September 2, 2021 10:51

Add http unhealthy threshold integration test.

edbe062

Signed-off-by: Weston Carlson <wez470@gmail.com>

Fix api docs references.

9646866

Signed-off-by: Weston Carlson <wez470@gmail.com>

Add version history doc.

3f7db44

Signed-off-by: Weston Carlson <wez470@gmail.com>

wez470 changed the title ~~Health checks: Add retryable http health check statuses.~~ Health checks: Add retriable http health check statuses. Sep 2, 2021

wez470 added 2 commits September 2, 2021 17:42

Add in http health check range tests.

f4295fb

Signed-off-by: Weston Carlson <wez470@gmail.com>

Wait for counter/guages in test.

ab37ada

Signed-off-by: Weston Carlson <wez470@gmail.com>

wez470 marked this pull request as ready for review September 4, 2021 00:44

adisuissa reviewed Sep 7, 2021

wez470 added 5 commits September 7, 2021 09:09

Move API field.

239f73a

Signed-off-by: Weston Carlson <wez470@gmail.com>

Kick CI

4a27d0a

Signed-off-by: Weston Carlson <wez470@gmail.com>

Update docs. Add integration test.

0366c98

Signed-off-by: Weston Carlson <wez470@gmail.com>

Kick CI

1176ebc

Signed-off-by: Weston Carlson <wez470@gmail.com>

Merge remote-tracking branch 'upstream/main' into retryable-http-heal…

66d4129

…th-checks Signed-off-by: Weston Carlson <wez470@gmail.com>

adisuissa reviewed Sep 9, 2021

repokitteh-read-only bot removed the api label Sep 9, 2021

wez470 added 3 commits September 9, 2021 09:26

Refactor range check funcs.

c7e875e

Signed-off-by: Weston Carlson <wez470@gmail.com>

Pass proper range to in ranges func.

889adec

Signed-off-by: Weston Carlson <wez470@gmail.com>

Remove uneeded include.

70714be

Signed-off-by: Weston Carlson <wez470@gmail.com>

zuercher unassigned adisuissa Sep 9, 2021

Update docs.

52bcc5f

Signed-off-by: Weston Carlson <wez470@gmail.com>

wez470 dismissed esmet’s stale review via 52bcc5f September 21, 2021 22:09

repokitteh-read-only bot added api and removed waiting labels Sep 21, 2021

junr03 previously approved these changes Sep 22, 2021

adisuissa reviewed Sep 23, 2021

repokitteh-read-only bot removed the api label Sep 23, 2021

esmet previously approved these changes Sep 23, 2021

rojkov reviewed Sep 24, 2021

source/common/upstream/health_checker_impl.h (outdated, resolved)

Upgrade const methods to static.

37c41e8

Signed-off-by: Weston Carlson <wez470@gmail.com>

wez470 dismissed stale reviews from esmet and junr03 via 37c41e8 September 24, 2021 14:48

wez470 added 2 commits September 24, 2021 12:40

Merge remote-tracking branch 'upstream/main' into retryable-http-heal…

6d80517

…th-checks Signed-off-by: Weston Carlson <wez470@gmail.com>

Kick CI

6a5269e

Signed-off-by: Weston Carlson <wez470@gmail.com>

rojkov previously approved these changes Sep 27, 2021

mattklein123 self-assigned this Oct 6, 2021

repokitteh-read-only bot added the waiting label Oct 6, 2021

Merge remote-tracking branch 'upstream/main' into retryable-http-heal…

5599a92

…th-checks Signed-off-by: Weston Carlson <wez470@gmail.com>

repokitteh-read-only bot removed the waiting label Oct 6, 2021

Fix merge of version history.

974191a

Signed-off-by: Weston Carlson <wez470@gmail.com>

wez470 dismissed rojkov’s stale review via 974191a October 6, 2021 16:41

mattklein123 approved these changes Oct 6, 2021

rojkov approved these changes Oct 7, 2021

rojkov merged commit 42f9fc3 into envoyproxy:main Oct 7, 2021

wez470 deleted the retryable-http-health-checks branch October 7, 2021 15:12

fishcakez mentioned this pull request Feb 4, 2026

Add retriable_serving_statuses to grpc health check to support retrying NOT_SERVING #43331

Open
