Skip to content

Health checks: Add retriable http health check statuses.#17948

Merged
rojkov merged 32 commits intoenvoyproxy:mainfrom
wez470:retryable-http-health-checks
Oct 7, 2021
Merged

Health checks: Add retriable http health check statuses.#17948
rojkov merged 32 commits intoenvoyproxy:mainfrom
wez470:retryable-http-health-checks

Conversation

@wez470
Copy link
Copy Markdown
Contributor

@wez470 wez470 commented Sep 1, 2021

Sorry about the confusion on whether I was working on this or not. Ended up picking it up.

Adds a new API field for http health checks that allows specifying ranges of status codes that are considered retriable. If these status codes are received, those failures will contribute towards the configured unhealthy threshold rather that immediately considering the cluster member unhealthy as is the case today.

cc: @mattklein123 since you were commenting on the issue

Commit Message: Add retriable http health check status codes.
Additional Description:
Risk Level: Small
Testing: Unit, Integration
Docs Changes: Fixed proto docs around HTTP health checks and well as arch overview HTTP health check docs
Release Notes: Added line for new api field.
Platform Specific Features: None
[Optional Fixes #Issue] #7171
[Optional API Considerations:]

Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
@repokitteh-read-only
Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #17948 was opened by wez470.

see: more, trace.

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #17948 was opened by wez470.

see: more, trace.

Signed-off-by: Weston Carlson <wez470@gmail.com>
@wez470 wez470 changed the title Add retryable http health check statuses. Health checks: Add retryable http health check statuses. Sep 1, 2021
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
@wez470 wez470 changed the title Health checks: Add retryable http health check statuses. Health checks: Add retriable http health check statuses. Sep 2, 2021
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
@wez470 wez470 marked this pull request as ready for review September 4, 2021 00:44
@wez470
Copy link
Copy Markdown
Contributor Author

wez470 commented Sep 7, 2021

@adisuissa this should be good for review now :)

Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
Left a few API comments.

// By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
// :ref:`Int64Range <envoy_v3_api_msg_type.v3.Int64Range>`. The start and end of each range are required.
// Only statuses in the range [100, 600) are allowed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this includes the 2XX responses? How can the proxy count positive healths then?
Maybe this should be restricted to the range [400, 600)

Copy link
Copy Markdown
Contributor Author

@wez470 wez470 Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to documenting this differently or restricting the range but currently expected statuses supersede retriable statuses. i.e. if 200 is expected and retriable, getting one will just count as a successful health check. I had initially thought this would be simpler than validating against any overlaps. wdyt?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed simpler. I would suggest at least clarifying this as part of the comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated docs.

// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
// :ref:`Int64Range <envoy_v3_api_msg_type.v3.Int64Range>`. The start and end of each range are required.
// Only statuses in the range [100, 600) are allowed.
repeated type.v3.Int64Range retriable_statuses = 12;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type can be Int32Range, but I see that you kept compatibility with expected_statuses, so I guess it's ok.

// Specifies a list of HTTP response statuses considered retriable. If provided, responses in this range
// will count towards the configured :ref:`unhealthy_threshold <envoy_v3_api_field_config.core.v3.HealthCheck.unhealthy_threshold>`.
// By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the main difference is what's counted as immediately unhealthy. Can you please update the comment to describe what is immediately unhealthy.

Copy link
Copy Markdown
Contributor Author

@wez470 wez470 Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, could you add a bit more on what's unclear here? It mentions these ranges counting towards unhealthy threshold and then says everything not in expected statuses by default is considered immediately unhealthy.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the following emphasizes that the field is related to hosts not considered immediately unhealthy (the name retriable implies it, but I think this is more explicit):

Suggested change
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
// Specifies a list of HTTP response statuses considered retriable. If provided, responses in this range
// will count towards the configured :ref:`unhealthy_threshold <envoy_v3_api_field_config.core.v3.HealthCheck.unhealthy_threshold>`, and will not result in the host being considered immediately unhealthy
// (By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
// will result in the host being considered immediately unhealthy). Ranges follow half-open semantics of

"Invalid http retriable status range: expecting end <= 600, but found end={}", end));
}

retriable_ranges_.emplace_back(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a range intersection verification against the expected_ranges_

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left this for now since I'm initially leaning towards allowing overlap for simplicity. This is now reflected in the API docs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
…th-checks

Signed-off-by: Weston Carlson <wez470@gmail.com>
Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
/lgtm api
Left a minor comment.

return false;
}

bool HttpHealthCheckerImpl::HttpStatusChecker::inRetriableRange(uint64_t http_status) const {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avoid code duplication by refactoring inExpectedRange and inRetriableRange to a single function inRange that receives the http_status and the range (either expected_range_ or retriable_range_)

Copy link
Copy Markdown
Contributor

@esmet esmet Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this but I think the two ranges fields should remain private and the callers should still use lightly wrapped functions inExpectedRange and inRetriablyRange. The inner private function would have the common factored out code. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in response to c7e875e

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is better, thanks. Will update

"Invalid http retriable status range: expecting end <= 600, but found end={}", end));
}

retriable_ranges_.emplace_back(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@repokitteh-read-only repokitteh-read-only bot removed the api label Sep 9, 2021
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
@zuercher
Copy link
Copy Markdown
Member

zuercher commented Sep 9, 2021

/assign-from @envoyproxy/first-pass-reviewers

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/first-pass-reviewers assignee is @adisuissa

🐱

Caused by: a #17948 (comment) was created by @zuercher.

see: more, trace.

Signed-off-by: Weston Carlson <wez470@gmail.com>
junr03
junr03 previously approved these changes Sep 22, 2021
Copy link
Copy Markdown
Member

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@wez470
Copy link
Copy Markdown
Contributor Author

wez470 commented Sep 22, 2021

@adisuissa Could you take another quick look? I updated the doc wording a little bit based on feedback.

Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm api

esmet
esmet previously approved these changes Sep 23, 2021
Signed-off-by: Weston Carlson <wez470@gmail.com>
@wez470 wez470 dismissed stale reviews from esmet and junr03 via 37c41e8 September 24, 2021 14:48
…th-checks

Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
rojkov
rojkov previously approved these changes Sep 27, 2021
Copy link
Copy Markdown
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks!

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17948 (review) was submitted by @rojkov.

see: more, trace.

@wez470
Copy link
Copy Markdown
Contributor Author

wez470 commented Sep 29, 2021

Can this be merged now? Or is there something else I need to do here?

@mattklein123 mattklein123 self-assigned this Oct 6, 2021
@mattklein123
Copy link
Copy Markdown
Member

Sorry for the delay. Can you merge main and we can get this in?

/wait

…th-checks

Signed-off-by: Weston Carlson <wez470@gmail.com>
Signed-off-by: Weston Carlson <wez470@gmail.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Copy Markdown
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rojkov rojkov merged commit 42f9fc3 into envoyproxy:main Oct 7, 2021
@wez470 wez470 deleted the retryable-http-health-checks branch October 7, 2021 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants