Health checks: Add retriable http health check statuses. #17948
rojkov merged 32 commits into envoyproxy:main
Signed-off-by: Weston Carlson <wez470@gmail.com>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to
@adisuissa this should be good for review now :)
adisuissa left a comment
Thanks for working on this!
Left a few API comments.
// By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
// :ref:`Int64Range <envoy_v3_api_msg_type.v3.Int64Range>`. The start and end of each range are required.
// Only statuses in the range [100, 600) are allowed.
What happens if this includes the 2XX responses? How can the proxy count successful health checks then?
Maybe this should be restricted to the range [400, 600).
I'm open to documenting this differently or restricting the range, but currently expected statuses supersede retriable statuses, i.e. if 200 is both expected and retriable, receiving one just counts as a successful health check. I had initially thought this would be simpler than validating against overlaps. WDYT?
It is indeed simpler. I would suggest at least clarifying this in the API comments.
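For illustration, the precedence described in this thread can be sketched as below. This is a minimal sketch, not Envoy's implementation: `CheckOutcome` and `classify` are hypothetical names, and the caller is assumed to have already tested the status against each configured list.

```cpp
// Hypothetical names for illustration only; not Envoy's actual types.
enum class CheckOutcome { Success, RetriableFailure, ImmediateFailure };

// in_expected / in_retriable: whether the response status fell into the
// expected_statuses / retriable_statuses ranges respectively.
CheckOutcome classify(bool in_expected, bool in_retriable) {
  // Expected statuses are checked first, so a status listed in both
  // fields simply counts as a successful health check.
  if (in_expected) {
    return CheckOutcome::Success;
  }
  // Retriable statuses count towards the unhealthy threshold.
  if (in_retriable) {
    return CheckOutcome::RetriableFailure;
  }
  // Anything else marks the host unhealthy right away.
  return CheckOutcome::ImmediateFailure;
}
```

Checking the expected list first is what makes overlap validation unnecessary: an overlapping status can never be treated as a failure.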
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
// :ref:`Int64Range <envoy_v3_api_msg_type.v3.Int64Range>`. The start and end of each range are required.
// Only statuses in the range [100, 600) are allowed.
repeated type.v3.Int64Range retriable_statuses = 12;
The type can be Int32Range, but I see that you kept compatibility with expected_statuses, so I guess it's ok.
// Specifies a list of HTTP response statuses considered retriable. If provided, responses in this range
// will count towards the configured :ref:`unhealthy_threshold <envoy_v3_api_field_config.core.v3.HealthCheck.unhealthy_threshold>`.
// By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
IIUC the main difference is what's counted as immediately unhealthy. Can you please update the comment to describe what is immediately unhealthy?
Sorry, could you add a bit more on what's unclear here? It mentions these ranges counting towards unhealthy threshold and then says everything not in expected statuses by default is considered immediately unhealthy.
I think the following emphasizes that the field is related to hosts not considered immediately unhealthy (the name retriable implies it, but I think this is more explicit):
// will result in the host being considered immediately unhealthy. Ranges follow half-open semantics of
// Specifies a list of HTTP response statuses considered retriable. If provided, responses in this range
// will count towards the configured :ref:`unhealthy_threshold <envoy_v3_api_field_config.core.v3.HealthCheck.unhealthy_threshold>`, and will not result in the host being considered immediately unhealthy
// (By default all responses not in :ref:`expected_statuses <envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.expected_statuses>`
// will result in the host being considered immediately unhealthy). Ranges follow half-open semantics of
"Invalid http retriable status range: expecting end <= 600, but found end={}", end));
}

retriable_ranges_.emplace_back(
Consider adding a range intersection verification against expected_ranges_.
Left this for now since I'm initially leaning towards allowing overlap for simplicity. This is now reflected in the API docs.
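As a rough illustration of the bounds check in the snippet above, here is a minimal sketch of validating one half-open [start, end) range against the allowed [100, 600) window. `validateRetriableRange` is a hypothetical helper, not Envoy's code; only the `end <= 600` message mirrors the diff, and the other messages and the `start < end` check are assumptions.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration; not Envoy's actual code.
// Validates a half-open [start, end) status range.
void validateRetriableRange(int64_t start, int64_t end) {
  // Assumption: an empty or inverted range is rejected.
  if (start >= end) {
    throw std::invalid_argument(
        "Invalid http retriable status range: expecting start < end, but found start=" +
        std::to_string(start) + " and end=" + std::to_string(end));
  }
  // Assumption: the lower bound of the allowed window is inclusive.
  if (start < 100) {
    throw std::invalid_argument(
        "Invalid http retriable status range: expecting start >= 100, but found start=" +
        std::to_string(start));
  }
  // The end is exclusive, so 600 itself is a valid end value.
  if (end > 600) {
    throw std::invalid_argument(
        "Invalid http retriable status range: expecting end <= 600, but found end=" +
        std::to_string(end));
  }
}
```

Note that no check against the expected ranges appears here, matching the decision in this thread to allow overlap.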
…th-checks Signed-off-by: Weston Carlson <wez470@gmail.com>
adisuissa left a comment
Thanks for working on this!
/lgtm api
Left a minor comment.
  return false;
}

bool HttpHealthCheckerImpl::HttpStatusChecker::inRetriableRange(uint64_t http_status) const {
nit: avoid code duplication by refactoring inExpectedRange and inRetriableRange into a single function inRange that receives the http_status and the ranges (either expected_ranges_ or retriable_ranges_).
I agree with this, but I think the two range fields should remain private and the callers should still use the lightly wrapped functions inExpectedRange and inRetriableRange. The inner private function would hold the common factored-out code. What do you think?
Yes, that is better, thanks. Will update
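The shape agreed on above might look roughly like the following. This is a minimal sketch under stated assumptions, not the merged code: `StatusChecker`, `Ranges`, and `inRanges` are hypothetical names, while `inExpectedRange` and `inRetriableRange` are the wrappers named in the discussion.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the status checker; not Envoy's actual class.
class StatusChecker {
public:
  // Each pair is a half-open [start, end) range of status codes.
  using Ranges = std::vector<std::pair<uint64_t, uint64_t>>;

  StatusChecker(Ranges expected, Ranges retriable)
      : expected_ranges_(std::move(expected)), retriable_ranges_(std::move(retriable)) {}

  // Thin public wrappers keep both range fields private.
  bool inExpectedRange(uint64_t http_status) const {
    return inRanges(http_status, expected_ranges_);
  }
  bool inRetriableRange(uint64_t http_status) const {
    return inRanges(http_status, retriable_ranges_);
  }

private:
  // Common lookup shared by both wrappers.
  static bool inRanges(uint64_t http_status, const Ranges& ranges) {
    for (const auto& range : ranges) {
      if (http_status >= range.first && http_status < range.second) {
        return true;
      }
    }
    return false;
  }

  const Ranges expected_ranges_;
  const Ranges retriable_ranges_;
};
```

Callers never see the range containers, so the storage can change later without touching call sites.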
/assign-from @envoyproxy/first-pass-reviewers
@envoyproxy/first-pass-reviewers assignee is @adisuissa
@adisuissa Could you take another quick look? I updated the doc wording a little bit based on feedback.
Can this be merged now? Or is there something else I need to do here?
Sorry for the delay. Can you merge main and we can get this in? /wait
Sorry about the confusion on whether I was working on this or not. Ended up picking it up.
Adds a new API field for HTTP health checks that allows specifying ranges of status codes that are considered retriable. If one of these status codes is received, the failure counts towards the configured unhealthy threshold rather than immediately marking the cluster member unhealthy, as is the case today.
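To make the behavior change concrete, here is a minimal sketch of how retriable failures could accumulate towards the unhealthy threshold. It is a simplified model, not Envoy's health checker: `HostHealth` and `Outcome` are hypothetical names, and recovery of an unhealthy host is deliberately not modeled.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not Envoy's actual types.
enum class Outcome { Success, RetriableFailure, ImmediateFailure };

class HostHealth {
public:
  explicit HostHealth(uint32_t unhealthy_threshold)
      : unhealthy_threshold_(unhealthy_threshold) {}

  void onCheck(Outcome outcome) {
    switch (outcome) {
    case Outcome::Success:
      // A success resets the failure streak. Assumption: bringing an
      // unhealthy host back to healthy is out of scope for this sketch.
      consecutive_failures_ = 0;
      break;
    case Outcome::RetriableFailure:
      // Retriable statuses only flip the host once the threshold is reached.
      if (++consecutive_failures_ >= unhealthy_threshold_) {
        healthy_ = false;
      }
      break;
    case Outcome::ImmediateFailure:
      // Any other unexpected status marks the host unhealthy at once.
      healthy_ = false;
      break;
    }
  }

  bool healthy() const { return healthy_; }

private:
  const uint32_t unhealthy_threshold_;
  uint32_t consecutive_failures_{0};
  bool healthy_{true};
};
```

With a threshold of 3, two retriable failures in a row leave the host healthy and the third marks it unhealthy, whereas a single non-retriable failure does so immediately.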
cc: @mattklein123 since you were commenting on the issue
Commit Message: Add retriable http health check status codes.
Additional Description:
Risk Level: Small
Testing: Unit, Integration
Docs Changes: Fixed proto docs around HTTP health checks as well as the arch overview HTTP health check docs
Release Notes: Added line for new api field.
Platform Specific Features: None
Fixes #7171