balance: Log and fail stuck discovery streams. #2484
Merged
In 6d2abbc, we changed how outbound proxies process discovery updates. The prior implementation used a watchdog timeout to bound the amount of time an update stream could be full. With that change, when an update channel fills, the backpressure can extend to the destination controller's gRPC response stream.

To detect and avoid this harmful (and useless) backpressure, this change modifies the balancer's discovery processing stream to exit when the balancer has 1000 unprocessed discovery updates. A sufficiently scary warning is logged.
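The exit-on-backlog behavior described above can be sketched roughly as follows. This is a minimal illustration, not the proxy's code: the real balancer is asynchronous, while this sketch uses the standard library's bounded `sync_channel` as a stand-in, and the names `Update`, `Exit`, and `forward` are hypothetical. The only detail taken from the description is the limit of 1000 unprocessed updates.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hypothetical stand-in for an endpoint update from the discovery stream.
#[derive(Debug)]
enum Update {
    Add(String),
    Remove(String),
}

/// Why the forwarding loop stopped.
#[derive(Debug, PartialEq)]
enum Exit {
    /// The discovery stream ended on its own.
    StreamEnded,
    /// The consumer fell behind by a full queue of updates.
    QueueFull,
    /// The consumer dropped its receiver.
    ConsumerGone,
}

/// Forwards updates into a bounded queue without ever blocking the producer.
/// When the queue is full, it logs a warning and exits instead of waiting,
/// so that backpressure never reaches the upstream discovery stream.
fn forward(updates: impl IntoIterator<Item = Update>, tx: &SyncSender<Update>) -> Exit {
    for update in updates {
        match tx.try_send(update) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => {
                eprintln!("WARN: discovery queue is full; dropping the discovery stream");
                return Exit::QueueFull;
            }
            Err(TrySendError::Disconnected(_)) => return Exit::ConsumerGone,
        }
    }
    Exit::StreamEnded
}

fn main() {
    // The limit from the description: 1000 unprocessed updates.
    const CAPACITY: usize = 1000;
    let (tx, rx) = sync_channel(CAPACITY);

    // Nothing reads from `rx` while forwarding, so the update after the
    // first 1000 finds the queue full.
    let updates = (0..CAPACITY)
        .map(|i| Update::Add(format!("10.0.0.1:{}", i)))
        .chain(std::iter::once(Update::Remove("10.0.0.1:0".to_string())));

    let exit = forward(updates, &tx);
    println!("{:?} with {} queued", exit, rx.try_iter().count());
}
```

The point of using a non-blocking send is that a stalled consumer turns into an explicit, logged exit rather than an indefinitely blocked producer.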
hawkw
approved these changes
Oct 17, 2023
looks good to me! i left some very minor nits.
olix0r
added a commit
to linkerd/linkerd2
that referenced
this pull request
Oct 19, 2023
328826caa updated the balancer's discovery channel to prevent backing up into the discovery stream by dropping the discovery stream. This results in balancers becoming permanently stale (should they ever be used again).

This change modifies the discovery stream so that these errors are fatal for the balancer. These errors are recorded distinctly by the error counters.

To fix this, we replace the `DiscoverNew` module with a `discover::NewServices` module that wraps the buffering layer. The buffer now only holds target metadata, and services are only built as each entry is dequeued from the channel.

This has the (positive) side-effect that the proxy's `stack_create_total` metric will not be incremented before the balancer actually uses an endpoint stack. Previously, this metric would be incremented for all queued endpoint updates.

We also now log at INFO the address of all additions and removals from a balancer. This should dramatically improve diagnostics in stale endpoint situations.

---

* build(deps): bump DavidAnson/markdownlint-cli2-action (linkerd/linkerd2-proxy#2460)
* build(deps): bump tj-actions/changed-files from 36.2.1 to 39.0.2 (linkerd/linkerd2-proxy#2468)
* build(deps): bump EmbarkStudios/cargo-deny-action from 1.5.0 to 1.5.4 (linkerd/linkerd2-proxy#2448)
* meshtls: log errors parsing client certs (linkerd/linkerd2-proxy#2467)
* build(deps): bump actions/checkout from 3.5.0 to 4.1.0 (linkerd/linkerd2-proxy#2474)
* build(deps): bump tj-actions/changed-files from 39.0.2 to 39.2.0 (linkerd/linkerd2-proxy#2475)
* build(deps): bump EmbarkStudios/cargo-deny-action from 1.5.4 to 1.5.5 (linkerd/linkerd2-proxy#2478)
* build(deps): bump DavidAnson/markdownlint-cli2-action (linkerd/linkerd2-proxy#2476)
* build(deps): bump actions/upload-artifact from 3.1.2 to 3.1.3 (linkerd/linkerd2-proxy#2479)
* Render grpc_status metric label as number (linkerd/linkerd2-proxy#2480)
* balance: Log and fail stuck discovery streams. (linkerd/linkerd2-proxy#2484)
* build(deps): update `rustix` to v0.36.16/v0.37.7 (linkerd/linkerd2-proxy#2488)
* balance: Fail the discovery stream on queue backup (linkerd/linkerd2-proxy#2486)

Signed-off-by: Oliver Gould <[email protected]>
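The idea of buffering only target metadata and building services at dequeue time can be illustrated with a small sketch. It is a simplified, synchronous model under stated assumptions: `Service`, `LazyQueue`, and the `created` counter are hypothetical names, and `created` merely plays the role that `stack_create_total` plays in the proxy. It is not the `discover::NewServices` implementation.

```rust
use std::collections::VecDeque;

/// Hypothetical endpoint stack. Building one is the expensive step that the
/// commit message says is now deferred until the entry is dequeued.
struct Service {
    addr: String,
}

/// A queue that stores only target metadata (here, an address string) and
/// builds a `Service` lazily, when the consumer dequeues the entry.
struct LazyQueue {
    pending: VecDeque<String>,
    /// Counts built services; stands in for a creation metric.
    created: usize,
}

impl LazyQueue {
    fn new() -> Self {
        Self {
            pending: VecDeque::new(),
            created: 0,
        }
    }

    /// Enqueueing records metadata only; no service is built here.
    fn enqueue(&mut self, addr: &str) {
        self.pending.push_back(addr.to_string());
    }

    /// Dequeueing builds the service and only then bumps the counter.
    fn dequeue(&mut self) -> Option<Service> {
        let addr = self.pending.pop_front()?;
        self.created += 1;
        Some(Service { addr })
    }
}

fn main() {
    let mut queue = LazyQueue::new();
    queue.enqueue("10.0.0.1:8080");
    queue.enqueue("10.0.0.2:8080");
    println!("created after enqueue: {}", queue.created);

    let svc = queue.dequeue().expect("one pending target");
    println!("built {} (created: {})", svc.addr, queue.created);
}
```

With this shape, updates that are queued but never consumed cost only the metadata they carry and never show up in the creation count.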
hawkw
pushed a commit
to linkerd/linkerd2
that referenced
this pull request
Oct 19, 2023
hawkw
added a commit
to linkerd/linkerd2
that referenced
this pull request
Oct 19, 2023
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains other bugfixes and updates dependencies to include patches for the security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing changes in the endpoints of a destination, if a proxy subscribed to that destination stops reading service discovery updates. This issue results in proxies attempting to send traffic for that destination to stale endpoints ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not terminate unused service discovery watches, exerting backpressure on the Destination controller which could cause it to become stuck ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed from a load balancer. These logs are enabled by default, and can be disabled by [setting the proxy log level][proxy-log-level] to `warn,linkerd=info,linkerd_proxy_balance=warn` or similar ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `google.golang.org/grpc` Go package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm ([linkerd2-proxy#2488] and [#11512]).

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11449]: #11449
[#11480]: #11480
[#11504]: #11504
[#11512]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
hawkw
pushed a commit
that referenced
this pull request
Oct 25, 2023
hawkw
pushed a commit
that referenced
this pull request
Oct 25, 2023
hawkw
pushed a commit
that referenced
this pull request
Oct 26, 2023
hawkw
pushed a commit
that referenced
this pull request
Oct 26, 2023
mateiidavid
added a commit
to linkerd/linkerd2
that referenced
this pull request
Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bug fix for profile resolutions for pods bound on host ports and includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3.

* Control Plane
  * Fixed an issue where the Destination controller could stop processing changes in the endpoints of a destination, if a proxy subscribed to that destination stops reading service discovery updates. This issue results in proxies attempting to send traffic for that destination to stale endpoints ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod metadata for profile resolutions for a pod accessed via the host network (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several dependencies (including Go's gRPC and net libraries)
* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as a string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not terminate unused service discovery watches, exerting backpressure on the Destination controller which could cause it to become stuck ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <[email protected]>
mateiidavid
added a commit
to linkerd/linkerd2
that referenced
this pull request
Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bug fix for profile resolutions for pods bound on host ports and includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3.

* Control Plane
  * Fixed an issue where the Destination controller could stop processing changes in the endpoints of a destination, if a proxy subscribed to that destination stops reading service discovery updates. This issue results in proxies attempting to send traffic for that destination to stale endpoints ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod metadata for profile resolutions for a pod accessed via the host network (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several dependencies (including Go's gRPC and net libraries)
* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as a string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not terminate unused service discovery watches, exerting backpressure on the Destination controller which could cause it to become stuck ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

---------

Signed-off-by: Matei David <[email protected]>
Co-authored-by: Alejandro Pedraza <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>