
linkerd-proxy using stale endpoints #11480

Closed
bruecktech opened this issue Oct 12, 2023 · 4 comments · Fixed by #11513
bruecktech commented Oct 12, 2023

What is the issue?

Since updating to 2.13.5 we have seen sporadic issues where linkerd-proxy connects to endpoints that have not existed for days.
We think it's triggered by a failure to connect to linkerd-destination (which is a separate problem).
Restarting linkerd-destination resolves the issue.
We also compared the endpoints from linkerd diagnostics endpoints myservice... with kubectl get endpoints myservice and they match. So we don't think linkerd-destination holds stale data; rather, the proxies do.
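For reference, a minimal sketch of that comparison (the namespace, filenames, and column handling are assumptions; adjust for your own output):

```shell
# In a live cluster the two views would come from (assuming namespace "prod"):
#   linkerd diagnostics endpoints myservice.prod.svc.cluster.local:80
#   kubectl get endpoints myservice -n prod -o jsonpath='{.subsets[*].addresses[*].ip}'
# Sample data stands in for the live output here so the comparison logic is runnable:
printf '10.250.162.250\n10.250.163.1\n' | sort > proxy-view.txt  # from linkerd diagnostics
printf '10.250.163.1\n10.250.164.2\n' | sort > k8s-view.txt      # from kubectl
# diff exits non-zero when the two views disagree, i.e. the proxy may hold stale endpoints
diff proxy-view.txt k8s-view.txt || echo "endpoint views differ"
```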

The affected proxies did not report any pending endpoints after the issue, but we currently don't have the data to show what things looked like during the issue.

❯ curl -s localhost:8000/metrics | grep endpoints | grep myservice
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="prod",parent_name="myservice",parent_port="80",parent_section_name="",backend_group="",backend_kind="default",backend_namespace="",backend_name="service",backend_port="",backend_section_name="",endpoint_state="pending"} 0
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="prod",parent_name="myservice",parent_port="80",parent_section_name="",backend_group="",backend_kind="default",backend_namespace="",backend_name="service",backend_port="",backend_section_name="",endpoint_state="ready"} 419
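As a sketch, the per-state gauge values can be pulled out of a scrape like the one above with sed (the label set below is abbreviated; a real scrape carries the full label list shown above):

```shell
# Heredoc stands in for `curl -s localhost:8000/metrics` output (abbreviated labels)
metrics=$(cat <<'EOF'
outbound_http_balancer_endpoints{parent_name="myservice",endpoint_state="pending"} 0
outbound_http_balancer_endpoints{parent_name="myservice",endpoint_state="ready"} 419
EOF
)
# Extract "<state> <value>" pairs from each gauge line
states=$(echo "$metrics" | sed -n 's/.*endpoint_state="\([a-z]*\)"} \([0-9]*\)$/\1 \2/p')
echo "$states"
```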

How can it be reproduced?

Not clear. We updated to 2.13.5 and have since hit this 3 times over the course of a few days.

Logs, error output, etc

{
		"message": "HTTP/1.1 request failed",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"name": "rescue",
					"client": {
						"addr": "10.250.154.125:56774"
					}
				}
			],
			"level": "INFO",
			"fields": {
				"error": "logical service myservice.prod.svc.cluster.local:80: Service.myservice:80: endpoint 10.250.162.250:80: operation was canceled: connection was not ready"
			},
			"timestamp": "[104343.356432s]",
			"target": "linkerd_app_core::errors::respond"
		}
	}
{
		"message": "Unexpected error",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"name": "rescue",
					"client": {
						"addr": "10.250.154.125:56774"
					}
				}
			],
			"level": "WARN",
			"fields": {
				"error": "logical service myservice.prod.svc.cluster.local:80: Service.prod.myservice:80: endpoint 10.250.162.250:80: operation was canceled: connection was not ready"
			},
			"timestamp": "[104343.356447s]",
			"target": "linkerd_app_outbound::http::server"
		}
	}
{
		"message": "Service failed",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"ns": "prod",
					"port": "80",
					"name": "service"
				},
				{
					"name": "endpoint",
					"addr": "10.250.162.250:80"
				}
			],
			"level": "WARN",
			"fields": {
				"error": "channel closed"
			},
			"timestamp": "[104343.925284s]",
			"target": "linkerd_reconnect"
		}
	}
{
		"message": "Failed to connect",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"ns": "prod",
					"port": "80",
					"name": "service"
				},
				{
					"name": "endpoint",
					"addr": "10.250.162.250:80"
				}
			],
			"level": "WARN",
			"fields": {
				"error": "Connection refused (os error 111)"
			},
			"timestamp": "[104344.409821s]",
			"target": "linkerd_reconnect"
		}
	}

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-11-08T07:32:15Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-10-18T13:43:33Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-10-21T08:56:13Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-11-08T09:31:39Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.5 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.13.5 but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5b5ddcf5d4-45glv (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-94j2x (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-g95bm (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-gxvtz (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-jmfn8 (v2.207.0)
	* linkerd-identity-9559b4d7f-96kv7 (v2.207.0)
	* linkerd-identity-9559b4d7f-gqddq (v2.207.0)
	* linkerd-identity-9559b4d7f-gx4bz (v2.207.0)
	* linkerd-identity-9559b4d7f-sfkb7 (v2.207.0)
	* linkerd-identity-9559b4d7f-sl7ck (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-b6w99 (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-bmgqm (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-cf2ss (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-ffzlj (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-n8hxk (v2.207.0)
	* linkerd-sp-validator-dbcc64849-7846s (v2.207.0)
	* linkerd-sp-validator-dbcc64849-7d6kn (v2.207.0)
	* linkerd-sp-validator-dbcc64849-7pqw7 (v2.207.0)
	* linkerd-sp-validator-dbcc64849-92jzx (v2.207.0)
	* linkerd-sp-validator-dbcc64849-jkws8 (v2.207.0)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-5b5ddcf5d4-45glv running v2.207.0 but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ linkerd-viz pods are injected
    could not find proxy container for prometheus-scrape-1-5585795fbd-l4sn5 pod
    see https://linkerd.io/2.14/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
    container "linkerd-proxy" in pod "prometheus-scrape-1-5585795fbd-l4sn5" is not ready
    see https://linkerd.io/2.14/checks/#l5d-viz-pods-running for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-864b6b8ddb-jxlpk (v2.207.0)
	* metrics-api-5484cdf977-llg6t (v2.207.0)
	* tap-58654c968b-7q5hm (v2.207.0)
	* tap-injector-55597d88c7-xd7wp (v2.207.0)
	* web-cbdb85945-b5s27 (v2.207.0)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-864b6b8ddb-jxlpk running v2.207.0 but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

linkerd-smi
-----------
‼ Linkerd extension command linkerd-smi exists
    exec: "linkerd-smi": executable file not found in $PATH
    see https://linkerd.io/2.14/checks/#extensions for hints

Status check results are √

Environment

  • EKS
  • Kubernetes 1.24
  • Linkerd 2.13.5

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

DavidMcLaughlin commented Oct 16, 2023

Thank you for the report; we have a few confirmed cases of this and are working on a fix. To make sure this is the same issue we've seen elsewhere:

  • Which version of Linkerd did you upgrade from?
  • Are you using Linkerd in HA mode?
  • Do you have access to your control plane metrics? If this is the same cause that we've confirmed, you should be able to see endpoints_updates flatline on the affected instance(s) of the destination controller. If you could confirm this, it would be helpful.
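A rough sketch of that flatline check, with sample counter values standing in for two scrapes of a destination pod's metrics endpoint (the port and exact scrape commands are assumptions; adjust for your deployment):

```shell
# A live check would scrape each destination pod, e.g.:
#   kubectl -n linkerd port-forward pod/<destination-pod> 9996 &
#   curl -s localhost:9996/metrics | grep '^endpoints_updates'
# Two sampled counter values, taken some time apart, stand in for real scrapes:
first=10234
second=10234
# A healthy counter keeps increasing between scrapes; a stuck instance flatlines.
if [ "$second" -le "$first" ]; then
  echo "endpoints_updates flatlined"
else
  echo "endpoints_updates still increasing"
fi
```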

bruecktech commented Oct 17, 2023

Which version of Linkerd did you upgrade from?

2.11.4

Are you using Linkerd in HA mode?

yes

Do you have access to your control plane metrics? If this is the same cause that we've confirmed, you should be able to see endpoints_updates flatline on the affected instance(s) of the destination controller. If you could confirm this, it would be helpful.

I looked at the metrics and they don't seem to flatline. In our case no single destination controller is affected; instead, a small number of pods running linkerd-proxy are affected at first, and the number of affected pods grows over time. So most pods still receive their endpoint updates, but some are using outdated lists.

@hawkw hawkw mentioned this issue Oct 19, 2023
hawkw added a commit that referenced this issue Oct 19, 2023
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can
result in Linkerd proxies sending traffic to stale endpoints. In addition, it
contains other bugfixes and updates dependencies to include patches for the
security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing
  changes in the endpoints of a destination, if a proxy subscribed to that
  destination stops reading service discovery updates. This issue results in
  proxies attempting to send traffic for that destination to stale endpoints
  ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not
  terminate unused service discovery watches, exerting backpressure on the
  Destination controller which could cause it to become stuck
  ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed
  from a load balancer. These logs are enabled by default, and can be disabled
  by [setting the proxy log level][proxy-log-level] to
  `warn,linkerd=info,linkerd_proxy_balance=warn` or similar
  ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a
  string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes
  [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `golang.google.org/grpc` Go
  package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm
  ([linkerd2-proxy#2488] and [#11512]).

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11449]: #11449
[#11480]: #11480
[#11504]: #11504
[#11512]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
hawkw commented Oct 20, 2023

The latest edge release, edge-23.10.3, contains changes to the proxy and Destination controller which should resolve these issues. We'd love your help validating this fix by trying out edge-23.10.3 in your environments and letting us know if you encounter any more use of stale endpoints. Thanks!

mateiidavid added a commit that referenced this issue Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unused service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <[email protected]>
@DavidMcLaughlin

This fix is now in stable-2.14.2.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2023