Linkerd proxy fails to reconnect to restarted DaemonSet pod. #10590

Closed
Steffen911 opened this issue Mar 21, 2023 · 5 comments

Steffen911 commented Mar 21, 2023

What is the issue?

We run the OpenTelemetry Collector as a DaemonSet and send traces from our pods to the collector using the node IP address. When we restart the collector, it receives a new Pod IP, but the Node IP remains the same. Some proxies still try to connect to the old Pod IP and therefore fail to reconnect.

Example:
Emojiservice is supposed to send traces to the Collector service. The collector runs as a DaemonSet and exposes port 4317 (gRPC). We inject the NodeIP into the emojiservice via the Downward API so that it sends traces to 10.167.0.1:4317. This is resolved to the PodIP of the Collector, 10.169.10.1:4317.
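For readers reproducing this setup, here is a minimal sketch of how a client could build the collector address from the injected node IP. The environment variable name `NODE_IP` and the helper `collector_endpoint` are assumptions for illustration; the report does not say what the variable is called. The port 4317 is taken from the report.

```python
import os

# The Downward API can expose the node IP to a container through an env var:
#   env:
#     - name: NODE_IP              # hypothetical variable name
#       valueFrom:
#         fieldRef:
#           fieldPath: status.hostIP
OTLP_GRPC_PORT = 4317  # collector gRPC port from the report


def collector_endpoint(environ=os.environ, var="NODE_IP", port=OTLP_GRPC_PORT):
    """Return "<node-ip>:<port>" for the collector pod on the same node."""
    node_ip = environ.get(var)
    if not node_ip:
        raise RuntimeError(f"{var} is not set; check the Downward API env entry")
    return f"{node_ip}:{port}"


if __name__ == "__main__":
    # Example with the node IP from the report.
    print(collector_endpoint({"NODE_IP": "10.167.0.1"}))
```

The application only ever sees the node IP. The translation to a pod IP happens in the proxy and control plane, which is where the stale address shows up.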

Now, we restart the DaemonSet and the PodIP of the Collector changes to 10.169.11.1. However, I still see debug log entries from the Emojiservice sidecar that try to connect to the old IP (see the log example).

I believe it is related to #8956, but I don't know whether or how I can use the diagnostics command for IP-based connections.

How can it be reproduced?

Call instances of a DaemonSet using the NodeIP address from the Kubernetes Downward API. Restart the pods of the DaemonSet so that they are assigned new PodIPs. Observe that connections using the NodeIP still target the old PodIP and fail to be re-established.

Logs, error output, etc

{
	"id": "AQAAAYcDrhY0uLmpXQAAAABBWWNEcmgya0FBQjMxcXIxZEpFSnJBQTQ",
	"content": {
		"timestamp": "2023-03-21T10:19:13.332Z",
		"tags": [
			"short_image:proxy",
			"kube_container_name:linkerd-proxy",
			"image_tag:stable-2.12.4",
			"pod_phase:running",
			"source:proxy",
			"kube_ownerref_kind:replicaset",
			"container_name:linkerd-proxy",
			"cloud_provider:gcp"
		],
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "10.167.0.1:4317"
				}
			],
			"level": "DEBUG",
			"fields": {
				"server": {
					"addr": "10.169.10.1:4317"
				},
				"message": "Connecting"
			},
			"timestamp": "[ 75611.397246s]",
			"target": "linkerd_proxy_transport::connect"
		}
	}
}
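To spot the symptom across many log lines, the following sketch extracts the original destination and the resolved address from an entry shaped like the one above, then checks the resolved IP against a set of currently live pod IPs. The function names are hypothetical, the caller has to supply the live IPs (for example from a pod listing), and the address split assumes IPv4 `ip:port` strings.

```python
def connect_target(entry):
    """Return (original_dst, resolved_addr) for a "Connecting" entry, else None."""
    attrs = entry.get("content", {}).get("attributes", {})
    fields = attrs.get("fields", {})
    if fields.get("message") != "Connecting":
        return None
    # The span that carries an "addr" holds the address the application dialed.
    original = next((s["addr"] for s in attrs.get("spans", []) if "addr" in s), None)
    resolved = fields.get("server", {}).get("addr")
    return original, resolved


def is_stale(entry, live_pod_ips):
    """True if the proxy is connecting to an IP that is not a live pod IP."""
    target = connect_target(entry)
    if target is None or target[1] is None:
        return False
    ip = target[1].rsplit(":", 1)[0]  # assumes IPv4 "ip:port"
    return ip not in live_pod_ips


# Trimmed copy of the log entry from the report.
SAMPLE = {
    "content": {
        "attributes": {
            "spans": [{"name": "outbound"}, {"name": "proxy", "addr": "10.167.0.1:4317"}],
            "level": "DEBUG",
            "fields": {"server": {"addr": "10.169.10.1:4317"}, "message": "Connecting"},
        }
    }
}

if __name__ == "__main__":
    print(connect_target(SAMPLE))
    print(is_stale(SAMPLE, {"10.169.11.1"}))
```

With the new collector PodIP from the report (10.169.11.1) as the only live IP, the sample entry is flagged as stale.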

output of linkerd check -o short

» linkerd check -o short
Status check results are √

Environment

  • Kubernetes Version: v1.24.10-gke.2300
  • Environment: GKE
  • Linkerd Version: stable-2.12.4

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

@Steffen911 Steffen911 added the bug label Mar 21, 2023
@Steffen911 Steffen911 changed the title from "Linked proxy fails to reconnect to restarted DaemonSet pod." to "Linkerd proxy fails to reconnect to restarted DaemonSet pod." Mar 21, 2023
@adleong
Member

adleong commented Mar 23, 2023

@Steffen911 thank you for this clear and detailed report. I believe this is a bug in the destination controller: when a proxy connects to a pod via the host IP, the controller does not properly watch for changes to which pod is running on that host. I think we have enough information to investigate and address this issue.
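As a toy illustration of the behaviour described here (not Linkerd's actual implementation, which this thread does not show), a watch keyed by host IP and port would need to push a new pod IP to subscribers whenever the pod behind that host port is replaced. All names in this sketch are hypothetical.

```python
class HostPortWatch:
    """Toy model: tracks which pod IP serves a (host_ip, port) pair and
    notifies subscribers whenever that mapping changes."""

    def __init__(self):
        self._pod_ip = {}       # (host_ip, port) -> current pod IP
        self._subscribers = {}  # (host_ip, port) -> list of callbacks

    def subscribe(self, host_ip, port, callback):
        key = (host_ip, port)
        self._subscribers.setdefault(key, []).append(callback)
        if key in self._pod_ip:
            callback(self._pod_ip[key])  # send the current state immediately

    def pod_started(self, host_ip, port, pod_ip):
        key = (host_ip, port)
        self._pod_ip[key] = pod_ip
        for callback in self._subscribers.get(key, []):
            callback(pod_ip)

    def pod_stopped(self, host_ip, port):
        key = (host_ip, port)
        if self._pod_ip.pop(key, None) is not None:
            for callback in self._subscribers.get(key, []):
                callback(None)  # None means "no endpoint right now"


if __name__ == "__main__":
    updates = []
    watch = HostPortWatch()
    watch.pod_started("10.167.0.1", 4317, "10.169.10.1")
    watch.subscribe("10.167.0.1", 4317, updates.append)
    watch.pod_stopped("10.167.0.1", 4317)                 # DaemonSet restart
    watch.pod_started("10.167.0.1", 4317, "10.169.11.1")  # replacement pod
    print(updates)
```

In this model the reported symptom corresponds to a subscriber that receives only the first update and never hears about the removal or the replacement pod.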

@adleong adleong added area/controller bug/staleness priority/P1 Planned for Release priority/triage and removed priority/P1 Planned for Release labels Mar 23, 2023
@adleong adleong added this to the stable-2.14.0 milestone Mar 23, 2023
@stale

stale bot commented Jun 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 21, 2023
@Steffen911
Author

Still relevant.

@stale stale bot removed the wontfix label Jun 22, 2023
@hawkw hawkw mentioned this issue Oct 19, 2023
hawkw added a commit that referenced this issue Oct 19, 2023
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can
result in Linkerd proxies sending traffic to stale endpoints. In addition, it
contains other bugfixes and updates dependencies to include patches for the
security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing
  changes in the endpoints of a destination, if a proxy subscribed to that
  destination stops reading service discovery updates. This issue results in
  proxies attempting to send traffic for that destination to stale endpoints
  ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not
  terminate unused service discovery watches, exerting backpressure on the
  Destination controller which could cause it to become stuck
  ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed
  from a load balancer. These logs are enabled by default, and can be disabled
  by [setting the proxy log level][proxy-log-level] to
  `warn,linkerd=info,linkerd_proxy_balance=warn` or similar
  ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a
  string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes
  [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `google.golang.org/grpc` Go
  package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm
  ([linkerd2-proxy#2488] and [#11512]).
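As a sketch of the log-level setting mentioned in the list above, the snippet below builds a merge patch that sets the `config.linkerd.io/proxy-log-level` annotation on a workload's pod template. The helper name is hypothetical, and applying the patch to a Deployment (rather than some other workload kind) is an assumption for illustration.

```python
import json

# Filter string quoted in the release notes above.
QUIET_BALANCER_LOGS = "warn,linkerd=info,linkerd_proxy_balance=warn"


def proxy_log_level_patch(level=QUIET_BALANCER_LOGS):
    """Build a merge patch that sets the proxy log level on a pod template."""
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {"config.linkerd.io/proxy-log-level": level}
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the patch as JSON so it can be passed to a patch command.
    print(json.dumps(proxy_log_level_patch()))
```

The printed JSON could be passed to something like `kubectl patch deployment <name> --type merge -p '<json>'`, where `<name>` is a placeholder for the workload.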

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11449]: #11449
[#11480]: #11480
[#11504]: #11504
[#11512]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
@hawkw
Contributor

hawkw commented Oct 20, 2023

The latest edge release, edge-23.10.3, contains changes to the proxy and Destination controller which should resolve these issues. We'd love your help validating this fix by trying out edge-23.10.3 in your environments and letting us know if you encounter any more use of stale endpoints. Thanks!

mateiidavid added a commit that referenced this issue Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unused service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <[email protected]>
Co-authored-by: Alejandro Pedraza <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>
@DavidMcLaughlin
Contributor

This fix is now available in stable-2.14.2. Please feel free to reopen if you continue to experience this issue after upgrading.
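A small sketch for checking whether a given stable version string is at or beyond the release named above. It handles only `stable-X.Y.Z` strings, since edge releases such as edge-23.10.3 use a different scheme, and it assumes that later stable releases keep the fix. The function names are hypothetical.

```python
import re

FIXED_IN = (2, 14, 2)  # stable-2.14.2, per the comment above


def parse_stable(version):
    """Parse "stable-X.Y.Z" into a tuple of ints; reject anything else."""
    match = re.fullmatch(r"stable-(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a stable-X.Y.Z version: {version!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(version):
    """True if the stable version is at or after stable-2.14.2."""
    return parse_stable(version) >= FIXED_IN


if __name__ == "__main__":
    print(includes_fix("stable-2.12.4"))  # version from the original report
    print(includes_fix("stable-2.14.2"))
```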

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2023