
hostPort pod restarts result in proxy connection failures in stable-2.13.5 despite fixes in stable-2.13.3 #11584

Closed
lifttocode opened this issue Nov 7, 2023 · 4 comments

lifttocode commented Nov 7, 2023

What is the issue?

The issue that was reported as fixed in stable-2.13.3 through #7589 appears to still occur in stable-2.13.5. We are experiencing similar connection-refused errors on reconnection. cc @alpeb

When the Datadog agent is redeployed (its pod is deleted and recreated), other applications on the same host are unable to reconnect to it even after it comes back up healthy. The Linkerd proxy logs show connection timeouts, "no route to host" errors, and connection refusals (shown below). I would expect that once the Datadog agent comes back up, the Linkerd proxy can successfully make connections again.

How can it be reproduced?

  1. Deploy Datadog according to their documentation and configure it to enable APM.
  2. Inject the host IP as the Datadog agent address, as described here:

     env:
       - name: DD_AGENT_HOST
         valueFrom:
           fieldRef:
             fieldPath: status.hostIP

  3. Configure the application to send tracing data to Datadog. For Java apps, these steps can be followed.
  4. Delete the Datadog agent pod running on the same host as the application.
  5. Follow the linkerd-proxy logs on the application to see the errors (a sketch of steps 4–5 is shown after this list).
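
A minimal sketch of steps 4–5, assuming the Datadog agent runs as a DaemonSet in a datadog namespace and the meshed application pod is called my-app-xxxx (all pod and namespace names here are placeholders, not taken from our cluster):

# Find the node the application pod runs on
NODE=$(kubectl get pod my-app-xxxx -o jsonpath='{.spec.nodeName}')

# Locate and delete the Datadog agent pod running on that same node
kubectl -n datadog get pods -o wide --field-selector spec.nodeName="$NODE"
kubectl -n datadog delete pod <datadog-agent-pod-on-that-node>

# Follow the application's linkerd-proxy logs and watch for connection errors
kubectl logs -f my-app-xxxx -c linkerd-proxy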

Logs, error output, etc

Linkerd proxy logs

[2176930.106950s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58778}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176931.109175s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58790}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176931.508782s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58804}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: No route to host (os error 113) error.sources=[error trying to connect: No route to host (os error 113), No route to host (os error 113)]
[2176934.320650s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58810}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176934.321310s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58822}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2176935.323857s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58838}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2176936.411565s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58844}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2177001.741203s]  WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed

output of linkerd check -o short

❯ ./linkerd2-cli-stable-2.13.5-darwin-arm64 check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.5 but the latest stable version is 2.14.2
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.5 but the latest stable version is 2.14.2
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-556886d49c-brvh6 (stable-2.13.5)
	* linkerd-destination-556886d49c-bsk52 (stable-2.13.5)
	* linkerd-destination-556886d49c-t9hfb (stable-2.13.5)
	* linkerd-identity-74778fb5ff-7mjbv (stable-2.13.5)
	* linkerd-identity-74778fb5ff-gxp8x (stable-2.13.5)
	* linkerd-identity-74778fb5ff-q2b9r (stable-2.13.5)
	* linkerd-proxy-injector-776c8f5bc4-d566n (stable-2.13.5)
	* linkerd-proxy-injector-776c8f5bc4-g84pb (stable-2.13.5)
	* linkerd-proxy-injector-776c8f5bc4-z59jw (stable-2.13.5)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.13/checks/#l5d-injection-disabled for hints

Status check results are √
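
For reference, the label mentioned in the linkerd-ha-checks hint above can be applied with a command along these lines (label key and value taken from the check output itself):

kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled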

Environment

  • Linkerd version: stable-2.13.5
  • Kubernetes version: v1.25.14
  • Cluster Environment: EKS
  • Host OS: linux
  • Linkerd CLI version: stable-2.13.5

Possible solution

N/A

Additional context

N/A

Would you like to work on fixing this bug?

no

@lifttocode lifttocode added the bug label Nov 7, 2023
@lifttocode lifttocode changed the title hostPort pod restarts result in proxy connection failures in stable-2.12.5 despite fixes in stable-2.12.3 hostPort pod restarts result in proxy connection failures in stable-2.13.5 despite fixes in stable-2.13.3 Nov 7, 2023

lifttocode commented Nov 7, 2023

I've noticed that PR #11328 appears to address an issue similar to the one described in this bug report. We are currently on stable-2.13.5. To verify whether the fix shipped in stable-2.13.7 resolves our issue, we will update to that version and monitor the behavior. I will provide an update with our findings once we have assessed the impact of the changes in the new release.
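
A rough sketch of how we plan to perform the upgrade (our exact install method and flags may differ; a Helm-based install would use helm upgrade instead):

# Install the stable-2.13.7 CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.13.7 sh

# Upgrade the control plane: CRDs first, then the core resources
linkerd upgrade --crds | kubectl apply -f -
linkerd upgrade | kubectl apply -f -

# Restart meshed workloads so they pick up the new proxy version
kubectl -n <app-namespace> rollout restart deployment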


alpeb commented Nov 7, 2023

Yes @lifttocode, this sounds a lot like the issue whose fix was backported into 2.13.7. Eager to hear what you find out!


lifttocode commented Dec 6, 2023

@alpeb Sorry for the late update! It's resolved in 2.13.7. Thank you!


alpeb commented Dec 6, 2023

Awesome, thanks for reporting back!

@alpeb alpeb closed this as completed Dec 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 6, 2024