What is the issue?
The issue that was reported as fixed in stable-2.13.3 through #7589 appears to still be occurring in stable-2.13.5. We are experiencing similar reconnection refusal errors. cc @alpeb
When a Datadog agent pod redeploys (the pod is deleted and recreated), other applications on the same host are unable to reconnect to it even after it comes back up healthy. The Linkerd proxy logs show connection timeouts, "no route to host" errors, and connection refusals (shown below). I would expect that once the Datadog agent comes back up, the Linkerd proxy is able to successfully make connections again.
How can it be reproduced?
1. Deploy Datadog according to their documentation and configure it to enable APM.
2. Inject the host IP as the Datadog agent address, as described here (see the sketch after this list).
3. Configure the application to send tracing data to Datadog. For Java apps, these steps can be followed.
4. Delete the Datadog agent pod running on the same host as the application.
5. Follow the linkerd-proxy logs on the application (e.g. `kubectl logs -f <app-pod> -c linkerd-proxy`) to see the errors.
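For reference, step 2 is typically done with the Kubernetes Downward API; a minimal sketch of the relevant container-spec fragment (`DD_AGENT_HOST` per the Datadog docs; the surrounding names are illustrative):

```yaml
# Illustrative fragment of the application's container spec:
# the Downward API injects the node's IP, which is where the
# hostPort-exposed Datadog agent listens for APM traffic (default port 8126).
env:
  - name: DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
```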
Logs, error output, etc
Linkerd proxy logs
[2176930.106950s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58778}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176931.109175s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58790}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176931.508782s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58804}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: No route to host (os error 113) error.sources=[error trying to connect: No route to host (os error 113), No route to host (os error 113)]
[2176934.320650s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58810}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176934.321310s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58822}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2176935.323857s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58838}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2176936.411565s] INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58844}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2177001.741203s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
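For context, os error 113 is EHOSTUNREACH and os error 111 is ECONNREFUSED, and both keep recurring well after the agent pod is healthy again. While this is happening, the proxy's view of the endpoint can be inspected from its metrics; a rough sketch using `linkerd diagnostics proxy-metrics` (pod and namespace names are made up):

```shell
# Dump the application proxy's metrics and filter for the agent's APM port;
# outbound transport errors against 10.0.73.245:8126 keep accumulating here.
linkerd diagnostics proxy-metrics -n my-namespace po/my-app-6d8f7c9b4-abcde | grep 8126
```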
output of linkerd check -o short
❯ ./linkerd2-cli-stable-2.13.5-darwin-arm64 check -o short
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.13.5 but the latest stable version is 2.14.2
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.13.5 but the latest stable version is 2.14.2
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-556886d49c-brvh6 (stable-2.13.5)
* linkerd-destination-556886d49c-bsk52 (stable-2.13.5)
* linkerd-destination-556886d49c-t9hfb (stable-2.13.5)
* linkerd-identity-74778fb5ff-7mjbv (stable-2.13.5)
* linkerd-identity-74778fb5ff-gxp8x (stable-2.13.5)
* linkerd-identity-74778fb5ff-q2b9r (stable-2.13.5)
* linkerd-proxy-injector-776c8f5bc4-d566n (stable-2.13.5)
* linkerd-proxy-injector-776c8f5bc4-g84pb (stable-2.13.5)
* linkerd-proxy-injector-776c8f5bc4-z59jw (stable-2.13.5)
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
see https://linkerd.io/2.13/checks/#l5d-injection-disabled for hints
Status check results are √
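Unrelated to the reconnect bug, but for completeness: the linkerd-ha-checks warning above can be cleared by labeling the namespace as the check hint suggests, e.g.:

```shell
# Tell the proxy injector to skip kube-system, per the l5d-injection-disabled hint
kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled
```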
Environment
Linkerd version: stable-2.13.5
Kubernetes version: v1.25.14
Cluster Environment: EKS
Host OS: linux
Linkerd CLI version: stable-2.13.5
Possible solution
N/A
Additional context
N/A
Would you like to work on fixing this bug?
no
lifttocode changed the title from "hostPort pod restarts result in proxy connection failures in stable-2.12.5 despite fixes in stable-2.12.3" to "hostPort pod restarts result in proxy connection failures in stable-2.13.5 despite fixes in stable-2.13.3" on Nov 7, 2023
I've noticed that PR #11328 appears to address a similar issue to the one described in this bug report. We are currently on stable-2.13.5. To verify whether the fix shipped in stable-2.13.7 resolves our issue, we will upgrade to that version and monitor the behavior. I will provide an update with our findings once we have assessed the impact of the changes in the new release.
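For anyone following along, this is roughly the upgrade path we intend to use: a sketch assuming a CLI-managed install rather than Helm, and assuming the install script honors the `LINKERD2_VERSION` variable for release pinning:

```shell
# Pin the CLI to the release we want to test
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.13.7 sh
# Upgrade CRDs first, then the control plane, per the standard upgrade flow
linkerd upgrade --crds | kubectl apply -f -
linkerd upgrade | kubectl apply -f -
```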