
hostPort pod restarts result in proxy connection failures in stable-2.13.5 despite fixes in stable-2.13.3 #11584

Closed
lifttocode opened this issue Nov 7, 2023 · 4 comments

lifttocode commented Nov 7, 2023

What is the issue?

The issue that was reported as fixed in stable-2.13.3 through #7589 appears to still occur in stable-2.13.5. We are experiencing similar connection-refused errors on reconnection. cc @alpeb

When the Datadog agent is redeployed (its pod is deleted and recreated), other applications on the same host are unable to reconnect to it even after it comes back up healthy. The Linkerd proxy logs show connection timeouts, "no route to host" errors, and connection refusals (shown below). I would expect that once the Datadog agent comes back up, the Linkerd proxy can successfully make connections again.

How can it be reproduced?

  1. Deploy Datadog according to their documentation and configure it to enable APM.
  2. Inject the host IP as the Datadog agent address, as described here:

     env:
       - name: DD_AGENT_HOST
         valueFrom:
           fieldRef:
             fieldPath: status.hostIP

  3. Configure the application to send tracing data to Datadog. For Java apps, these steps can be followed.
  4. Delete the Datadog agent pod running on the same host as the application.
  5. Follow the linkerd-proxy logs on the application to see the errors (a sketch of steps 4–5 is shown after this list).
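
A minimal sketch of steps 4–5, assuming the Datadog agent runs as a DaemonSet in a datadog namespace and the meshed application pod is called my-app-xxxx (all pod and namespace names here are placeholders, not taken from our cluster):

# Find the node the application pod runs on
NODE=$(kubectl get pod my-app-xxxx -o jsonpath='{.spec.nodeName}')

# Locate and delete the Datadog agent pod running on that same node
kubectl -n datadog get pods -o wide --field-selector spec.nodeName="$NODE"
kubectl -n datadog delete pod <datadog-agent-pod-on-that-node>

# Follow the application's linkerd-proxy logs and watch for connection errors
kubectl logs -f my-app-xxxx -c linkerd-proxy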

Logs, error output, etc

Linkerd proxy logs

[2176930.106950s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58778}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176931.109175s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58790}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176931.508782s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58804}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: No route to host (os error 113) error.sources=[error trying to connect: No route to host (os error 113), No route to host (os error 113)]
[2176934.320650s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58810}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: connect timed out after 1s error.sources=[error trying to connect: connect timed out after 1s, connect timed out after 1s]
[2176934.321310s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58822}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2176935.323857s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58838}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2176936.411565s]  INFO ThreadId(01) outbound:proxy{addr=10.0.77.252:8126}:forward{addr=10.0.73.245:8126}:rescue{client.addr=10.0.79.118:58844}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.0.73.245:8126: error trying to connect: Connection refused (os error 111) error.sources=[error trying to connect: Connection refused (os error 111), Connection refused (os error 111)]
[2177001.741203s]  WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed

output of linkerd check -o short

❯ ./linkerd2-cli-stable-2.13.5-darwin-arm64 check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.5 but the latest stable version is 2.14.2
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.5 but the latest stable version is 2.14.2
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-556886d49c-brvh6 (stable-2.13.5)
	* linkerd-destination-556886d49c-bsk52 (stable-2.13.5)
	* linkerd-destination-556886d49c-t9hfb (stable-2.13.5)
	* linkerd-identity-74778fb5ff-7mjbv (stable-2.13.5)
	* linkerd-identity-74778fb5ff-gxp8x (stable-2.13.5)
	* linkerd-identity-74778fb5ff-q2b9r (stable-2.13.5)
	* linkerd-proxy-injector-776c8f5bc4-d566n (stable-2.13.5)
	* linkerd-proxy-injector-776c8f5bc4-g84pb (stable-2.13.5)
	* linkerd-proxy-injector-776c8f5bc4-z59jw (stable-2.13.5)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.13/checks/#l5d-injection-disabled for hints

Status check results are √
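
For reference, the label mentioned in the linkerd-ha-checks hint above can be applied with a command along these lines (label key and value taken from the check output itself):

kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled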

Environment

  • Linkerd version: stable-2.13.5
  • Kubernetes version: v1.25.14
  • Cluster Environment: EKS
  • Host OS: linux
  • Linkerd CLI version: stable-2.13.5

Possible solution

N/A

Additional context

N/A

Would you like to work on fixing this bug?

no

@lifttocode lifttocode added the bug label Nov 7, 2023
@lifttocode lifttocode changed the title hostPort pod restarts result in proxy connection failures in stable-2.12.5 despite fixes in stable-2.12.3 hostPort pod restarts result in proxy connection failures in stable-2.13.5 despite fixes in stable-2.13.3 Nov 7, 2023

lifttocode commented Nov 7, 2023

I've noticed that PR #11328 appears to address an issue similar to the one described in this bug report. We are currently on stable-2.13.5. To verify whether the fix shipped in stable-2.13.7 resolves our issue, we will update to that version and monitor the behavior. I will provide an update with our findings once we have assessed the impact of the changes in the new release.
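
A rough sketch of how we plan to perform the upgrade (our exact install method and flags may differ; a Helm-based install would use helm upgrade instead):

# Install the stable-2.13.7 CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.13.7 sh

# Upgrade the control plane: CRDs first, then the core resources
linkerd upgrade --crds | kubectl apply -f -
linkerd upgrade | kubectl apply -f -

# Restart meshed workloads so they pick up the new proxy version
kubectl -n <app-namespace> rollout restart deployment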


alpeb commented Nov 7, 2023

Yes @lifttocode, this sounds a lot like the issue whose fix was backported into 2.13.7. Eager to hear what you find out!


lifttocode commented Dec 6, 2023

@alpeb Sorry for the late update! It's resolved in 2.13.7. Thank you!


alpeb commented Dec 6, 2023

Awesome, thanks for reporting back!

@alpeb alpeb closed this as completed Dec 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 6, 2024