Proxy keeps trying to connect to stale identity IP address #6184

Closed
alex-berger opened this issue May 26, 2021 · 21 comments

@alex-berger
Contributor

Bug Report

While testing #6157 and #6146, I observed another occurrence of the linkerd-proxy trying to connect to stale (no longer existing) endpoints.

What is the issue?

One of our meshed ingress gateways (running in ingress mode) tries to connect to a stale linkerd-identity address 10.40.25.18:8080. That IP address is currently not in use by any Pod, but the affected linkerd-proxy instance keeps trying to connect to it, as can be seen in the log and in the strace excerpt below.
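
(A quick way to confirm that no Pod currently holds that address is something like kubectl get pods -A -o wide | grep 10.40.25.18, which comes back empty here.)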

Note that this affects only a single linkerd-proxy sidecar instance in our cluster, which is suspicious and suggests some kind of bad state (perhaps a race condition).

How can it be reproduced?

Logs, error output, etc

Linkerd-proxy log message:

The full log is available from this gist.

{"timestamp":"[ 99086.550700s]","level":"DEBUG","fields":{"message":"Initiating TLS connection","server.id":"linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_tls::client","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 99086.551004s]","level":"DEBUG","fields":{"message":"Connecting","server.addr":"10.40.25.18:8080"},"target":"linkerd_proxy_transport::connect","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 99087.472380s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 99087.473194s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(3)"}

Linkerd-proxy strace excerpt:

12554 12:24:30 write(1<pipe:[40991]>, "{\"timestamp\":\"[ 99086.550700s]\",\"level\":\"DEBUG\",\"fields\":{\"message\":\"Initiating TLS connection\",\"server.id\":\"linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local\"},\"target\":\"linkerd_tls::client\",\"sp
ans\":[{\"name\":\"daemon\"},{\"name\":\"identity\"}],\"threadId\":\"ThreadId(3)\"}\n", 287 <unfinished ...>
12553 12:24:30 <... sched_yield resumed> ) = 0
12554 12:24:30 <... write resumed> )    = 287
12554 12:24:30 write(1<pipe:[40991]>, "{\"timestamp\":\"[ 99086.551004s]\",\"level\":\"DEBUG\",\"fields\":{\"message\":\"Connecting\",\"server.addr\":\"10.40.25.18:8080\"},\"target\":\"linkerd_proxy_transport::connect\",\"spans\":[{\"name\":\"daemon\"},{\"name\":\"identity\"}],\"threadId\":\"ThreadId(3)\"}\n", 233) = 233
12554 12:24:30 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 38<TCP:[12935778]>
12553 12:24:30 sched_yield( <unfinished ...>
12554 12:24:30 connect(38<TCP:[12935778]>, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.40.25.18")}, 16) = -1 EINPROGRESS (Operation now in progress)
12554 12:24:30 epoll_ctl(5<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 38<TCP:[10.40.2.232:38436->10.40.25.18:8080]>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1560281109, u64=1560281109}} <unfinished ...>
12553 12:24:30 <... sched_yield resumed> ) = 0
12553 12:24:30 epoll_wait(3<anon_inode:[eventpoll]>,  <unfinished ...>
12554 12:24:30 <... epoll_ctl resumed> ) = 0
...
12554 12:24:31 epoll_ctl(5<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 38<TCP:[10.40.2.232:38436->10.40.25.18:8080]>, NULL <unfinished ...>
12553 12:24:31 <... futex resumed> )    = 0
12554 12:24:31 <... epoll_ctl resumed> ) = 0
12553 12:24:31 sched_yield( <unfinished ...>
12554 12:24:31 close(38<TCP:[10.40.2.232:38436->10.40.25.18:8080]> <unfinished ...>
12553 12:24:31 <... sched_yield resumed> ) = 0
12553 12:24:31 sched_yield( <unfinished ...>
12554 12:24:31 <... close resumed> )    = 0
12553 12:24:31 <... sched_yield resumed> ) = 0
12553 12:24:31 sched_yield()            = 0
12554 12:24:31 write(1<pipe:[40991]>, "{\"timestamp\":\"[ 99087.554720s]\",\"level\":\"DEBUG\",\"fields\":{\"message\":\"Failed to connect\",\"error\":\"request timed out\"},\"target\":\"linkerd_reconnect::service\",\"spans\":[{\"name\":\"daemon\"},{\"name\":\"identity\"}],\"threadId\":\"ThreadId(3)\"}\n", 229 <unfinished ...>
12553 12:24:31 epoll_wait(3<anon_inode:[eventpoll]>,  <unfinished ...>
12554 12:24:31 <... write resumed> )    = 229
12554 12:24:31 write(1<pipe:[40991]>, "{\"timestamp\":\"[ 99087.554914s]\",\"level\":\"DEBUG\",\"fields\":{\"message\":\"Recovering\"},\"target\":\"linkerd_reconnect::service\",\"spans\":[{\"name\":\"daemon\"},{\"name\":\"identity\"}],\"threadId\":\"ThreadId(3)\"}\n", 194) = 194

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 21.5.1 but the latest edge version is 21.5.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
‼ control plane and cli versions match
    control plane running edge-21.5.2 but cli running edge-21.5.1
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-559cf748b-22992 (a07b33d7)
        * linkerd-destination-559cf748b-7ldlz (a07b33d7)
        * linkerd-destination-559cf748b-hg9hq (a07b33d7)
        * linkerd-identity-796d8b54f-4zm7t (a07b33d7)
        * linkerd-identity-796d8b54f-7bqsm (a07b33d7)
        * linkerd-identity-796d8b54f-gvqng (a07b33d7)
        * linkerd-proxy-injector-74c5f4889d-8dxph (a07b33d7)
        * linkerd-proxy-injector-74c5f4889d-bldh2 (a07b33d7)
        * linkerd-proxy-injector-74c5f4889d-vcz4t (a07b33d7)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-559cf748b-22992 running a07b33d7 but cli running edge-21.5.1
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Status check results are √

Environment

  • Kubernetes Version: 1.19.6
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd version: edge-21.5.2 with

Possible solution

Additional context

@olix0r
Copy link
Member

olix0r commented May 26, 2021

@alex-berger Thanks for another great bug report.

The first thing I'd want to look into is the state of the pod's DNS: proxies discover the identity service by resolving SRV records for linkerd-identity-headless.linkerd... There should be some diagnostic logging from the linkerd_dns crate every ~5s, but it looks like the logging we've captured in that gist is too verbose (or covers too short a window) to have captured that. If you happen to have more logs, can you grep out resolve or dns and share what you find?
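
For example, something along these lines should surface them (namespace and pod name are placeholders):

kubectl logs -n <gateway-namespace> <gateway-pod> linkerd-proxy --since=1h | grep -E 'dns|resolve'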

@alex-berger
Contributor Author

alex-berger commented May 26, 2021

@olix0r I grepped for dns|resolve over a slightly larger time window (~10 seconds) and got the following log messages; there are no sightings of linkerd-identity-headless.linkerd... in there:

{"timestamp":"[ 98617.927784s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-dst-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98617.932566s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98617.932627s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_proxy_dns_resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98617.934687s]","level":"DEBUG","fields":{"message":"Resolving","dst":"grafana.kube-system.svc.cluster.local:3000","context":"{\"ns\":\"gloo-system\", \"nodeName\":\"ip-10-40-9-160.eu-central-1.compute.internal\"}\n"},"target":"linkerd_proxy_api_resolve::resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98617.937693s]","level":"TRACE","fields":{"metadata":"MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Wed, 26 May 2021 12:16:41 GMT\"} }"},"target":"linkerd_proxy_api_resolve::resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98617.937761s]","level":"TRACE","fields":{"message":"Connected"},"target":"linkerd_proxy_resolve::recover","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98617.937892s]","level":"DEBUG","fields":{"message":"Add","endpoints":"1"},"target":"linkerd_proxy_api_resolve::resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98617.937914s]","level":"TRACE","fields":{"message":"Resolved endpoint","addr":"10.40.30.51:3000","metadata":"Metadata { labels: {\"deployment\": \"grafana-deployment\", \"namespace\": \"kube-system\", \"pod\": \"grafana-deployment-87bbb98-44q59\", \"pod_template_hash\": \"87bbb98\", \"service\": \"grafana\", \"serviceaccount\": \"grafana-serviceaccount\"}, protocol_hint: Unknown, opaque_transport_port: None, identity: None, authority_override: None }","concrete":"Concrete { resolve: ConcreteAddr(NameAddr { name: \"grafana.kube-system.svc.cluster.local.\", port: 3000 }), logical: Logical { protocol: Http1, profile: .., logical_addr: LogicalAddr(grafana.kube-system.svc.cluster.local:3000) } }"},"target":"linkerd_app_outbound::endpoint","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98622.933614s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-dst-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98622.935086s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98622.935131s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_proxy_dns_resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98631.534495s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-dst-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98631.536194s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98631.536270s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_proxy_dns_resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98663.194286s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-dst-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98663.196208s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98663.196235s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_proxy_dns_resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98781.228775s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-dst-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98781.230394s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98781.230447s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_proxy_dns_resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98781.233032s]","level":"DEBUG","fields":{"message":"Resolving","dst":"grafana.kube-system.svc.cluster.local:3000","context":"{\"ns\":\"gloo-system\", \"nodeName\":\"ip-10-40-9-160.eu-central-1.compute.internal\"}\n"},"target":"linkerd_proxy_api_resolve::resolve","spans":[{"name":"outbound"},{"client.addr":"10.40.2.232:57436","name":"accept"},{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98781.248482s]","level":"TRACE","fields":{"metadata":"MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Wed, 26 May 2021 12:19:24 GMT\"} }"},"target":"linkerd_proxy_api_resolve::resolve","spans":[{"name":"outbound"},{"client.addr":"10.40.2.232:57436","name":"accept"},{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98781.248559s]","level":"TRACE","fields":{"message":"Connected"},"target":"linkerd_proxy_resolve::recover","spans":[{"name":"outbound"},{"client.addr":"10.40.2.232:57436","name":"accept"},{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98781.248727s]","level":"DEBUG","fields":{"message":"Add","endpoints":"1"},"target":"linkerd_proxy_api_resolve::resolve","spans":[{"name":"outbound"},{"client.addr":"10.40.2.232:57436","name":"accept"},{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98781.248759s]","level":"TRACE","fields":{"message":"Resolved endpoint","addr":"10.40.30.51:3000","metadata":"Metadata { labels: {\"deployment\": \"grafana-deployment\", \"namespace\": \"kube-system\", \"pod\": \"grafana-deployment-87bbb98-44q59\", \"pod_template_hash\": \"87bbb98\", \"service\": \"grafana\", \"serviceaccount\": \"grafana-serviceaccount\"}, protocol_hint: Unknown, opaque_transport_port: None, identity: None, authority_override: None }","concrete":"Concrete { resolve: ConcreteAddr(NameAddr { name: \"grafana.kube-system.svc.cluster.local.\", port: 3000 }), logical: Logical { protocol: Http1, profile: .., logical_addr: LogicalAddr(grafana.kube-system.svc.cluster.local:3000) } }"},"target":"linkerd_app_outbound::endpoint","spans":[{"name":"outbound"},{"client.addr":"10.40.2.232:57436","name":"accept"},{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"v":"1.x","name":"http"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"logical"},{"addr":"grafana.kube-system.svc.cluster.local:3000","name":"concrete"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98786.232276s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-dst-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(3)"}
{"timestamp":"[ 98786.234414s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_dns","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 98786.234437s]","level":"DEBUG","fields":{"addrs":"[10.40.12.80:8086, 10.40.16.95:8086, 10.40.32.167:8086]"},"target":"linkerd_proxy_dns_resolve","spans":[{"orig_dst":"10.40.30.51:3000","name":"ingress"},{"dst":"grafana.kube-system.svc.cluster.local:3000","name":"override"}],"threadId":"ThreadId(2)"}

I scanned another hour of logs for that Pod without finding any log message for linkerd-identity-headless.linkerd....

@olix0r
Member

olix0r commented May 26, 2021

Interesting! I think this at least gives us some pointers about what to look into.

In the meantime, you may want to run with a more targeted logging configuration so we get some better signal if it recurs after restart:

config.linkerd.io/proxy-log-level: warn,linkerd=info,linkerd_dns=debug,linkerd_proxy_dns_resolve=debug,linkerd_proxy_resolve=trace,linkerd_reconnect=debug
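
If it helps, the annotation can be set directly on the workload's pod template, e.g. like this (namespace and deployment name are placeholders; changing the template rolls the pods, so the proxy picks up the new level at startup):

kubectl -n <gateway-namespace> patch deploy <gateway-deployment> --type merge -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=info,linkerd_dns=debug,linkerd_proxy_dns_resolve=debug,linkerd_proxy_resolve=trace,linkerd_reconnect=debug"}}}}}'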

@olix0r olix0r added this to the stable-2.11.0 milestone May 26, 2021
@olix0r olix0r added area/identity Automatic transport identity bug/staleness labels May 26, 2021
@alex-berger
Contributor Author

alex-berger commented May 28, 2021

@olix0r I applied the log configuration that you recommended to our ingress gateways. Now the problem appears again on one of our ingress gateway Pods, but the log of that Pod still does not contain any recent log messages matching dns|resolve. There are a handful of such log messages immediately after the start of the Pod (see below), but then no more such messages appear for hours.

Note that the logs of the other ingress gateway Pods, which show no symptoms, contain plenty of log messages matching dns|resolve. So it looks like the affected linkerd-proxy instance does not even try to resolve anything (anymore).

{"timestamp":"[    15.015029s]","level":"DEBUG","fields":{"addrs":"[10.176.15.31:8080, 10.176.33.89:8080, 10.176.42.139:8080]"},"target":"linkerd_proxy_dns_resolve","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[    15.014978s]","level":"DEBUG","fields":{"addrs":"[10.176.15.31:8080, 10.176.33.89:8080, 10.176.42.139:8080]"},"target":"linkerd_dns","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[    15.013344s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-identity-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[    10.012737s]","level":"DEBUG","fields":{"addrs":"[10.176.15.31:8080, 10.176.33.89:8080, 10.176.42.139:8080]"},"target":"linkerd_proxy_dns_resolve","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[    10.012693s]","level":"DEBUG","fields":{"addrs":"[10.176.15.31:8080, 10.176.33.89:8080, 10.176.42.139:8080]"},"target":"linkerd_dns","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[    10.010689s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-identity-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[     5.009151s]","level":"TRACE","fields":{"message":"Connected"},"target":"linkerd_proxy_resolve::recover","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[     5.009085s]","level":"DEBUG","fields":{"addrs":"[10.176.15.31:8080, 10.176.33.89:8080, 10.176.42.139:8080]"},"target":"linkerd_dns","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[     0.004939s]","level":"DEBUG","fields":{"message":"resolve_srv","name":"linkerd-identity-headless.linkerd.svc.cluster.local."},"target":"linkerd_dns","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(1)"}
{"timestamp":"[     0.003721s]","level":"INFO","fields":{"message":"Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}

The log of the affected linkerd-proxy is instead full of the messages below, where 10.176.2.62 is a stale IP address no longer assigned to any Pod. Is it possible to deduce from the log whether that IP address was resolved by linkerd-proxy (using the l5d-dst-override or host/authority header), or whether linkerd-proxy just forwards the request to the original destination IP address (without doing any resolution or address rewriting)? I am asking because I am wondering who comes up with 10.176.2.62: the linkerd-proxy sidecar container or the envoy (gloo-edge gateway) container:

{"timestamp":"[ 11341.956104s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11341.956056s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11339.453263s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11339.453201s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11336.951308s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11336.951260s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11334.449317s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11334.449263s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11331.947645s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11331.947507s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11329.445082s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11329.445033s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11326.941779s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11326.941723s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11324.439668s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11324.439609s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11321.936539s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11321.936481s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11319.434664s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11319.434558s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11316.932468s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11316.932413s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11314.429742s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11314.429685s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11311.927277s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11311.927220s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11309.425028s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11309.424976s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11306.922143s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 11306.922081s]","level":"DEBUG","fields":{"message":"Failed to connect","error":"request timed out"},"target":"linkerd_reconnect::service","spans":[{"orig_dst":"10.176.2.62:9977","name":"ingress"},{"name":"forward"}],"threadId":"ThreadId(1)"}

@alex-berger
Contributor Author

Actually, doesn't this look rather similar to what we had in #6037 (comment), a retry (reconnect) loop that never ends? Some first ad-hoc strace sessions reveal that there seems to be no connection (from envoy) to the linkerd-proxy that could be triggering those outbound connections to 10.176.2.62.
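
For reference, this is roughly how one can cross-check on the node hosting the Pod (illustrative only; pgrep may also match proxies of other meshed Pods on the same node):

PID=$(pgrep -f linkerd2-proxy | head -n 1)          # pick the proxy process of the affected Pod
nsenter -t "$PID" -n ss -tnp | grep 10.176.2.62     # list sockets in the Pod's network namespace involving the stale address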

@olix0r olix0r added area/proxy and removed area/identity Automatic transport identity labels May 28, 2021
@olix0r
Copy link
Member

olix0r commented May 28, 2021

@alex-berger Thanks! I think we have a pretty good understanding of why this is happening. I think it's similar to the other issue in that this probably can't cause any runtime problems -- as soon as the identity client is actually used, it should trigger a re-resolution and get a fresh set of endpoints to use. However, this is definitely undesirable behavior as-is, and we'll make changes to the proxy to avoid this state. We need to do a little more thinking on exactly what these changes should be, though ;)

@olix0r olix0r added priority/P1 Planned for Release and removed bug/staleness labels May 28, 2021
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 29, 2021
We currently build the identity client at startup and hold it to be used
fairly infrequently: the client is used once per day by default, and
even in extremely aggressive scenarios is unlikely to be used more
frequently than once every few minutes. As such, there's very little
benefit to holding the client (and buffers, DNS resolutions, etc) open
continually. Instead, it seems preferable to instantiate the identity
client only as it's needed.

Practically, we see issues like linkerd/linkerd2#6184 where the identity
client may try to reconnect to stale endpoints when the identity
deployment is rescheduled (because there isn't a steady stream of
requests on this client).

This change makes the controller stack a `NewService<()>` so that
clients can be instantiated lazily. The identity module now creates a
new connection for each identity request. Other controller clients are
unaffected, continuing to use long-lived clients.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 29, 2021
@olix0r
Member

olix0r commented May 29, 2021

I've published an image with the fix proposed in linkerd/linkerd2-proxy#1021 to ghcr.io/olix0r/l2-proxy:e269429a
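
If you want to try it on a single workload, the proxy image and version can be overridden via annotations on the pod template, roughly like this (namespace and deployment name are placeholders):

kubectl -n <namespace> patch deploy <workload> --type merge -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-image":"ghcr.io/olix0r/l2-proxy","config.linkerd.io/proxy-version":"e269429a"}}}}}'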

@alex-berger
Contributor Author

@olix0r the stale address 10.176.2.62:9977 from the above logs does not look like it is related to the identity client (wrong port). In fact, there is only one service using that port in our cluster, and that is the gloo xDS controller. So there must be more to this than just fixing the identity client.

FYI, I guess that in the case of 10.176.2.62:9977 there is no l5d-dst-override header set, as this is part of the envoy control plane (xDS) communication path.

@olix0r
Member

olix0r commented May 30, 2021

The original issue was definitely the identity client, so I think that change is still worth merging.

Is it possible to deduce from the log whether that IP address was resolved by linkerd-proxy ( using l5d-dst-override or host/authority header) or whether linkerd-proxy just forwards the request to the original IP address (without doing any resolving resp. address rewriting)?

In the logs you shared, the scope {"name":"forward"} indicates that this is using the original-dst-address provided by the client (src) without doing any discovery in Linkerd. We'll take a deeper look at this next week, but I'd expect the client to idle out and be dropped unless envoy continually tried to send requests to it.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 1, 2021
@alex-berger
Contributor Author

alex-berger commented Jun 4, 2021

I installed edge-21.6.1, which contains the fix for the identity client (@olix0r thank you very much for that ♥), but we still see connections to stale (or outright wrong) IP addresses in our linkerd-proxy logs.

I also have the suspicion that the following statement is not true, as some strace analysis on envoy reveals that it never tries to connect to those stale IP addresses. So it really looks like this is caused by the outbound linkerd-proxy, which for whatever reason tries to connect to the wrong target:

In the logs you shared, the scope {"name":"forward"} indicates that this is using the original-dst-address provided by the client (src) without doing any discovery in Linkerd. We'll take a deeper look at this next week, but I'd expect the client to idle out and be dropped unless envoy continually tried to send requests to it.

I also suspect that this happens only if the meshed HTTP client reuses the same TCP connection for multiple requests (e.g. HTTP/1.1 without Connection: close, or HTTP/2 and gRPC streams). Clients that issue a single HTTP request per TCP connection (and then immediately close that connection) seem not to suffer from this problem.


Note, this is a serious problem! Whenever an IP address is reassigned to a new Pod that listens on the same port as the stale endpoint, traffic is successfully routed to the wrong Pod by the outbound linkerd-proxy. This has several severe implications:

  • This must be considered a data breach, as data is sent to the wrong IP address, which might now be assigned to another Pod. Think of a Pod in a different namespace that now receives traffic and data intended for another namespace (tenant/customer/...), which is a clear violation of any data privacy promise.
  • It causes service disruption, because the outbound linkerd-proxy will receive valid HTTP responses (e.g. status code 400) and does not realize that it is talking to the wrong Pod.

To recap, this is not just about stale targets; it is also about routing traffic to the wrong target as IP addresses are reassigned to (reused by) new Pods. And this happens all the time on our clusters, so it is not a weird edge case, it's the default.


Maybe this is a classic over-optimization trap, where we are victims of caching (reusing old information) instead of resolving destinations for each request (or at least once every N seconds). Whatever the cause, IMHO linkerd should prioritize correctness over performance.

@alex-berger
Contributor Author

I think I managed to identify one specific use-case where this happens.

With long-lived gRPC connections, the outbound linkerd-proxy does not close the (inner) connection if it fails to (re)connect to the destination address. Instead, it keeps trying to reconnect to the (now stale) destination address.

(Diagram: use-case-forward-long-lived-connections)

I would expect the linkerd-proxy to close the (inner) connection if it fails to (re)connect to the destination. This would cause the meshed client to reconnect to the new (non-stale) destination address, and everything would be fine.

@olix0r
Member

olix0r commented Jun 4, 2021

OK, I think this sounds like a plausible thesis about the behavior. Looking at our error handling, I see that we may not properly tear down connections when HTTP/2 is being used:

https://github.com/linkerd/linkerd2-proxy/blob/00c3394ab36eeff14192e6ae52c41238376daabd/linkerd/app/core/src/errors.rs#L195-L207

Here, we only return a stream-level reset when the error includes h2 stream reset information. I'd be curious whether we see the "Propagating HTTP2 reset" message. If we do see that message logged, then we need to inspect the h2 reset further to determine whether the connection can be torn down. But that may not even be enough to totally eliminate this... I could imagine a scenario where, if the IP is recycled quickly enough, it would be indistinguishable from a temporary network interruption.
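
If you still have proxy logs from an affected pod handy, grepping for that message would help confirm, e.g. (namespace and pod name are placeholders):

kubectl -n <namespace> logs <affected-pod> linkerd-proxy | grep -F 'Propagating HTTP2 reset'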

We can probably reproduce parts of this issue given the information we have, but I'm not sure how we can make a Kubernetes cluster reuse IPs reliably in tests to cover the whole scenario you're seeing... Anyway, I think we have enough information to dig into a fix, but it's going to be tricky to get full integration tests to cover this.

@olix0r olix0r added priority/P0 Release Blocker and removed priority/P1 Planned for Release labels Jun 4, 2021
@olix0r
Member

olix0r commented Jun 5, 2021

OK, I'm able to reproduce this with emojivoto. It seems specific to ingress-mode, which is a little surprising to me, but at least it's a solid lead:

In a k3s cluster with the latest edge release installed, I've configured the voting and emoji services with clusterIP: None so that linkerd doesn't build a load balancer, etc.
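
(A quick sanity check that the services are actually headless; service names follow the emojivoto manifests:

:; k get svc -n emojivoto emoji-svc voting-svc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.clusterIP}{"\n"}{end}'

Both should print None.)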

In both cases, I watched the web logs

:; k logs -n emojivoto deploy/web linkerd-proxy -f |tee $kind.log

While restarting the voting service:

:; k rollout restart -n emojivoto deploy/voting

With linkerd.io/inject: enabled we see:

[    23.419944s]  WARN ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}:server{orig_dst=10.42.0.26:8080}:profile:http{v=h2}: linkerd_timeout::failfast: HTTP Server entering failfast after 3s
[    23.419959s] DEBUG ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}:server{orig_dst=10.42.0.26:8080}:profile:http{v=h2}: linkerd_timeout::failfast: HTTP Server in failfast
[    23.419975s]  WARN ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}:server{orig_dst=10.42.0.26:8080}:profile:http{v=h2}: linkerd_app_core::errors: Failed to proxy request: HTTP Server service in fail-fast client.addr=10.42.0.28:40890
[    23.419979s] DEBUG ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}:server{orig_dst=10.42.0.26:8080}:profile:http{v=h2}: linkerd_app_core::errors: Closing server-side connection
[    23.419988s] DEBUG ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}:server{orig_dst=10.42.0.26:8080}:profile:http{v=h2}: linkerd_app_core::errors: Handling error with gRPC status code=Unavailable
[    23.420068s] DEBUG ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}:server{orig_dst=10.42.0.26:8080}:profile:http{v=h2}: linkerd_proxy_http::server: The stack is tearing down the connection
[    23.420278s] DEBUG ThreadId(01) outbound:accept{client.addr=10.42.0.28:40890}: linkerd_app_core::serve: Connection closed

That is, the failfast error triggers the application's client connection to be torn down, so the client re-resolves the target.

With linkerd.io/inject: ingress, however, we never see any kind of error! Instead we simply hold the request indefinitely. The easiest fix is to add a failfast timeout on this stack. I've put up a branch and image that does this, available via ghcr.io/olix0r/l2-proxy:b7b591c1. But before we go ahead and merge this, I'm going to take a deeper look at how to avoid reconnecting in cases like this.

Full manifests and log files here.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 5, 2021
The ingress-mode proxy's forwarding stack--used when a request does not
set the `l5d-dst-override` header--has no failfast implementation. This
means that when a connection can't be obtained for the endpoint,
requests are buffered indefinitely.

This change adds a failfast layer so that these requests are failed
eagerly after 3s of unavailability, causing the server-side connection to
be dropped (so that the application client may re-resolve the endpoint).

This is really a temporary solution. We should probably avoid
implementing reconnection at all in this case so that connection errors
can be used in place of failfast errors.

Related to linkerd/linkerd2#6184
@alex-berger
Contributor Author

@olix0r I wanted to test your changes, but was not able to pull the image ghcr.io/olix0r/l2-proxy:b7b591c1.

@olix0r
Member

olix0r commented Jun 5, 2021

@alex-berger 🤦 helps if I actually push the image, sorry.

:; docker push ghcr.io/olix0r/l2-proxy:b7b591c1 
The push refers to repository [ghcr.io/olix0r/l2-proxy]
017143603bda: Pushed 
25ab8a76737b: Pushed 
a305cb840bb8: Pushed 
598383fbfa3b: Layer already exists 
2f827ea23f74: Layer already exists 
cc01cd1edc16: Layer already exists 
421e1feee777: Layer already exists 
41936fff8b42: Layer already exists 
f4697116d561: Layer already exists 
9eb82f04c782: Layer already exists 
b7b591c1: digest: sha256:666614c761b65354b376804b9c4e4611df7ae4738734083fc8a74594564b620b size: 2409

@alex-berger
Contributor Author

@olix0r First tests look promising: the temporary solution introduced by c49488c makes the system resilient again. As you mentioned, it is not a perfect or permanent solution, but at least the system is now self-healing and recovers quickly (within ~10 seconds) when the gloo xDS control plane is restarted.

@alex-berger
Contributor Author

@olix0r Any chance we can get a new edge release this week with c49488c, so that we can upgrade and fix our broken clusters?

@olix0r
Member

olix0r commented Jun 7, 2021

@alex-berger yeah, I think we can introduce the temporary fix until we've modified the reconnect logic (some changes to set this up are already in-flight).

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 7, 2021
@olix0r olix0r added priority/P1 Planned for Release and removed priority/P0 Release Blocker labels Jun 10, 2021
@AlonGluz

AlonGluz commented Jun 22, 2021

@olix0r We're suffering from the same bug, I guess, but our linkerd-proxy is operating in default mode, not ingress, and we do use l5d-dst-override headers. Does the temp fix apply there as well?

@olix0r
Copy link
Member

olix0r commented Jun 22, 2021

@AlonGluz Can you please open a new issue with as many relevant details as possible? Thanks.

@stale

stale bot commented Sep 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 20, 2021
@stale stale bot closed this as completed Oct 4, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021