
Outbound proxy delays writing of TLS handshake message #6146

Closed
alex-berger opened this issue May 20, 2021 · 14 comments

@alex-berger
Contributor

alex-berger commented May 20, 2021

Bug Report

Outbound linkerd-proxy sometimes does not call epoll_wait on a certain epoll file descriptor which results in very large delays between establishing a connection and sending the TLS handshake client_hello message. This in turn causes the inbound linkerd-proxy on the other side to close that connection as it times out after 10 seconds trying to read the TLS/SNI message.

What is the issue?

From the log and trace files, and the key events summarized in the table further below, we can see that:

  • The outbound proxy successfully opens a TCP connection, but then waits for 52 seconds before it writes the first byte of data into that connection (the TLS handshake client_hello message).
  • The inbound proxy immediately accepts the connection from the outbound proxy and waits 10 seconds for any data to arrive. As no data arrives within 10 seconds, it times out, closes the connection and writes the "TLS detection timed out" log message (a minimal sketch of this timeout behaviour follows this list).
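
A minimal, illustrative Tokio sketch of that 10-second detection window (not Linkerd's actual code; the buffer size and error handling here are assumptions):

```rust
use std::time::Duration;
use tokio::{io::AsyncReadExt, net::TcpStream, time::timeout};

// Illustrative only: give a newly accepted connection a fixed window to send its
// first bytes (the TLS ClientHello) before treating protocol detection as timed out.
async fn detect_tls(mut conn: TcpStream) -> std::io::Result<usize> {
    let mut buf = [0u8; 512];
    match timeout(Duration::from_secs(10), conn.read(&mut buf)).await {
        // Data arrived in time; the real proxy would now parse the ClientHello / SNI.
        Ok(Ok(n)) if n > 0 => Ok(n),
        // Peer closed the connection without sending anything.
        Ok(Ok(_)) => Err(std::io::Error::new(
            std::io::ErrorKind::UnexpectedEof,
            "connection closed before any data",
        )),
        Ok(Err(e)) => Err(e),
        // No bytes within the window: this is the "TLS detection timed out" case.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "TLS detection timed out",
        )),
    }
}
```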

It looks like a strange bug in the outbound proxy, which in some cases leads to a very long delay between
establishing a connection and writing the TLS handshake. What is very suspicious is that there is no epoll_wait system call on the outbound proxy for epoll file descriptor 5 (the one the affected TCP connection is registered with).

I can only speculate on what causes this behaviour; some ideas:

  • A bug in the underlying async runtime (Tokio or Tower) which results in some epoll file descriptors not being polled (epoll_wait is never called for those file descriptors); see the sketch after this list.
  • ... your idea goes here ...
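
To make that speculation concrete: in Tokio, epoll registration only leads anywhere if the task owning the socket is actually polled. The sketch below is illustrative only (the destination address is taken from the capture above; the sleep stands in for the task not being driven, and the payload is just a placeholder for the ClientHello bytes), but it reproduces the observable symptom from the strace output: the connection is established immediately, yet the first bytes are only written after a ~52-second stall, long after the peer's 10-second detection timeout.

```rust
use std::time::Duration;
use tokio::{io::AsyncWriteExt, net::TcpStream};

// Illustrative only: simulate an outbound task that establishes a connection but
// is not driven again for a long time before writing its first bytes.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    // connect() and the epoll registration happen right away (events 1 and 2 in the table below).
    let mut stream = TcpStream::connect("10.40.42.129:8086").await?;

    // Stand-in for the task not being polled for ~52 seconds.
    tokio::time::sleep(Duration::from_secs(52)).await;

    // Only now are the first bytes (the ClientHello in the real proxy) written
    // (event 9 below), by which time the peer has long since closed the socket.
    stream.write_all(b"\x16\x03\x01").await?;
    Ok(())
}
```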

How can it be reproduced?

???

Logs, error output, etc

In order to narrow down what is going on, I enabled log level TRACE on the outbound and
inbound linkerd-proxy sidecars of the following two Pods in our cluster:

| Pod | IP Address | Role |
|-----|------------|------|
| gh-commander-api-f4b748487-7l57c | 10.40.39.45 | outbound |
| linkerd-destination-dc474b4bb-jsw7j | 10.40.42.129 | inbound |

Furthermore, I started strace sessions for the Unix processes of each of
these two linkerd-proxy instances to capture all system calls.

After running for some minutes, the problem appeared again at 2021-05-19T18:49:12Z, indicated
by a log message on the linkerd-destination Pod containing the string TLS detection timed out.
Being confident that we had captured an instance of the problematic event, I stopped log and strace
capturing, which resulted in the following files:

| Filename | Pod | Content | Relevant Time Window |
|----------|-----|---------|----------------------|
| 10.40.39.45.log.txt | gh-commander-api-f4b748487-7l57c | Log of outbound linkerd-proxy | 18:49:02Z - 18:49:54Z |
| 10.40.39.45.strace.txt | gh-commander-api-f4b748487-7l57c | Strace output of outbound linkerd-proxy | 18:49:02Z - 18:49:54Z |
| 10.40.42.129.log.txt | linkerd-destination-dc474b4bb-jsw7j | Log of inbound linkerd-proxy | 18:49:02Z - 18:49:14Z |
| 10.40.42.129.strace.txt | linkerd-destination-dc474b4bb-jsw7j | Strace output of inbound linkerd-proxy | 18:49:02Z - 18:49:14Z |

Note: the log files are huge and do not fit into a gist, so they are linked from the Linkerd Slack channel; you may only be able to read them if you have access to that channel.

The following table summarises the key events from the above log and trace files, in chronological order, for the TCP
connection 10.40.39.45:34506 → 10.40.42.129:8086:

| ID | Time | Event | Actor | Remarks |
|----|------|-------|-------|---------|
| 1 | 18:49:02Z | connect | outbound | Outbound linkerd-proxy opens TCP connection |
| 2 | 18:49:02Z | epoll_ctl | outbound | Outbound linkerd-proxy registers the connection with epoll file descriptor 5 |
| 3 | 18:49:02Z | accept | inbound | Inbound linkerd-proxy accepts TCP connection |
| 4 | 18:49:02Z | epoll_ctl | inbound | Inbound linkerd-proxy registers epoll |
| 5 | 18:49:12Z | read-timeout | inbound | Inbound linkerd-proxy closes TCP connection without having received any data from it |
| 6 | 18:49:12Z | epoll_ctl | inbound | Inbound linkerd-proxy deregisters epoll |
| 7 | 18:49:12Z | close | inbound | Inbound linkerd-proxy closes connection |
| 8 | 18:49:12Z | log | inbound | Inbound linkerd-proxy logs TLS detection timed out |
| 9 | 18:49:54Z | writev | outbound | Outbound linkerd-proxy writes TLS handshake data with SNI value linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local |
| 10 | 18:49:54Z | recvfrom | outbound | Outbound linkerd-proxy tries to read from the connection and realises that the peer closed it |
| 11 | 18:49:54Z | epoll_ctl | outbound | Outbound linkerd-proxy deregisters the connection from epoll file descriptor 5 |
| 12 | 18:49:54Z | close | outbound | Outbound linkerd-proxy closes connection |

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.19.6
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd version: edge-21.5.1

Possible solution

Additional context

This issue is based on this Slack thread.

@kleimkuhler
Contributor

@alex-berger Thanks for such a detailed issue and for posting your updates as you debugged this in the Slack channel. Descriptions like this are super helpful when tracking down the root cause. We'll take a look at what's going on.

@olix0r olix0r added priority/P0 Release Blocker area/proxy labels May 20, 2021
@olix0r olix0r self-assigned this May 20, 2021
@olix0r olix0r added the bug label May 20, 2021
@olix0r
Member

olix0r commented May 20, 2021

@alex-berger Thank you for spending the time to put together this extremely thorough bug report. I have spotted at least one thing that doesn't look right in our code and can put together a fix for it. If that's the culprit, this issue would be specific to HA control planes, so you may have some luck scaling the destination service down to a single replica in the meantime. I'll share a branch shortly and keep working on reproducing this and digging into more of your logs to see if there's anything else.

Sorry you hit this issue; but this detailed feedback is invaluable!

@olix0r
Member

olix0r commented May 20, 2021

@alex-berger How did this bug manifest in your application? I understand from your description and from the logs that connections from application pods to control plane pods could fail with bizarre timeouts. Did these connection issues cause problems for your application traffic? I'm not spotting any obvious warnings in the logs you shared.

There's definitely an issue here that we can fix; but I want to make sure we understand the whole surface of the problem so we can be sure it's actually being addressed fully.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 20, 2021
When there are multiple replicas of a controller--especially the
destination controller--the proxy creates a load balancer to distribute
requests across all controller pods.

linkerd/linkerd2#6146 describes a situation where controller connections
fail to be established because the client stalls for 50s+ between
initiating a connection and sending a TLS ClientHello, long after the
server has timed out the idle connection.

As it turns out, the controller client does not necessarily drive all of
its endpoints to readiness. Because load balancers are designed to
process requests when only a subset of endpoints are available, the load
balancer cannot be responsible for driving all endpoints in a service to
readiness and we need a `SpawnReady` layer that is responsible for
driving individual endpoints to readiness. While the outbound proxy's
balancers are instrumented with this layer, the controller clients were
not configured this way when load balancers were introduced.

We likely have not encountered this previously because the balancer
should effectively hide this problem in most cases: as long as a single
endpoint is available requests should be processed as expected; and if
there are no endpoints available, the balancer would drive at least one
to readiness in order to process requests.
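
For readers unfamiliar with Tower's spawn-ready machinery, here is a rough sketch of the pattern the commit message describes, assuming tower 0.4 with its `spawn-ready` feature enabled. It is illustrative only and not the actual change in linkerd/linkerd2-proxy#1014; the stand-in endpoint service and target string are made up.

```rust
use tower::{service_fn, spawn_ready::SpawnReadyLayer, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // Stand-in for a single controller endpoint client (the real one speaks gRPC over mTLS).
    let endpoint = service_fn(|target: String| async move {
        Ok::<_, std::convert::Infallible>(format!("resolved {target}"))
    });

    // The gist of the fix: SpawnReady drives the wrapped endpoint to readiness on a
    // background task, so its connection and handshake make progress even when the
    // load balancer is not currently polling it.
    let mut svc = ServiceBuilder::new()
        .layer(SpawnReadyLayer::new())
        .service(endpoint);

    let rsp = svc
        .ready()
        .await
        .expect("endpoint failed to become ready")
        .call("linkerd-dst-headless.linkerd.svc.cluster.local:8086".to_string())
        .await
        .expect("request failed");
    println!("{rsp}");
}
```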
@olix0r
Member

olix0r commented May 20, 2021

I've published an image with the fix in linkerd/linkerd2-proxy#1014. You can test it by setting pod annotations:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: a01b8bd2

This issue was limited to the clients that proxies use to talk to the control plane components. The load balancer would stop driving connections to "readiness" once a first connection was established. The referenced change will ensure that all connections are driven to completion, so these TLS detection timeouts are no longer logged. I would not, however, expect this bug to impact application traffic, because the client would still properly process discovery requests against one of the control plane pods. If you were seeing behavior that negatively impacted the application, definitely let us know so we can investigate further!

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 21, 2021
@alex-berger
Contributor Author

@olix0r From the aggregated linkerd-proxy logs I conclude that application traffic is not affected. However, we stopped the rollout of edge-21.5.1 to be on the safe side. I am also wondering what happens if control plane connections fail: isn't there a risk that proxies cannot obtain an up-to-date endpoint list and therefore might not be able to route traffic at all, or might route it to the wrong backends again?

@olix0r
Member

olix0r commented May 21, 2021

@alex-berger The controller client's balancer would ensure that at least one connection was available and then stop driving any of the other connections/handshakes to alternate endpoints. So it would have been possible to perform discovery over that one healthy connection, though requests wouldn't necessarily be distributed among all controller instances. With the change we just merged, all connections will be driven to readiness independently of the load balancer.
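
In Tower terms, "driving a connection to readiness" just means something keeps polling the endpoint service's `poll_ready` until it reports ready; with the fix, that polling happens on a per-endpoint background task rather than only inside the balancer. A minimal sketch of that idea (generic names here are illustrative, not the proxy's actual types):

```rust
use std::future::poll_fn;
use tower::Service;

// Illustrative only: await readiness of a service by repeatedly polling `poll_ready`.
// For a controller endpoint, this is what forces its TCP connection and TLS handshake
// to make progress even while no request is being routed to it.
async fn drive_to_readiness<S, Req>(svc: &mut S) -> Result<(), S::Error>
where
    S: Service<Req>,
{
    poll_fn(|cx| svc.poll_ready(cx)).await
}
```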

@alex-berger
Contributor Author

I tried to test-drive your changes, but it breaks our gloo-edge ingress gateway (running in ingress mode) as it does not add the l5d-dst-override header for all requests. The outbound linkerd-proxy logs of the gloo-edge ingress gateway Pods are now full of the following log message.

{"timestamp":"[ 995.334310s]","level":"WARN","fields":{"message":"Failed to proxy request: ingress-mode routing requires the l5d-dst-override header","client.addr":"10.176.9.124:43466"},"target":"linkerd_app_core::errors","spans":[{"orig_dst":"10.176.47.220:9977","name":"ingress"}],"threadId":"ThreadId(3)"}

Basically this means that, starting with that new proxy version, gloo-edge in ingress mode will no longer work and we will have to turn off ingress mode, which in turn might lead to other problems! Is it possible to revert those changes and make the proxy more lenient again (falling back to something else if l5d-dst-override is not set)?

@alex-berger
Contributor Author

alex-berger commented May 21, 2021

@olix0r I managed to test-drive the ghcr.io/olix0r/l2-proxy:a01b8bd2 image version and there are still TLS detection timed out messages in the aggregated linkerd-proxy logs. However, it now looks like those messages only pop up when Pods are restarted, and then only in the log of the terminating/starting Pod. During calm time windows, in which no Pods are terminated or started, those messages no longer appear.

Please take those findings with a grain of salt, as I tested things on a rather idle development cluster which does not serve much traffic. Due to #6157 I cannot test this on the cluster on which I captured all the logs and traces for this issue.

@alex-berger
Contributor Author

It would be super helpful if we could add connection information (local IP address & port, remote/peer IP address) to the DetectTimeout error message. This would allow me to reason about this error without having to increase the log level on all Pods.
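
For what it's worth, the requested diagnostic might look something like the hypothetical sketch below (the function and field names are assumptions, not current proxy output), using `tracing`-style structured logging:

```rust
use std::net::SocketAddr;
use tracing::info;

// Hypothetical shape of the requested log line: include the local and peer socket
// addresses so the affected connection can be identified without raising the log level.
fn log_detection_timeout(local_addr: SocketAddr, peer_addr: SocketAddr) {
    info!(%local_addr, %peer_addr, error = "TLS detection timed out", "Connection closed");
}
```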

@lupusllc

AKS 1.18.17
Linkerd 2.10.2
Ingress NGINX "ingress-nginx-3.29.0"

I can confirm that we're seeing the same sorts of intermittent errors, as far as I can tell (it is a lot of data to process). None of them appeared to actually drop any of our connections. We had tests running for extended periods (at 2-second and, later, 12-second intervals) against specific pods in two different AKS clusters to try to reproduce the issue.

So far we've only seen the error on services using our NGINX ingress. I tried for hours to get a non-ingress service to produce the same error and it never occurred, though all the tests we ran are rather simple and are served in less than a second.

I do agree that being able to see the IP and/or pod/service in the info alerts would greatly reduce the need to turn on debug or trace logging.

I also wonder if the error "Connection closed error=TLS detection timed out" is actually accurate. From my understanding, a failure to detect the protocol is supposed to fail open, not fail closed by closing the connection.

Hopefully these errors can be sorted out. We've upgraded several times to try to get rid of these false positives, as they create noise when troubleshooting and make it hard to rule out Linkerd issues.

olix0r added a commit that referenced this issue May 27, 2021
* Controller clients of components with more than one replica could fail
  to drive all connections to completion. This could result in timeouts
  showing up in logs, but would not have prevented proxies from
  communicating with controllers. #6146
* linkerd/linkerd2-proxy#992 made the `l5d-dst-override` header required
  for ingress-mode proxies. This behavior has been reverted so that
  requests without this header are forwarded to their original
  destination.
* OpenCensus trace spans for HTTP requests no longer include query
  parameters.

---

* ci: Update/pin action dependencies (linkerd/linkerd2-proxy#1012)
* control: Ensure endpoints are driven to readiness (linkerd/linkerd2-proxy#1014)
* Make span name without query string (linkerd/linkerd2-proxy#1013)
* ingress: Restore original dst address routing (linkerd/linkerd2-proxy#1016)
* ci: Restrict permissions in Actions (linkerd/linkerd2-proxy#1019)
* Forbid unsafe code in most modules (linkerd/linkerd2-proxy#1018)
@olix0r
Member

olix0r commented May 27, 2021

I also wonder if the error "Connection closed error=TLS detection timed out" is actually accurate. From my understanding, the failure to detect protocol is supposed to fail open, not actually fail close the connection.

This is the case for HTTP-level detection, where failing to detect protocol should be transparent to the application. However, for meshed TLS communication, we can't fail open in that way, as we risk proxying unexpected TLS handshakes to the application. We'll do some more thinking about how best to handle/diagnose these cases.

olix0r added a commit that referenced this issue May 27, 2021
@alex-berger
Contributor Author

As of edge-21.5.3 the error message TLS detection timed out still appears sporadically, immediately after the start of a Pod. However, linkerd/linkerd2-proxy#1014 has improved the situation drastically. The linkerd-proxy logs usually look like the one below and contain zero, one or two occurrences of that message:

time="2021-05-28T07:34:09Z" level=info msg="running version edge-21.5.3"
{"timestamp":"[ 0.004003s]","level":"INFO","fields":{"message":"Using single-threaded proxy runtime"},"target":"linkerd2_proxy::rt","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004936s]","level":"INFO","fields":{"message":"Admin interface on 0.0.0.0:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004969s]","level":"INFO","fields":{"message":"Inbound interface on 0.0.0.0:4143"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004979s]","level":"INFO","fields":{"message":"Outbound interface on 127.0.0.1:4140"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004987s]","level":"INFO","fields":{"message":"Tap interface on 0.0.0.0:4190"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004994s]","level":"INFO","fields":{"message":"Local identity is tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.005001s]","level":"INFO","fields":{"message":"Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.005009s]","level":"INFO","fields":{"message":"Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 2.355878s]","level":"INFO","fields":{"message":"Connection closed","error":"TLS detection timed out"},"target":"linkerd_app_core::serve","threadId":"ThreadId(2)"}
{"timestamp":"[ 5.045772s]","level":"INFO","fields":{"message":"Certified identity: tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_app","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(2)"}

@olix0r If this is expected behavior, please let me know and I will close this issue.

@olix0r
Member

olix0r commented May 28, 2021

@alex-berger Ah, yeah, this is likely because the pod hasn't yet acquired identity:

{"timestamp":"[ 2.355878s]","level":"INFO","fields":{"message":"Connection closed","error":"TLS detection timed out"},"target":"linkerd_app_core::serve","threadId":"ThreadId(2)"}
{"timestamp":"[ 5.045772s]","level":"INFO","fields":{"message":"Certified identity: tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_app","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(2)"}

so it can't terminate connections. We gate readiness on the existence of the identity, but we don't currently fail liveness probes in this state. Perhaps we should? I'm not immediately sure what the proper failure mode is in this case; I think we're basically hoping that the identity is acquired before the timeout expires, so we don't fail the request outright and instead wait for the identity to become available.

While this behavior isn't unexpected, it's probably worth leaving this open so we can take a deeper look into this behavior. At the very least, we should improve diagnostics about these timeouts (as you mentioned earlier).

@stale

stale bot commented Aug 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 26, 2021
@stale stale bot closed this as completed Sep 9, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2021