
Outbound proxy delays writing of TLS handshake message #6146

Closed
alex-berger opened this issue May 20, 2021 · 14 comments

@alex-berger
Contributor

alex-berger commented May 20, 2021

Bug Report

Outbound linkerd-proxy sometimes does not call epoll_wait on a certain epoll file descriptor which results in very large delays between establishing a connection and sending the TLS handshake client_hello message. This in turn causes the inbound linkerd-proxy on the other side to close that connection as it times out after 10 seconds trying to read the TLS/SNI message.

What is the issue?

From the log and trace files, and the key events summarized in the table further below, we can see that:

  • The outbound proxy successfully opens a TCP connection, but then waits for 52 seconds before it writes the first byte of data into that connection (the TLS handshake client_hello message).
  • The inbound proxy immediately accepts the connection from the outbound proxy and waits 10 seconds for any data to arrive. As no data arrives within 10 seconds, it times out, closes the connection and writes the "TLS detection timed out" log message (a minimal sketch of this timeout behaviour follows this list).
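
A minimal, illustrative Tokio sketch of that 10-second detection window (not Linkerd's actual code; the buffer size and error handling here are assumptions):

```rust
use std::time::Duration;
use tokio::{io::AsyncReadExt, net::TcpStream, time::timeout};

// Illustrative only: give a newly accepted connection a fixed window to send its
// first bytes (the TLS ClientHello) before treating protocol detection as timed out.
async fn detect_tls(mut conn: TcpStream) -> std::io::Result<usize> {
    let mut buf = [0u8; 512];
    match timeout(Duration::from_secs(10), conn.read(&mut buf)).await {
        // Data arrived in time; the real proxy would now parse the ClientHello / SNI.
        Ok(Ok(n)) if n > 0 => Ok(n),
        // Peer closed the connection without sending anything.
        Ok(Ok(_)) => Err(std::io::Error::new(
            std::io::ErrorKind::UnexpectedEof,
            "connection closed before any data",
        )),
        Ok(Err(e)) => Err(e),
        // No bytes within the window: this is the "TLS detection timed out" case.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "TLS detection timed out",
        )),
    }
}
```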

It looks like a strange bug in the outbound proxy, which in some cases leads to a very long delay between
establishing a connection and writing the TLS handshake. What is very suspicious is that there is no epoll_wait system call on the outbound proxy for epoll file descriptor 5 (the one the affected TCP connection is registered with).

I can only speculate on what causes this behaviour; some ideas:

  • A bug in the underlying async runtime (Tokio or Tower) which results in some epoll file descriptors not being polled (epoll_wait is never called for those file descriptors); see the sketch after this list.
  • ... your idea goes here ...
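
To make that speculation concrete: in Tokio, epoll registration only leads anywhere if the task owning the socket is actually polled. The sketch below is illustrative only (the destination address is taken from the capture above; the sleep stands in for the task not being driven, and the payload is just a placeholder for the ClientHello bytes), but it reproduces the observable symptom from the strace output: the connection is established immediately, yet the first bytes are only written after a ~52-second stall, long after the peer's 10-second detection timeout.

```rust
use std::time::Duration;
use tokio::{io::AsyncWriteExt, net::TcpStream};

// Illustrative only: simulate an outbound task that establishes a connection but
// is not driven again for a long time before writing its first bytes.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    // connect() and the epoll registration happen right away (events 1 and 2 in the table below).
    let mut stream = TcpStream::connect("10.40.42.129:8086").await?;

    // Stand-in for the task not being polled for ~52 seconds.
    tokio::time::sleep(Duration::from_secs(52)).await;

    // Only now are the first bytes (the ClientHello in the real proxy) written
    // (event 9 below), by which time the peer has long since closed the socket.
    stream.write_all(b"\x16\x03\x01").await?;
    Ok(())
}
```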

How can it be reproduced?

???

Logs, error output, etc

In order to narrow down what is going on, I enabled log level TRACE on the outbound and
inbound linkerd-proxy sidecars of the following two Pods in our cluster:

| Pod | IP Address | Role |
|-----|------------|------|
| gh-commander-api-f4b748487-7l57c | 10.40.39.45 | outbound |
| linkerd-destination-dc474b4bb-jsw7j | 10.40.42.129 | inbound |

Furthermore, I started strace sessions for the Unix processes of each of
these two linkerd-proxy instances to capture all system calls.

After running for some minutes, the problem appeared again at 2021-05-19T18:49:12Z, indicated
by a log message on the linkerd-destination Pod containing the string TLS detection timed out.
Being confident that we had captured an instance of the problematic event, I stopped log and strace
capturing, which resulted in the following files:

| Filename | Pod | Content | Relevant Time Window |
|----------|-----|---------|----------------------|
| 10.40.39.45.log.txt | gh-commander-api-f4b748487-7l57c | Log of outbound linkerd-proxy | 18:49:02Z - 18:49:54Z |
| 10.40.39.45.strace.txt | gh-commander-api-f4b748487-7l57c | Strace output of outbound linkerd-proxy | 18:49:02Z - 18:49:54Z |
| 10.40.42.129.log.txt | linkerd-destination-dc474b4bb-jsw7j | Log of inbound linkerd-proxy | 18:49:02Z - 18:49:14Z |
| 10.40.42.129.strace.txt | linkerd-destination-dc474b4bb-jsw7j | Strace output of inbound linkerd-proxy | 18:49:02Z - 18:49:14Z |

Note: the log files are huge and do not fit into a gist, so they are linked from the Linkerd Slack channel; you may only be able to read them if you have access to that channel.

The following table summarises the key events from the above log and trace files, in chronological order, for the TCP
connection 10.40.39.45:34506 → 10.40.42.129:8086:

| ID | Time | Event | Actor | Remarks |
|----|------|-------|-------|---------|
| 1 | 18:49:02Z | connect | outbound | Outbound linkerd-proxy opens TCP connection |
| 2 | 18:49:02Z | epoll_ctl | outbound | Outbound linkerd-proxy registers the connection with epoll file descriptor 5 |
| 3 | 18:49:02Z | accept | inbound | Inbound linkerd-proxy accepts TCP connection |
| 4 | 18:49:02Z | epoll_ctl | inbound | Inbound linkerd-proxy registers epoll |
| 5 | 18:49:12Z | read-timeout | inbound | Inbound linkerd-proxy closes TCP connection without having received any data from it |
| 6 | 18:49:12Z | epoll_ctl | inbound | Inbound linkerd-proxy deregisters epoll |
| 7 | 18:49:12Z | close | inbound | Inbound linkerd-proxy closes connection |
| 8 | 18:49:12Z | log | inbound | Inbound linkerd-proxy logs TLS detection timed out |
| 9 | 18:49:54Z | writev | outbound | Outbound linkerd-proxy writes TLS handshake data with SNI value linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local |
| 10 | 18:49:54Z | recvfrom | outbound | Outbound linkerd-proxy tries to read from the connection and realises that the peer closed it |
| 11 | 18:49:54Z | epoll_ctl | outbound | Outbound linkerd-proxy deregisters the connection from epoll file descriptor 5 |
| 12 | 18:49:54Z | close | outbound | Outbound linkerd-proxy closes connection |

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.19.6
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd version: edge-21.5.1

Possible solution

Additional context

This issue is based on this Slack thread.

@kleimkuhler
Contributor

@alex-berger Thanks for such a detailed issue and for posting your updates as you debugged this in the Slack channel. Descriptions like this are super helpful when tracking down the root cause. We'll take a look at what's going on.

@olix0r olix0r added priority/P0 Release Blocker area/proxy labels May 20, 2021
@olix0r olix0r self-assigned this May 20, 2021
@olix0r olix0r added the bug label May 20, 2021
@olix0r
Member

olix0r commented May 20, 2021

@alex-berger Thank you for spending the time to put together this extremely thorough bug report. I have spotted at least one thing that doesn't look right in our code and can put together a fix for it. If that's the culprit, this issue would be specific to HA control planes, so you may have some luck scaling the destination service down to a single replica in the meantime. I'll share a branch shortly and keep working on reproducing this and digging into more of your logs to see if there's anything else.

Sorry you hit this issue; but this detailed feedback is invaluable!

@olix0r
Member

olix0r commented May 20, 2021

@alex-berger How did this bug manifest in your application? I understand from your description and from the logs that connections from application pods to control plane pods could fail with bizarre timeouts. Did these connection issues cause problems for your application traffic? I'm not spotting any obvious warnings in the logs you shared.

There's definitely an issue here that we can fix; but I want to make sure we understand the whole surface of the problem so we can be sure it's actually being addressed fully.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 20, 2021
When there are multiple replicas of a controller--especially the
destination controller--the proxy creates a load balancer to distribute
requests across all controller pods.

linkerd/linkerd2#6146 describes a situation where controller connections
fail to be established because the client stalls for 50s+ between
initiating a connection and sending a TLS ClientHello, long after the
server has timed out the idle connection.

As it turns out, the controller client does not necessarily drive all of
its endpoints to readiness. Because load balancers are designed to
process requests when only a subset of endpoints are available, the load
balancer cannot be responsible for driving all endpoints in a service to
readiness and we need a `SpawnReady` layer that is responsible for
driving individual endpoints to readiness. While the outbound proxy's
balancers are instrumented with this layer, the controller clients were
not configured this way when load balancers were introduced.

We likely have not encountered this previously because the balancer
should effectively hide this problem in most cases: as long as a single
endpoint is available requests should be processed as expected; and if
there are no endpoints available, the balancer would drive at least one
to readiness in order to process requests.
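
For readers unfamiliar with Tower's spawn-ready machinery, here is a rough sketch of the pattern the commit message describes, assuming tower 0.4 with its `spawn-ready` feature enabled. It is illustrative only and not the actual change in linkerd/linkerd2-proxy#1014; the stand-in endpoint service and target string are made up.

```rust
use tower::{service_fn, spawn_ready::SpawnReadyLayer, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // Stand-in for a single controller endpoint client (the real one speaks gRPC over mTLS).
    let endpoint = service_fn(|target: String| async move {
        Ok::<_, std::convert::Infallible>(format!("resolved {target}"))
    });

    // The gist of the fix: SpawnReady drives the wrapped endpoint to readiness on a
    // background task, so its connection and handshake make progress even when the
    // load balancer is not currently polling it.
    let mut svc = ServiceBuilder::new()
        .layer(SpawnReadyLayer::new())
        .service(endpoint);

    let rsp = svc
        .ready()
        .await
        .expect("endpoint failed to become ready")
        .call("linkerd-dst-headless.linkerd.svc.cluster.local:8086".to_string())
        .await
        .expect("request failed");
    println!("{rsp}");
}
```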
@olix0r
Member

olix0r commented May 20, 2021

I've published an image with the fix in linkerd/linkerd2-proxy#1014. You can test it by setting pod annotations:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: a01b8bd2

This issue was limited to the clients that proxies use to talk to the control plane components. The load balancer would stop driving connections to "readiness" once a first connection was established. The referenced change will ensure that all connections are driven to completion, so these TLS detection timeouts are no longer logged. I would not, however, expect this bug to impact application traffic, because the client would still properly process discovery requests against one of the control plane pods. If you were seeing behavior that negatively impacted the application, definitely let us know so we can investigate further!

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 21, 2021
@alex-berger
Contributor Author

@olix0r From the aggregated linkerd-proxy logs I conclude that application traffic is not affected. However, we stopped the rollout of edge-21.5.1 to be on the safe side. I am also wondering what happens if control plane connections fail: isn't there a risk that proxies cannot obtain an up-to-date endpoint list and therefore might not be able to route traffic at all, or might route it to the wrong backends again?

@olix0r
Member

olix0r commented May 21, 2021

@alex-berger The controller client's balancer would ensure that at least one connection was available and then stop driving any of the other connections/handshakes to alternate endpoints. So it would have been possible to perform discovery over that one healthy connection, though requests wouldn't necessarily be distributed among all controller instances. With the change we just merged, all connections will be driven to readiness independently of the load balancer.
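
In Tower terms, "driving a connection to readiness" just means something keeps polling the endpoint service's `poll_ready` until it reports ready; with the fix, that polling happens on a per-endpoint background task rather than only inside the balancer. A minimal sketch of that idea (generic names here are illustrative, not the proxy's actual types):

```rust
use std::future::poll_fn;
use tower::Service;

// Illustrative only: await readiness of a service by repeatedly polling `poll_ready`.
// For a controller endpoint, this is what forces its TCP connection and TLS handshake
// to make progress even while no request is being routed to it.
async fn drive_to_readiness<S, Req>(svc: &mut S) -> Result<(), S::Error>
where
    S: Service<Req>,
{
    poll_fn(|cx| svc.poll_ready(cx)).await
}
```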

@alex-berger
Contributor Author

I tried to test-drive your changes, but it breaks our gloo-edge ingress gateway (running in ingress mode) as it does not add the l5d-dst-override header for all requests. The outbound linkerd-proxy logs of the gloo-edge ingress gateway Pods are now full of the following log message.

{"timestamp":"[ 995.334310s]","level":"WARN","fields":{"message":"Failed to proxy request: ingress-mode routing requires the l5d-dst-override header","client.addr":"10.176.9.124:43466"},"target":"linkerd_app_core::errors","spans":[{"orig_dst":"10.176.47.220:9977","name":"ingress"}],"threadId":"ThreadId(3)"}

Basically this means that, starting with that new proxy version, gloo-edge in ingress mode will no longer work and we will have to turn off ingress mode, which in turn might lead to other problems! Is it possible to revert those changes and make the proxy more lenient again (falling back to something else if l5d-dst-override is not set)?

@alex-berger
Contributor Author

alex-berger commented May 21, 2021

@olix0r I managed to test-drive the ghcr.io/olix0r/l2-proxy:a01b8bd2 image version and there are still TLS detection timed out messages in the aggregated linkerd-proxy logs. However, it now looks like those messages only pop up when Pods are restarted, and then only in the log of the terminating/starting Pod. During calm time windows, in which no Pods are terminated or started, those messages no longer appear.

Please take those findings with a grain of salt, as I tested things on a rather idle development cluster which does not serve much traffic. Due to #6157 I cannot test this on the cluster on which I captured all the logs and traces for this issue.

@alex-berger
Contributor Author

It would be super helpful if we could add connection information (local IP address & port, remote/peer IP address) to the DetectTimeout error message. This would allow me to reason about this error without having to increase the log level on all Pods.
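
For what it's worth, the requested diagnostic might look something like the hypothetical sketch below (the function and field names are assumptions, not current proxy output), using `tracing`-style structured logging:

```rust
use std::net::SocketAddr;
use tracing::info;

// Hypothetical shape of the requested log line: include the local and peer socket
// addresses so the affected connection can be identified without raising the log level.
fn log_detection_timeout(local_addr: SocketAddr, peer_addr: SocketAddr) {
    info!(%local_addr, %peer_addr, error = "TLS detection timed out", "Connection closed");
}
```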

@lupusllc

AKS 1.18.17
Linkerd 2.10.2
Ingress NGINX "ingress-nginx-3.29.0"

I can confirm that we're seeing the same sorts of intermittent errors, as far as I can tell (it is a lot of data to process). None of them appeared to actually drop any of our connections. We had tests running for extended periods (at 2-second and, later, 12-second intervals) against specific pods in two different AKS clusters to try to reproduce the issue.

So far we've only seen the error on services using our NGINX ingress. I tried for hours to get a non-ingress service to produce the same error and it never occurred, though all the tests we ran are rather simple and are served in less than a second.

I do agree that being able to see the IP and/or pod/service in the info alerts would greatly reduce the need to turn on debug or trace logging.

I also wonder if the error "Connection closed error=TLS detection timed out" is actually accurate. From my understanding, a failure to detect the protocol is supposed to fail open, not fail closed by closing the connection.

Hopefully these errors can be sorted out. We've upgraded several times to try to get rid of these false positives, as they create noise when troubleshooting and make it hard to rule out Linkerd issues.

olix0r added a commit that referenced this issue May 27, 2021
* Controller clients of components with more than one replica could fail
  to drive all connections to completion. This could result in timeouts
  showing up in logs, but would not have prevented proxies from
  communicating with controllers. #6146
* linkerd/linkerd2-proxy#992 made the `l5d-dst-override` header required
  for ingress-mode proxies. This behavior has been reverted so that
  requests without this header are forwarded to their original
  destination.
* OpenCensus trace spans for HTTP requests no longer include query
  parameters.

---

* ci: Update/pin action dependencies (linkerd/linkerd2-proxy#1012)
* control: Ensure endpoints are driven to readiness (linkerd/linkerd2-proxy#1014)
* Make span name without query string (linkerd/linkerd2-proxy#1013)
* ingress: Restore original dst address routing (linkerd/linkerd2-proxy#1016)
* ci: Restrict permissions in Actions (linkerd/linkerd2-proxy#1019)
* Forbid unsafe code in most modules (linkerd/linkerd2-proxy#1018)
@olix0r
Member

olix0r commented May 27, 2021

I also wonder if the error "Connection closed error=TLS detection timed out" is actually accurate. From my understanding, the failure to detect protocol is supposed to fail open, not actually fail close the connection.

This is the case for HTTP-level detection, where failing to detect protocol should be transparent to the application. However, for meshed TLS communication, we can't fail open in that way, as we risk proxying unexpected TLS handshakes to the application. We'll do some more thinking about how best to handle/diagnose these cases.

olix0r added a commit that referenced this issue May 27, 2021
@alex-berger
Contributor Author

As of edge-21.5.3 the error message TLS detection timed out still appears sporadically, immediately after the start of a Pod. However, linkerd/linkerd2-proxy#1014 has improved the situation drastically. The linkerd-proxy logs usually look like the one below and contain zero, one or two occurrences of that message:

time="2021-05-28T07:34:09Z" level=info msg="running version edge-21.5.3"
{"timestamp":"[ 0.004003s]","level":"INFO","fields":{"message":"Using single-threaded proxy runtime"},"target":"linkerd2_proxy::rt","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004936s]","level":"INFO","fields":{"message":"Admin interface on 0.0.0.0:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004969s]","level":"INFO","fields":{"message":"Inbound interface on 0.0.0.0:4143"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004979s]","level":"INFO","fields":{"message":"Outbound interface on 127.0.0.1:4140"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004987s]","level":"INFO","fields":{"message":"Tap interface on 0.0.0.0:4190"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004994s]","level":"INFO","fields":{"message":"Local identity is tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.005001s]","level":"INFO","fields":{"message":"Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.005009s]","level":"INFO","fields":{"message":"Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 2.355878s]","level":"INFO","fields":{"message":"Connection closed","error":"TLS detection timed out"},"target":"linkerd_app_core::serve","threadId":"ThreadId(2)"}
{"timestamp":"[ 5.045772s]","level":"INFO","fields":{"message":"Certified identity: tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_app","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(2)"}

@olix0r If this is expected behavior, please let me know and I will close this issue.

@olix0r
Member

olix0r commented May 28, 2021

@alex-berger Ah, yeah, this is likely because the pod hasn't yet acquired identity:

{"timestamp":"[ 2.355878s]","level":"INFO","fields":{"message":"Connection closed","error":"TLS detection timed out"},"target":"linkerd_app_core::serve","threadId":"ThreadId(2)"}
{"timestamp":"[ 5.045772s]","level":"INFO","fields":{"message":"Certified identity: tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_app","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(2)"}

so it can't terminate connections. We gate readiness on the existence of the identity, but we don't currently fail liveness probes in this state. Perhaps we should? I'm not immediately sure what the proper failure mode is in this case; I think we're basically hoping that the identity is acquired before the timeout expires, so we don't fail the request outright and instead wait for the identity to become available.

While this behavior isn't unexpected, it's probably worth leaving this open so we can take a deeper look into this behavior. At the very least, we should improve diagnostics about these timeouts (as you mentioned earlier).

@stale

stale bot commented Aug 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 26, 2021
@stale stale bot closed this as completed Sep 9, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2021