Outbound proxy delays writing of TLS handshake message #6146
Comments
@alex-berger Thanks for such a detailed issue and for posting your updates as you debugged this in the Slack channel. Descriptions like this are super helpful when finding the root issue. We'll take a look at what's going on.
@alex-berger Thank you for spending the time to put together this extremely thorough bug report. I have spotted at least one thing that doesn't look right in our code that I can put together a fix for. If that's the cause, this issue would be specific to HA control planes, so you may have some luck scaling down the destination service to a single replica in the meantime. I'll share a branch shortly and work on reproducing this and digging into more of your logs to see if there's anything else. Sorry you hit this issue, but this detailed feedback is invaluable!
@alex-berger How did this bug manifest to your application? I understand from your description and from the logs that connections from application pods to control plane pods could fail with bizarre timeouts. Did these connection issues cause problems for your application traffic? I'm not spotting any obvious warnings in the logs you shared. There's definitely an issue here that we can fix, but I want to make sure we understand the whole surface of the problem so we can be sure it's actually being addressed fully.
When there are multiple replicas of a controller--especially the destination controller--the proxy creates a load balancer to distribute requests across all controller pods. linkerd/linkerd2#6146 describes a situation where controller connections fail to be established because the client stalls for 50s+ between initiating a connection and sending a TLS ClientHello, long after the server has timed out the idle connection. As it turns out, the controller client does not necessarily drive all of its endpoints to readiness. Because load balancers are designed to process requests when only a subset of endpoints are available, the load balancer cannot be responsible for driving all endpoints in a service to readiness and we need a `SpawnReady` layer that is responsible for driving individual endpoints to readiness. While the outbound proxy's balancers are instrumented with this layer, the controller clients were not configured this way when load balancers were introduced. We likely have not encountered this previously because the balancer should effectively hide this problem in most cases: as long as a single endpoint is available requests should be processed as expected; and if there are no endpoints available, the balancer would drive at least one to readiness in order to process requests.
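For readers unfamiliar with the `SpawnReady` idea, here is a minimal, hypothetical Rust sketch (not Linkerd's or Tower's actual code; `Endpoint` and `become_ready` are invented names) of the core trick: an endpoint's readiness work, such as connecting and completing the TLS handshake, is spawned onto its own task so it runs to completion even if the balancer never polls that endpoint again after finding another ready one.

```rust
// Hypothetical sketch of the `SpawnReady` idea, not Linkerd's or Tower's
// actual code: `Endpoint` stands in for a controller endpoint whose readiness
// work (connect + TLS handshake) must finish even if a balancer never
// selects it.
use std::time::Duration;
use tokio::{task, time};

struct Endpoint {
    name: &'static str,
}

impl Endpoint {
    async fn become_ready(&self) {
        // Simulate connection establishment and the TLS handshake.
        time::sleep(Duration::from_millis(100)).await;
        println!("{}: ready (handshake finished)", self.name);
    }
}

#[tokio::main]
async fn main() {
    let endpoints = vec![Endpoint { name: "dst-0" }, Endpoint { name: "dst-1" }];

    // Spawn each endpoint's readiness future onto its own task so all of
    // them finish, instead of relying on the balancer to keep polling them.
    let handles: Vec<_> = endpoints
        .into_iter()
        .map(|ep| task::spawn(async move { ep.become_ready().await }))
        .collect();

    for handle in handles {
        handle.await.expect("readiness task panicked");
    }
}
```

Without this wrapping, the second endpoint's handshake can sit half-finished until the server times out the idle connection, which is exactly the 10-second `TLS detection timed out` symptom described in the issue.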
I've published an image with the fix in linkerd/linkerd2-proxy#1014. You can test it by setting the pod annotations:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: a01b8bd2

This issue was limited to the client connections from proxies to the control plane components. The load balancer would stop driving connections to "readiness" once a first connection was established. The referenced change will ensure that all connections are driven to completion so these TLS detection timeouts are no longer logged. I would not, however, expect this bug to impact application traffic, because the client would properly process discovery requests to one of the control plane pods. If you were seeing behavior that negatively impacted the application, definitely let us know so we can investigate further!
@olix0r From the aggregated
@alex-berger The controller client's balancer would ensure that at least one connection was available and then stop driving any of the other connections/handshakes to alternate endpoints. So it would have been possible to perform discovery over that one healthy connection, though requests wouldn't necessarily be distributed among all controller instances. With the change we just merged, all connections will be driven to readiness independently of the load balancer.
I tried to test-drive your changes, but it breaks our gloo-edge ingress gateway (running in ingress mode) as it does not add the `l5d-dst-override` header. Basically this means that, starting with that new proxy version, gloo-edge in ingress mode will no longer work and we will have to turn off ingress mode, which in turn might lead to other problems! Is it possible to revert those changes and make the proxy more lenient again (falling back to something else if the `l5d-dst-override` header is not present)?
@olix0r I managed to test-drive the fix. Please take those findings with a grain of salt, as I tested things on a rather idle development cluster which does not serve much traffic. Due to #6157 I cannot test this on the cluster on which I captured all the logs and traces for this issue.
It would be super helpful if we could add connection information (local IP address & port, remote/peer IP address) to the `TLS detection timed out` log messages.
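As a rough illustration of that request, a hedged sketch (hypothetical function and field names, not the proxy's actual logging code) of attaching the local and peer socket addresses to the timeout log via `tracing`:

```rust
use std::net::SocketAddr;
use tracing::{info, info_span};

// Hypothetical helper, not the proxy's actual code: record which connection
// timed out by putting both socket addresses on the log span.
fn log_tls_detect_timeout(local: SocketAddr, peer: SocketAddr) {
    let span = info_span!("tls_detect", %local, %peer);
    let _guard = span.enter();
    info!("TLS detection timed out");
}
```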
AKS 1.18.17. I can confirm that we're seeing the same sorts of intermittent errors, as far as I can tell (it is a lot of data to process), none of which appeared to actually drop any of our connections. We had tests running for extended periods (at 2-second and, later, 12-second intervals) to try and reproduce the issues, in two different AKS clusters, against specific pods. So far we've only seen the error on services using our NGINX ingress. I tried for hours to get a non-ingress service to produce the same error and it never occurred, though all the tests we ran are rather simple and are served in less than a second. I do agree that being able to see the IP and/or pod/service in the info alerts would greatly reduce the need to turn on debug or trace logging. I also wonder if the error "Connection closed error=TLS detection timed out" is actually accurate. From my understanding, the failure to detect the protocol is supposed to fail open, not actually close the connection. Hopefully these errors can be sorted out; we've upgraded several times to try and get rid of these false positives, as they cause issues when troubleshooting. It's hard to rule out Linkerd issues with all the noise.
* Controller clients of components with more than one replica could fail to drive all connections to completion. This could result in timeouts showing up in logs, but would not have prevented proxies from communicating with controllers. #6146
* linkerd/linkerd2-proxy#992 made the `l5d-dst-override` header required for ingress-mode proxies. This behavior has been reverted so that requests without this header are forwarded to their original destination.
* OpenCensus trace spans for HTTP requests no longer include query parameters.

---

* ci: Update/pin action dependencies (linkerd/linkerd2-proxy#1012)
* control: Ensure endpoints are driven to readiness (linkerd/linkerd2-proxy#1014)
* Make span name without query string (linkerd/linkerd2-proxy#1013)
* ingress: Restore original dst address routing (linkerd/linkerd2-proxy#1016)
* ci: Restrict permissions in Actions (linkerd/linkerd2-proxy#1019)
* Forbid unsafe code in most modules (linkerd/linkerd2-proxy#1018)
This is the case for HTTP-level detection, where failing to detect protocol should be transparent to the application. However, for meshed TLS communication, we can't fail open in that way, as we risk proxying unexpected TLS handshakes to the application. We'll do some more thinking about how best to handle/diagnose these cases.
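A minimal sketch of the detection step being discussed, under the assumption that it boils down to "wait a bounded time for the first bytes" (hypothetical function, not the proxy's implementation); the caller then decides whether a timeout fails open (forward the raw TCP stream, as with HTTP-level detection) or fails closed (drop the connection, as with expected meshed TLS):

```rust
// Hypothetical sketch, not the proxy's implementation: wait a bounded time
// for the first bytes of an accepted connection. `None` means detection
// timed out; the caller chooses fail-open or fail-closed handling.
use std::time::Duration;
use tokio::{io::AsyncReadExt, net::TcpStream, time};

async fn detect_first_bytes(stream: &mut TcpStream, window: Duration) -> Option<Vec<u8>> {
    let mut buf = [0u8; 512];
    match time::timeout(window, stream.read(&mut buf)).await {
        Ok(Ok(n)) if n > 0 => Some(buf[..n].to_vec()),
        _ => None, // timed out, peer closed, or read error
    }
}
```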
As of
@olix0r If this is expected behavior, please let me know and I will close this issue.
@alex-berger Ah, yeah, this is likely because the pod hasn't yet acquired identity, so it can't terminate connections. We gate readiness on the existence of the identity, but we don't currently fail liveness probes in this state. Perhaps we should? I'm not immediately sure what the proper failure mode is in this case; I think we're basically hoping that the identity is acquired before the timeout expires, and so we don't fail the request outright and instead wait for identity to become available. While this behavior isn't unexpected, it's probably worth leaving this open so we can take a deeper look into it. At the very least, we should improve diagnostics about these timeouts (as you mentioned earlier).
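To make the readiness-vs-liveness distinction above concrete, a tiny hypothetical sketch (invented names, not the proxy's code): readiness is gated on having obtained an identity certificate, while liveness is reported independently of it.

```rust
// Hypothetical sketch with invented names, not the proxy's code.
use tokio::sync::watch;

struct ProxyHealth {
    // Holds Some(certificate) once identity has been acquired.
    identity: watch::Receiver<Option<String>>,
}

impl ProxyHealth {
    fn ready(&self) -> bool {
        self.identity.borrow().is_some()
    }

    fn live(&self) -> bool {
        // Liveness does not depend on identity; the pod keeps waiting for it.
        true
    }
}
```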
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Bug Report
Outbound `linkerd-proxy` sometimes does not call `epoll_wait` on a certain epoll file descriptor, which results in very large delays between establishing a connection and sending the TLS handshake `client_hello` message. This in turn causes the inbound `linkerd-proxy` on the other side to close that connection, as it times out after 10 seconds trying to read the TLS/SNI message.

What is the issue?

Interpreting the log and trace files and the key events as summarized in the table below, we can see that:

* The `outbound` proxy successfully opens a TCP connection, but then waits for 52 seconds before it writes the first byte of data into that connection (the TLS handshake `client_hello` message).
* The `inbound` proxy immediately accepts the connection from the `outbound` proxy and waits for 10 seconds for any data to arrive. As no data arrives within 10 seconds, it times out, closes the connection and writes the "`TLS detection timed out`" log message.

It looks like this is a strange bug in the `outbound` proxy, which in some cases leads to a very long delay between establishing a connection and writing the TLS handshake. What is very suspicious is that there is no `epoll_wait` system call on the `outbound` proxy for the epoll file descriptor `5` (the one the affected TCP connection is registered with).

I can only speculate on what causes this behaviour; some ideas:

* `epoll_wait` is not called for those file descriptors.

How can it be reproduced?
???
Logs, error output, etc
In order to narrow down what is going on, I enabled log level `TRACE` on the outbound and inbound `linkerd-proxy` sidecars of the following two Pods in our cluster:

| Pod | IP address | Role |
| --- | --- | --- |
| `gh-commander-api-f4b748487-7l57c` | `10.40.39.45` | outbound |
| `linkerd-destination-dc474b4bb-jsw7j` | `10.40.42.129` | inbound |
Furthermore, I started `strace` sessions for the Unix processes of each of these two `linkerd-proxy` instances to capture all system calls.

After running for some minutes, the problem appeared again at 2021-05-19T18:49:12Z, indicated by a log message on the `linkerd-destination` Pod containing the string `TLS detection timed out`.

Being confident that we had captured an instance of the problematic event, I stopped the log and `strace` capturing, which resulted in the following files:

* outbound `linkerd-proxy` log
* outbound `linkerd-proxy` `strace` capture
* inbound `linkerd-proxy` log
* inbound `linkerd-proxy` `strace` capture

Note: the log files are huge and do not fit into a gist, so they are links to the linkerd Slack channel and you might only be able to read them if you have access to that channel.
The following table summarises the key events from the above log and trace files, in chronological order, for the TCP connection `10.40.39.45:34506` ➔ `10.40.42.129:8086`:

| Event | Proxy | Description |
| --- | --- | --- |
| `connect` | outbound `linkerd-proxy` | opens TCP connection |
| `epoll_ctl` | outbound `linkerd-proxy` | registers `epoll` with file descriptor `5` |
| `accept` | inbound `linkerd-proxy` | accepts TCP connection |
| `epoll_ctl` | inbound `linkerd-proxy` | registers `epoll` |
| read-timeout | inbound `linkerd-proxy` | closes TCP connection without having received any data from it |
| `epoll_ctl` | inbound `linkerd-proxy` | deregisters `epoll` |
| `close` | inbound `linkerd-proxy` | closes connection |
| log | inbound `linkerd-proxy` | logs `TLS detection timed out` |
| `writev` | outbound `linkerd-proxy` | writes TLS handshake data with SNI value `linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local` |
| `recvfrom` | outbound `linkerd-proxy` | tries to read from connection and realises that the peer closed it |
| `epoll_ctl` | outbound `linkerd-proxy` | deregisters `epoll` from file descriptor `5` |
| `close` | outbound `linkerd-proxy` | closes connection |

`linkerd check` output

Environment
Possible solution
Additional context
This issue is based on this Slack thread.