Skip to content

Set TCP_USER_TIMEOUT socket option#926

Merged
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
p0lyn0mial:tcp-usr-timeout-dialer
Nov 2, 2020
Merged

Set TCP_USER_TIMEOUT socket option#926
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
p0lyn0mial:tcp-usr-timeout-dialer

Conversation

@p0lyn0mial
Copy link
Contributor

@p0lyn0mial p0lyn0mial commented Oct 19, 2020

This PR sets the TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) socket option which controls for how long transmitted data may be unacknowledged before the connection is forcefully closed.

Without that option, we rely on the TCP stack to detect broken network connection. This can take up to 15 minutes. During that time our platform might be unavailable.

There are already reported cases in which aggregated APIs (i.e. openshift-apiserver) were unable to establish a new connection to the Kube API for 15 minutes after "ungraceful termination" (https://bugzilla.redhat.com/show_bug.cgi?id=1881878 and after a network error https://bugzilla.redhat.com/show_bug.cgi?id=1879232#c39)

It looks like detecting broken connections on the application level is getting more traction and is preferable. Unfortunately, it is on the slow track and will require backporting to golang's std library. Until that time we would like to take advantage of TCP_USER_TIMEOUT

[1] https://go-review.googlesource.com/c/net/+/198040
[2] https://go-review.googlesource.com/c/net/+/236498#message-7bd657ac6960f0dc7acbbe28cbe3d80ac4f3a34b

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2020
@p0lyn0mial
Copy link
Contributor Author

/assing @sttts @squeed

KeepAlive: 30 * time.Second,
}).DialContext
Control: func(network, address string, c syscall.RawConn) error {
// Supported only on Linux
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have to detect connections to the local host?

@p0lyn0mial p0lyn0mial force-pushed the tcp-usr-timeout-dialer branch 2 times, most recently from 7c2abcc to 6f5b6b3 Compare October 20, 2020 09:52
@squeed
Copy link
Contributor

squeed commented Oct 20, 2020

Given that I don't think many people will be accessing the apiserver over sattelite internet, these timing parameters seem just fine. However, we need to verify that they do what we want, since these knobs are non-orthogonal and never obvious.

The easiest way I can think of to test this is

  1. spin up a cluster with this PR
  2. Start a pod, start a tcpdump on a master with all traffic from / to that pod ip
  3. issue a watch for something that has no updates
  4. verify that tcp keepalives do what we expect
  5. using iptables, block access to that pod from the apiserver
  6. verify that the connection is closed.

func dialerWithDefaultOptions() *net.Dialer {
return &net.Dialer{
// TCP_USER_TIMEOUT does affect the behaviour of connect() which is controlled by this field so we set it to the same value
Timeout: 25 * time.Second,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout, in this case, is timeout to establish the TCP session. 25 seconds might be gigantic, if this is exclusively intra-cluster traffic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this function is primarily used by the operators (https://github.com/search?q=org%3Aopenshift+GetKubeConfigOrInClusterConfig&type=code).

Setting it to 5s would be okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, now that I think about it, 20 is probably about right. In a cluster that is thrashing but making progress, setting too low a timeout just leads to additional resource exhaustion. As always, it's a balancing act between keeping lightly-loaded clusters performant vs. tolerating heavliy-loaded ones.

I wish we had better numbers to make this decision. Let me see if we have some metrics.

@p0lyn0mial p0lyn0mial force-pushed the tcp-usr-timeout-dialer branch 3 times, most recently from 5b15e39 to 3073050 Compare October 20, 2020 10:18
klog.V(2).Info("Creating the default network Dialer (unsupported platform). It may take up to 15 minutes to detect broken connections and establish a new one")
return &net.Dialer{
Timeout: 30 * time.Second,
KeepAlive: 30 * time.Second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keepalive in linux is gone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, KeepAlive setts both TCP_KEEPINTVL and TCP_KEEPIDLE to the same value. Since we want distinct values we are now setting them in setDefaultSocketOptions function

@p0lyn0mial p0lyn0mial force-pushed the tcp-usr-timeout-dialer branch from 3073050 to 9693567 Compare October 20, 2020 10:54
@p0lyn0mial p0lyn0mial changed the title [WIP]: Set TCP_USER_TIMEOUT socket option Set TCP_USER_TIMEOUT socket option Oct 20, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 20, 2020
@p0lyn0mial
Copy link
Contributor Author

/hold

for testing

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2020
@p0lyn0mial p0lyn0mial force-pushed the tcp-usr-timeout-dialer branch from f6b30d9 to e8be127 Compare October 28, 2020 11:11
@p0lyn0mial
Copy link
Contributor Author

re e8be127 commit

it turned out that TCP_KEEPINTVL and TCP_KEEPIDLE weren't set because they got overwritten after calling DialContext method, here https://github.com/golang/go/blob/master/src/net/dial.go#L431

@p0lyn0mial
Copy link
Contributor Author

I did test it on an idle connection and it looks like it actually works.

I created a simple app that establishes a watch on never changing test01/secrets resources to the Kube API server and simply dropped network traffic originating from the app.

Withouth the patch, the connection is terminated after ~270s=30s * 9 , here is a tcpdump

Screenshot 2020-10-28 at 12 56 34

With the patch, the connection is terminated after ~25s, after dropping the traffic approximately 4 keep alive probes were sent every 5s, here is a tcpdump and the modified app

Screenshot 2020-10-28 at 12 38 09

@p0lyn0mial
Copy link
Contributor Author

I did test it on an active connection and it looks like it actually works.

I created a simple app that sends 3 req per second to list test01/secrets resources from the Kube API server and simply dropped network traffic originating from the app.

Withouth the patch, the connection is terminated after ~16m, here is a tcpdump

Screenshot 2020-10-29 at 12 11 41

With the patch, the connection is terminated after ~25s, here is a tcpdump and the modified app

Screenshot 2020-10-29 at 11 42 48

@p0lyn0mial
Copy link
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2020
@sttts
Copy link
Contributor

sttts commented Nov 2, 2020

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 2, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 2, 2020
@openshift-merge-robot openshift-merge-robot merged commit c4fa0f5 into openshift:master Nov 2, 2020
@squeed
Copy link
Contributor

squeed commented Nov 2, 2020

post-merge /lgtm - this is awesome!

@p0lyn0mial
Copy link
Contributor Author

/cherry-pick release-4.6

@openshift-cherrypick-robot

@p0lyn0mial: new pull request created: #937

Details

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants