Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Weave reconnect occasionally fails because InitSARemote can get called after Destroy #3675

Closed
hpdvanwyk opened this issue Jul 31, 2019 · 0 comments · Fixed by #3676
Closed
Milestone

Comments

@hpdvanwyk
Copy link

What you expected to happen?

Weave should always reconnect after a network failure

What happened?

Very occasionally when reconnecting InitSARemote gets called after Destroy has been called on a connection. This causes a valid xfrm policy to be updated to an old SPI and the connection stops working.

How to reproduce it?

This was found while testing #3669 . I used the following script:

#!/bin/bash

while true; do
        # disconnect network
        qm set 102 --net1 model=virtio,bridge=vmbr1,macaddr=62:40:98:FF:02:72,link_down=1
        sleep 55
        # reconnect network
        qm set 102 --net1 model=virtio,bridge=vmbr1,macaddr=62:40:98:FF:02:72,link_down=0
        sleep 10

        # check if weave still works
        if ssh [email protected] ping 10.42.128.0 -c1; then
                date
                echo pass
        else
                sleep 10
                if ssh [email protected] ping 10.42.128.0 -c1; then
                        date
                        echo pass
                else
                        date
                        echo broken
                        break
                fi
        fi
done

This bug only happened after about 26 hours.
Without #3669 it is unlikely to trigger since using the same method would hit #3666 much more frequently.

Anything else we need to know?

Exact same setup as #3666

Versions:

/home/weave # ./weave --local version
weave 2.5.2

$ docker version
Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.4
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:24:22 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:25:33 2019
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
Linux hostyname0 4.14.122-rancher #1 SMP Tue May 28 01:50:21 UTC 2019 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

Shortened Logs:

Failing node:

INFO: 2019/07/26 16:16:54.567243 Setting up IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/26 16:16:54.567323 ipsec: InitSALocal: 192.168.128.10 -> 192.168.128.11 :6784 0xf56cae2c
INFO: 2019/07/26 16:16:54.579407 ->[192.168.128.10:34095|56:ff:d7:91:ac:16(192.168.128.10)]: connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11)
INFO: 2019/07/26 16:16:54.579426 Setting up IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/26 16:16:54.579721 ipsec: InitSALocal: 192.168.128.10 -> 192.168.128.11 :6784 0x37721cdb
INFO: 2019/07/26 16:16:54.579398 Destroying IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/26 16:16:54.588961 fastdp ->[192.168.128.10:6784|56:ff:d7:91:ac:16(192.168.128.10)]: IPSec init SA remote
INFO: 2019/07/26 16:16:54.589040 ipsec: InitSARemote: 192.168.128.11 -> 192.168.128.10 :6784 0x3bce4dcc
INFO: 2019/07/26 16:16:54.589237 ipsec: destroy: in 192.168.128.10 -> 192.168.128.11 0xf56cae2c
INFO: 2019/07/26 16:16:54.600750 ipsec: InitSARemote: 192.168.128.11 -> 192.168.128.10 :6784 0xb66479db

Other node:

INFO: 2019/07/26 16:16:54.558687 Setting up IPsec between 56:ff:d7:91:ac:16(192.168.128.10) and 02:df:bc:77:e3:17(192.168.128.11)
INFO: 2019/07/26 16:16:54.558761 ipsec: InitSALocal: 192.168.128.11 -> 192.168.128.10 :6784 0xb66479db
INFO: 2019/07/26 16:16:54.576564 fastdp ->[192.168.128.11:6784|02:df:bc:77:e3:17(192.168.128.11)]: IPSec init SA remote
INFO: 2019/07/26 16:16:54.576600 ipsec: InitSARemote: 192.168.128.10 -> 192.168.128.11 :6784 0xf56cae2c
INFO: 2019/07/26 16:16:54.577313 Setting up IPsec between 56:ff:d7:91:ac:16(192.168.128.10) and 02:df:bc:77:e3:17(192.168.128.11)
INFO: 2019/07/26 16:16:54.577378 ipsec: InitSALocal: 192.168.128.11 -> 192.168.128.10 :6784 0x3bce4dcc
INFO: 2019/07/26 16:16:54.577249 ->[192.168.128.11:6783|02:df:bc:77:e3:17(192.168.128.11)]: connection shutting down due to error: Multiple connections to 02:df:bc:77:e3:17(192.168.128.11) added to 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/26 16:16:54.577541 Destroying IPsec between 56:ff:d7:91:ac:16(192.168.128.10) and 02:df:bc:77:e3:17(192.168.128.11)
INFO: 2019/07/26 16:16:54.584163 ipsec: destroy: in 192.168.128.11 -> 192.168.128.10 0xb66479db
INFO: 2019/07/26 16:16:54.588879 fastdp ->[192.168.128.11:6784|02:df:bc:77:e3:17(192.168.128.11)]: IPSec init SA remote
INFO: 2019/07/26 16:16:54.594700 ipsec: destroy: out 192.168.128.10 -> 192.168.128.11 0xf56cae2c
INFO: 2019/07/26 16:16:54.594844 ipsec: InitSARemote: 192.168.128.10 -> 192.168.128.11 :6784 0x37721cdb

On the failing node InitSARemote for 0xb66479db happens after Destroy for that connection. This updates the xfrm policy to a no longer used SPI.

More complete logs:

Failing node: https://gist.github.com/hpdvanwyk/a9649492882d3ce8ac0ec474dc2e4ef1
Non failing node: https://gist.github.com/hpdvanwyk/027d8584bdf8740254b5c0da76ecff20

This looks like it could be caused by very unlikely scheduling of the mesh.receiveTCP goroutine.
Adding a check to fastDatapathForwarder.handleCryptoInitSARemote to not run on a stopped forwarder seems to fix it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants