This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Auth Errors on Cluster Upgrade #399

Open
wskulley opened this issue Feb 16, 2021 · 2 comments

@wskulley

wskulley commented Feb 16, 2021

During an upgrade from 'default' kops 1.18 to 'default' kops 1.19, I encountered the following error on the first etcd-manager node to roll:

unable to grpc-ping discovered peer 10.28.114.172:3996: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

The replacement etcd would not join the existing cluster. I was able to work around the issue by adding

    manager:
      env:
      - name: GODEBUG
        value: x509ignoreCN=0

to the etcd-manager section of the kops cluster spec.
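
For context, here is a minimal sketch of where that override might sit in the full cluster spec (as edited with `kops edit cluster`). The cluster names `main` and `events` are the kops defaults and are an assumption here; the report does not say which etcd clusters were changed. Other fields of each entry are omitted.

```yaml
# Sketch only: assumes the default "main" and "events" etcd clusters.
# Fields such as etcdMembers are left out for brevity.
etcdClusters:
- name: main
  manager:
    env:
    # Re-enables legacy Common Name matching in binaries built with Go 1.15/1.16.
    - name: GODEBUG
      value: x509ignoreCN=0
- name: events
  manager:
    env:
    - name: GODEBUG
      value: x509ignoreCN=0
```

Note that Go removed the `x509ignoreCN=0` switch in Go 1.17, so this can only serve as a stopgap for binaries built with Go 1.15 or 1.16.
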
@ssoriche

We recently did the same upgrade from kops 1.18 to 1.19 and encountered the same error message. In our case, however, it happened on our second etcd-manager node, and setting GODEBUG=x509ignoreCN=0 on the replacement node did not allow etcd to start, which blocked kube-apiserver from starting, and so on.

To get etcd to start, we had to perform a rolling update on the third node (which had the IP address from the error message) with the --cloudonly option specified. Once the third node was replaced (and hadn't necessarily rejoined the cluster), the second node started etcd and joined both the etcd and Kubernetes clusters. The third node joined both clusters without issue.

@sp-francisco-manas

sp-francisco-manas commented Mar 16, 2021

⚠️ +1

Since etcd-manager was upgraded to Go 1.15 (CommonName deprecation), all upgrades to kOps 1.19 are breaking (the first master never joins the etcd clusters). The problem is that the generated certs contain this field, which has been deprecated for 20 years already. Go has enforced this since 1.15 and refuses to connect even if the cert has an AltNames field (#362 added the field, but it should have removed the CN too).

Until a proper fix is implemented, you need to use the workaround to roll back to the old behaviour in Go. I think the proper solution is to stop generating certificates with a CN in etcd-manager and rotate the certs on all masters later on (1.20, 1.21?). But I'm not sure whether removing it would have second-order effects.

Update: In our case, the issue was not related to this. We had a config mistake: both etcd and etcd-events were bound to the same metrics port. The Go deprecation message still appears in the logs during startup, but it was noise, not the underlying issue.
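
A sketch of what a non-conflicting setup might look like, assuming the metrics listener was set through the `ETCD_LISTEN_METRICS_URLS` environment variable on each etcd-manager (the comment does not say how the port was actually configured). The cluster names and port numbers are illustrative; what matters is that the two clusters use different ports, since both etcd processes run on the same master nodes.

```yaml
# Sketch only: names and ports are assumptions, not taken from the report.
etcdClusters:
- name: main
  manager:
    env:
    - name: ETCD_LISTEN_METRICS_URLS
      value: http://0.0.0.0:8081
- name: events
  manager:
    env:
    - name: ETCD_LISTEN_METRICS_URLS
      value: http://0.0.0.0:8082   # must differ from the main cluster's port
```
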
