This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

DNS errors after upgrading rolling update of masters in cluster #371

Open

mmerrill3 opened this issue Nov 28, 2020 · 14 comments


mmerrill3 commented Nov 28, 2020

I believe there is an error condition currently in etcd-manager when performing a rolling update. The gRPC peer service uses the instance's DNS server in /etc/resolv.conf to resolve other members of the etcd cluster when a failure occurs during a handshake.

It looks like the client side of the gRPC "peer" service uses the IP address to connect to the master, but the server side of the gRPC service, on the master, rejects the connection because of a name mismatch. The server side then does a DNS lookup on the name etcd-c, which will always fail because that name is not going to be in DNS. A restart of the server side fixes this, probably because the gRPC service is caching something with the client certificates.

I see messages like this from a master (etcd-a) rejecting connections from a new instance (etcd-c), because the new instance is not in DNS yet. The new member will not be in DNS until the existing master (etcd-a) accepts the new instance as a member of the cluster. That's why I think there's a race condition going on.

2020-11-28 14:44:53.960873 I | embed: rejected connection from "10.205.165.93:53226" (error "tls: "10.205.165.93" does not match any of DNSNames ["etcd-c" "etcd-c.internal.k8s.ctnror0.qa.abc.net"] (lookup etcd-c on 10.205.160.2:53: no such host)", ServerName "etcd-a.internal.k8s.ctnror0.qa.abc.net", IPAddresses ["127.0.0.1"], DNSNames ["etcd-c" "etcd-c.internal.k8s.ctnror0.qa.abc.net"])
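
Roughly, the check behind this rejection looks like the following. This is only a sketch of my reading of the log message, not etcd's actual code: the server takes the remote IP of the incoming peer connection and tries to match it against the client certificate's SANs, resolving any DNS SANs through the resolver in /etc/resolv.conf.

```go
package sketch

import (
	"crypto/x509"
	"net"
)

// remoteMatchesCert reports whether the remote peer IP is covered by the
// client certificate's SANs, either directly (IP SANs) or indirectly
// (DNS SANs that resolve to the remote IP).
func remoteMatchesCert(cert *x509.Certificate, remoteIP net.IP) bool {
	// Direct IP SAN match; the peer certs here only carry 127.0.0.1.
	for _, ip := range cert.IPAddresses {
		if ip.Equal(remoteIP) {
			return true
		}
	}
	// DNS SAN match: resolve each name and compare against the remote IP.
	// This is the step that fails during a rolling update, because the new
	// member's name (e.g. etcd-c) is not resolvable on the other peers yet:
	// "lookup etcd-c on 10.205.160.2:53: no such host".
	for _, name := range cert.DNSNames {
		addrs, err := net.LookupHost(name)
		if err != nil {
			continue
		}
		for _, a := range addrs {
			if ip := net.ParseIP(a); ip != nil && ip.Equal(remoteIP) {
				return true
			}
		}
	}
	return false
}
```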


mmerrill3 commented Dec 10, 2020

I'm able to reproduce this pretty easily.

I have a cluster with three members, etcd-d, etcd-e, and etcd-f, running etcd 3.4.13.

Etcd-d is the leader.

If I terminate the instance, I keep quorum; that all works. The issue is when etcd-d comes back. The other two members of the cluster receive LeadershipNotifications from the new etcd-d, so the gRPC peer service is OK. But within etcd itself, in the peer connection used for health checks and heartbeats, I see rejects for the new connections from etcd-d. It looks like the embedded gRPC service for etcd (not to be confused with the peer service for etcd-manager) is the one rejecting them.

The messages I see from etcd-e and etcd-f indicate as much:

2020-12-10 19:27:56.219405 I | embed: rejected connection from "10.203.20.185:41530" (error "tls: "10.203.20.185" does not match any of DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"] (lookup etcd-d on 10.203.16.2:53: no such host)", ServerName "etcd-e.internal.k8s.ctnrva0.dev.mmerrill.net", IPAddresses ["127.0.0.1"], DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"])

I have to restart the docker containers on etcd-e and etcd-f. When I do that, etcd-d is then able to connect to the etcd heartbeat service on etcd-e and etcd-f.

To me, this sounds like an issue with etcd itself: the running etcd somehow remembers the source IP of an old connection from the old etcd master, and when a new one comes online it is rejected because of some data still in memory. I only say that because restarting the containers for etcd-e and etcd-f solves it.

@mmerrill3

Created a gist of me reproducing this again today:

https://gist.github.com/mmerrill3/a354e9289bc44e9c9f0711f6de932fdd

@mmerrill3

Do we need to put the new IP in the SAN of the client cert to bootstrap the newly created member (from an IP perspective; the peer ID is the same) into the existing cluster? Right now we just have 127.0.0.1.
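
For reference, a quick way to confirm which SANs the peer cert actually carries (the file path below is just a placeholder, not the real location):

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	// NOTE: placeholder path; point this at the peer cert etcd is using.
	pemBytes, err := os.ReadFile("/path/to/etcd-peer.crt")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	fmt.Println("DNSNames:   ", cert.DNSNames)
	fmt.Println("IPAddresses:", cert.IPAddresses) // currently just 127.0.0.1
}
```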

@mmerrill3

I just tried to reproduce the issue one more time. This time I got a clue: etcd-main was actually OK when I terminated the existing master instance of etcd-main and it came back online. etcd-events was still not working, for the same reason above. But this message printed, which shows that the controller was then able to continue and publish the new peer map to all the peers (needed so /etc/hosts gets updated). The cluster state got obtained, and all peer connectivity worked OK for etcd-main. Makes sense, because all the peers had their /etc/hosts entries updated.

{"level":"warn","ts":"2020-12-18T22:12:44.623Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e75929a0-c5ed-4839-8bd7-77e5ac9b998a/etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
W1218 22:12:44.623366 4345 controller.go:730] health-check unable to reach member 603088857070764962 on [https://etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001]: context deadline exceeded
I1218 22:12:44.640264 4345 controller.go:292] etcd cluster state: etcdClusterState

I'm not sure why this doesn't happen all the time. It's obviously trying to communicate with the old, terminated etcd-a, which is now gone. A timeout kicks in, and then the controller does its thing and notifies all the peers of the new address for etcd-a. Why doesn't this timeout kick in all the time?

@mmerrill3

I will try putting the IP in the client/server cert that etcd uses for its health checks. It's a chicken-and-egg issue: cluster members don't yet know the newest member's new IP from /etc/hosts, so health checks fail, because the client cert only has the DNS name (which does not yet resolve to the newer host) and 127.0.0.1. I'm going to put the new IP in there, which will allow etcd's health checks to work; then, hopefully, there's a full quorum so the etcd controller can tell the members about the new IP being used by the new etcd member.
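
Something like the following is what I have in mind. It's only an illustrative sketch of issuing a peer cert with the instance IP added to the SANs, not etcd-manager's actual certificate code; the function names and validity period are made up:

```go
package sketch

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"net"
	"time"
)

// newPeerCertTemplate builds a cert template whose SANs include the
// instance's current IP in addition to 127.0.0.1 and the DNS names.
func newPeerCertTemplate(memberName, clusterDomain string, instanceIP net.IP) *x509.Certificate {
	return &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: memberName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(10 * 365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		DNSNames:     []string{memberName, memberName + "." + clusterDomain},
		// The key part: add the instance IP, not just the loopback address,
		// so peers can verify the connection before DNS/hosts entries exist.
		IPAddresses: []net.IP{net.ParseIP("127.0.0.1"), instanceIP},
	}
}

// issuePeerCert signs the template with the cluster CA and returns the DER
// certificate plus the generated private key.
func issuePeerCert(caCert *x509.Certificate, caKey *rsa.PrivateKey, memberName, clusterDomain string, instanceIP net.IP) ([]byte, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := newPeerCertTemplate(memberName, clusterDomain, instanceIP)
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	return der, key, nil
}
```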

@justinsb

Thanks for tracking this down @mmerrill3 ... I suspect it's related to some of the newer TLS validation introduced in etcd: https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.2.md#security-authentication-9

I'm going to try to validate, and I think your proposed solution of introducing the client IP sounds like a probable fix.

Sorry for the delay here!

@mmerrill3

Hi @justinsb, in my PR I wrapped the etcd client call in an explicit context with a timeout. That timeout occurring allows the new map of peers to get published, and then the DNS issues go away.
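
The change is roughly shaped like this. The function name, the timeout value, and the health-check call are illustrative rather than the actual etcd-manager code; the import path assumes the etcd 3.4 client:

```go
package sketch

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// checkMemberHealth probes a single member endpoint with its own deadline so
// that an unreachable (terminated) instance cannot stall the controller loop
// that publishes the updated peer map.
func checkMemberHealth(parent context.Context, cli *clientv3.Client, endpoint string) error {
	ctx, cancel := context.WithTimeout(parent, 5*time.Second)
	defer cancel()

	// Any per-endpoint call works as a liveness probe; Status targets the
	// specific endpoint rather than whichever member the client picks.
	_, err := cli.Status(ctx, endpoint)
	return err
}
```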

@olemarkus

I have experienced this a number of times while doing rolling updates on the kOps master branch. I'm not sure it is related to changes in etcd 3.2; I had not experienced this before the last few weeks.

@justinsb

I'm trying to reproduce failures using the kubetest2 work that's going on in kops. Any hints as to how to reproduce it?

I'm trying HA clusters (3 nodes), then upgrading kubernetes. I think that most people have reproduced it on AWS, with a "real" DNS name (i.e. not k8s.local) - is that right?

I'm guessing that we need to cross the 1.14/1.15/1.16 -> 1.17 boundary (i.e. go from etcd 3.3.10 to 3.4.3). Or maybe is it 1.17/1.18 to 1.19 (i.e. 3.4.3 -> 3.4.13)? If anyone has any recollection of which versions in particular triggered it, that would be helpful too!

We can then also get these tests into periodic testing ... but one step at a time!

@olemarkus

For me it is:

  • create a cluster using the master branch with three control plane nodes
  • rotate the cluster

It felt like it happened every other roll, but it is probably a bit less often.

@mmerrill3

For my case, it was a 3-member control plane, running in AWS, k8s version 1.19.4, etcd 3.4.13. I use kops rolling-update to force a restart of the leader in the etcd cluster. The etcd cluster leader is stopped and terminated by my command, and a new EC2 instance is spun up with a new IP. This new etcd member thinks it is the "peer" leader because of the lock (not the etcd leader, but the "peer" service leader). The new "peer" leader needs to publish its new IP to the other peers, but there's a point where it gets stuck. That sticking point is in the PR. I'm not sure why it gets stuck, but it does. There's supposed to be a timeout in the etcd client for the health check, and it never kicks in. Wrapping the etcd client call with another, higher-level context timeout brute-forces a fix, but I didn't see why the etcd client's internal controls didn't time out by themselves.

Once the timeout happens, the next step in the controller loop is to publish the new IP for the new member. Once that is published, the other peers update their local /etc/hosts with the new entry, and the new etcd member/ec2 instance joins the etcd cluster.
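
As a sketch of what that publish step has to do on each peer, rewriting the managed block in /etc/hosts (the "Begin/End host entries managed by etcd-manager" section shown later in this thread) might look like this. Illustrative only, not the actual etcd-manager implementation:

```go
package sketch

import (
	"fmt"
	"sort"
	"strings"
)

// rewriteManagedHosts replaces the marker-delimited block in a hosts file
// with a freshly generated one built from the latest peer address map,
// leaving every line outside the managed block untouched.
func rewriteManagedHosts(current string, marker string, addrToHosts map[string][]string) string {
	begin := fmt.Sprintf("# Begin host entries managed by %s - do not edit", marker)
	end := fmt.Sprintf("# End host entries managed by %s", marker)

	// Keep everything outside the managed block.
	var out []string
	inBlock := false
	for _, line := range strings.Split(current, "\n") {
		switch {
		case strings.TrimSpace(line) == begin:
			inBlock = true
		case strings.TrimSpace(line) == end:
			inBlock = false
		case !inBlock:
			out = append(out, line)
		}
	}

	// Append the regenerated block with the latest peer addresses.
	out = append(out, begin)
	addrs := make([]string, 0, len(addrToHosts))
	for addr := range addrToHosts {
		addrs = append(addrs, addr)
	}
	sort.Strings(addrs)
	for _, addr := range addrs {
		out = append(out, addr+"\t"+strings.Join(addrToHosts[addr], " "))
	}
	out = append(out, end)
	return strings.Join(out, "\n") + "\n"
}
```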


zetaab commented Feb 10, 2021

@mmerrill3 / @olemarkus how did you solve this issue?

I am currently seeing errors like:

2021-02-10 15:48:27.137411 I | embed: rejected connection from "172.20.83.101:46836" (error "tls: \"172.20.83.101\" does not match any of DNSNames [\"etcd-b.internal.fteu1.awseu.ftrl.io\"]", ServerName "etcd-c.internal.fteu1.awseu.ftrl.io", IPAddresses ["127.0.0.1"], DNSNames ["etcd-b.internal.fteu1.awseu.ftrl.io"])

The situation is that I have 2/3 masters online, and the masters that are online have an incorrect hosts file.

An etcd that works but has an incorrect hosts file:

# Begin host entries managed by etcd-manager[etcd] - do not edit
172.20.120.253	etcd-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-a.internal.fteu1.awseu.ftrl.io
172.20.72.135	etcd-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd]

But the IP address of etcd-b.internal.fteu1.awseu.ftrl.io is incorrect in that hosts file: it should be 172.20.83.101. There is no instance up and running with that 172.20.72.135 IP address.

The incorrect IP address is in the "etcd-main" cluster, but for some reason the IP address is correct in the etcd-events cluster:

root@ip-172-20-62-209:/home/ubuntu# docker exec -it 275ac3908f7a cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd-events] - do not edit
172.20.120.253	etcd-events-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-events-a.internal.fteu1.awseu.ftrl.io
172.20.83.101	etcd-events-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd-events]


zetaab commented Feb 10, 2021

I scaled down the entire master ASG; now the situation is:

root@ip-172-20-62-209:/home/ubuntu# docker exec -it 275ac3908f7a cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd-events] - do not edit
172.20.120.253	etcd-events-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-events-a.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd-events]
root@ip-172-20-62-209:/home/ubuntu# docker exec -it 135ab0143a0d cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd] - do not edit
172.20.120.253	etcd-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-a.internal.fteu1.awseu.ftrl.io
172.20.72.135	etcd-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd]

It seems that etcd-main does not update the hosts file?!

Which component is updating these hosts files? It seems to be stuck. I would rather not restore the etcd state from backups. I have tried restarting kubelet and protokube, but that does not help.


zetaab commented Feb 10, 2021

I ran docker restart <containerid> on both etcd-main containers that had the incorrect hosts file (do not run it on all instances at the same time). After that I started the missing master group, and now I can see the correct IP address in the hosts file, and I have a working etcd cluster again with 3/3 members in it.
