DNS errors after upgrading rolling update of masters in cluster #371
I'm able to reproduce this pretty easily. I have a cluster with three members, etcd-d, etcd-e, and etcd-f, running etcd 3.4.13. etcd-d is the leader. If I terminate the instance, I keep quorum; that all works. The issue is when etcd-d comes back. The other two members of the cluster receive LeadershipNotifications from the new etcd-d, so the grpc peer service is ok. But within etcd itself, in the peer connection used for health checks and heartbeats, I see rejects for the new connections from etcd-d. It looks like it is the embedded grpc service for etcd (not to be confused with the peer service for etcd-manager) doing the rejecting. The messages I see from etcd-e and etcd-f indicate as much:

```
2020-12-10 19:27:56.219405 I | embed: rejected connection from "10.203.20.185:41530" (error "tls: "10.203.20.185" does not match any of DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"] (lookup etcd-d on 10.203.16.2:53: no such host)", ServerName "etcd-e.internal.k8s.ctnrva0.dev.mmerrill.net", IPAddresses ["127.0.0.1"], DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"])
```

I have to restart the docker containers on etcd-e and etcd-f. When I do that, etcd-d is then able to connect to the etcd heartbeat service on etcd-e and etcd-f. To me, this sounds like an issue within etcd itself: the running etcd somehow remembers the source IP of an old connection from the old etcd master, and when a new one comes online, it is rejected because of data still in memory. I only say that because restarting the containers for etcd-e and etcd-f solves it.
Created a gist of me reproducing this again today: https://gist.github.com/mmerrill3/a354e9289bc44e9c9f0711f6de932fdd
Do we need to put the new IP in the SAN of the client cert to bootstrap the newly created member (from an IP perspective; the peer ID is the same) into the existing cluster? Right now we just have 127.0.0.1.
I just tried to reproduce the issue one more time, and this time I got a clue. etcd-main was actually ok when I terminated the existing master instance of etcd-main and it came back online; etcd-events was still not working, for the same reason as above. But this message printed, which shows that the controller was able to continue and publish the new peer map to all the peers (needed so /etc/hosts gets updated). The cluster state was obtained, and all peer connectivity worked ok for etcd-main. That makes sense, because all the peers had their /etc/hosts entries updated.

```
{"level":"warn","ts":"2020-12-18T22:12:44.623Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e75929a0-c5ed-4839-8bd7-77e5ac9b998a/etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
```

I'm not sure why this doesn't happen all the time. It's obviously trying to communicate with the old, terminated etcd-a, which is now gone. A timeout kicks in, and then the controller does its thing and notifies all the peers of the new address for etcd-a. Why doesn't this timeout kick in every time?
I will try putting the IP in the client/server cert that etcd uses for its health checks. It's a chicken-and-egg problem: cluster members don't yet know the new IP from /etc/hosts for the newest member. Because of that, health checks fail, since the client cert carries only the DNS name (which doesn't resolve to the newer host yet) and 127.0.0.1. I'm going to put the new IP in there, which should allow etcd's health checks to work; then, hopefully, there's full quorum so etcdController can tell the members about the new IP being used by the new etcd member.
Thanks for tracking this down @mmerrill3. I suspect it's related to some of the newer TLS validation introduced in etcd: https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.2.md#security-authentication-9. I'm going to try to validate, and I think your proposed solution of introducing the client IP sounds like a probable fix. Sorry for the delay here!
Hi @justinsb, in my PR I wrapped the etcd client call in an explicit context with a timeout. That timeout occurring allows the new map of peers to be published, and then the DNS issues go away.
I have experienced this a number of times while running rolling updates on the kOps master branch. I'm not sure it is related to changes in etcd 3.2; I had not experienced this before the last few weeks.
I'm trying to reproduce failures using the kubetest2 work that's going on in kops. Any hints as to how to reproduce it? I'm trying HA clusters (3 nodes), then upgrading kubernetes. I think that most people have reproduced it on AWS, with a "real" DNS name (i.e. not k8s.local) - is that right?

I'm guessing that we need to cross the 1.14/1.15/1.16 -> 1.17 boundary (i.e. go from etcd 3.3.10 to 3.4.3). Or maybe it is 1.17/1.18 to 1.19 (i.e. 3.4.3 -> 3.4.13)? If anyone has any recollection of which versions in particular triggered it, that would be helpful too! We can then also get these tests into periodic testing... but one step at a time!
For me it is:
It felt like it happened every other roll, but it is probably a bit less. |
In my case, it was a 3-member control plane, running in AWS, k8s version 1.19.4, etcd 3.4.13. I use kops rolling-update to force a restart of the leader in the etcd cluster. The etcd cluster leader is stopped and terminated by my command, and a new EC2 instance is spun up with a new IP. This new etcd member thinks it's the "peer" leader from the lock - not the etcd leader, but the "peer" service leader. The new "peer" leader needs to publish its new IP to the other peers, but there's a point where it gets stuck. That sticking point is in the PR. I'm not sure why it gets stuck, but it does. There's supposed to be a timeout in the etcd client for the health check, but it never kicks in. Wrapping the etcd client call in another, higher-level context timeout brute-forces a fix, but I didn't see why the internal controls of the etcd client didn't time out by themselves. Once the timeout happens, the next step in the controller loop is to publish the new IP for the new member. Once that is published, the other peers update their local /etc/hosts with the new entry, and the new etcd member/EC2 instance joins the etcd cluster.
@mmerrill3 / @olemarkus how did you solve this issue? I am currently seeing an error like
The situation is that I have 2/3 masters online, and the masters that are online have an incorrect hosts file. etcd works, but the hosts file is wrong:
But the etcd-b.internal.fteu1.awseu.ftrl.io IP address is incorrect in that hosts file. It should be 172.20.83.101; there is no instance up and running with the 172.20.72.135 IP address! That incorrect IP address is in the "etcd-main" cluster, but for some reason the IP address is correct in the etcd-events cluster.
I scaled down the entire master ASG. Now the situation is:
It seems that etcd-main does not update the hosts file. Which component updates these hosts files? It seems to be stuck. I would prefer not to restore the etcd state from backups. I have tried restarting kubelet and protokube, but it does not help.
I wrote
I believe there is an error condition currently in etcd-manager when performing a rolling update. The gRPC peer service uses the instance's DNS server in /etc/resolv.conf to resolve other members of the etcd cluster when a failure occurs during a handshake.
It looks like the client side of the grpc "peer" service is using the IP address to connect to the master, but the server side of the grpc service, on the master, rejects the connection because of a name mismatch. Then the server side of the peer service does a DNS lookup on the name etcd-c, which will always fail because it's not going to be in DNS. A restart of the server side fixes this, probably because the grpc service is caching something related to client certificates.
I see messages like this from a master (etcd-a), rejecting connections from a new instance (etcd-c), because the new instance is not in DNS yet. The new member will not be in DNS until the existing master (etcd-a) accepts the new instance as a member of the cluster. That's why I think there's a race condition going on.
```
2020-11-28 14:44:53.960873 I | embed: rejected connection from "10.205.165.93:53226" (error "tls: "10.205.165.93" does not match any of DNSNames ["etcd-c" "etcd-c.internal.k8s.ctnror0.qa.abc.net"] (lookup etcd-c on 10.205.160.2:53: no such host)", ServerName "etcd-a.internal.k8s.ctnror0.qa.abc.net", IPAddresses ["127.0.0.1"], DNSNames ["etcd-c" "etcd-c.internal.k8s.ctnror0.qa.abc.net"])
```