Upgrade etcd to v3.4.3 fails after cluster upgrade #348
Comments
Hi @olemarkus, @justinsb, and @hakman. What I saw was that the other two etcd master instance groups remained leaderless for a few minutes, then one of them became the leader. When the "master instance group X" instance came back online, it became the leader again and told the two non-updated master instance groups to update to 3.4.13. Their etcd-manager doesn't have 3.4.13 yet, and I then see the message from the original post on the non-upgraded master instances: I1125 15:20:08.619698 4338 hosts.go:181] skipping update of unchanged /etc/hosts
I think this may be related to doing an etcd upgrade in combination with a newer version of etcd-manager at the same time. Sometimes the running etcd-manager doesn't have the newer version of etcd yet, depending on why you are doing the upgrade. For instance, the kops 1.19.beta-2 release has the new etcd-manager that embeds etcd 3.4.13, so I am rolling my cluster with both a newer kops image and a newer etcd version at the same time. If I had rolled out the newer etcd-manager version first, done a rolling update on the masters, then updated the etcd version and rolled again, all masters would already have had the newer version of etcd.
After I have all the members upgraded, I still need to go to the leader master and restart it. It seems the leadership notification event that should be sent to the etcd cluster members, in leadership.go, function LeaderNotification, is not invoked. So the non-leader members of the cluster have an empty leadership token, and any token passed to them is rejected, since they don't know who the leader is. Only after restarting the leader, by terminating the instance and letting the ASG create a new instance, did I see the leadership notification reach the other members of the cluster. Then the upgrade to 3.4.13 proceeded.
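To make that failure mode concrete, here is a minimal Go sketch. The types and names (peer, OnLeaderNotification, HandleCommand, the token values) are assumptions for illustration, not etcd-manager's actual code: a member that restarted holds an empty leadership token and rejects every command carrying one until a fresh leader notification arrives.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// peer models a cluster member's view of leadership (hypothetical type).
type peer struct {
	mu          sync.Mutex
	leaderToken string // empty after a restart or crash
}

// OnLeaderNotification is what a leadership broadcast would invoke on each member.
func (p *peer) OnLeaderNotification(token string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.leaderToken = token
}

// HandleCommand rejects any request whose token does not match the known leader.
func (p *peer) HandleCommand(token string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.leaderToken == "" || token != p.leaderToken {
		return errors.New("rejecting request: unknown leadership token")
	}
	return nil
}

func main() {
	restarted := &peer{} // simulates a member that came back with no token

	// Without a new leader notification, every command from the leader is rejected.
	fmt.Println(restarted.HandleCommand("leader-abc"))

	// Once the leader re-broadcasts its token, the member accepts commands again.
	restarted.OnLeaderNotification("leader-abc")
	fmt.Println(restarted.HandleCommand("leader-abc"))
}
```

Running this prints a rejection first and nil only after the notification is delivered, which mirrors the observed behaviour: the upgrade only moved forward once the leader was replaced and re-announced itself.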
Thanks @mmerrill3. Very nice debugging.
Previously we assumed that if our peers didn't change and we had the leadership, the peers would continue to acknowledge our leadership. But this assumes that a peer didn't restart or crash. Instead, on every leader iteration, check with all the peers that we are still the leader (using the same gRPC method). For negligible additional traffic, this should enable greater recovery from errors and greater consistency of leader election. Issue kopeio#348
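A hedged sketch of that approach, with assumed interfaces (Peer, AssertLeadership, leaderLoop and fakePeer are illustrative, not the real etcd-manager gRPC API): the leader re-asserts its token against every peer at the start of each iteration, so a peer that restarted relearns who the leader is instead of silently rejecting its commands, and the leader steps down if any peer refuses.

```go
package main

import (
	"context"
	"fmt"
)

// Peer stands in for the gRPC client the leader would call; the interface is assumed.
type Peer interface {
	AssertLeadership(ctx context.Context, token string) (accepted bool, err error)
}

type leaderLoop struct {
	token string
	peers []Peer
}

// runOnce is one leader iteration: re-broadcast leadership first, then reconcile.
func (l *leaderLoop) runOnce(ctx context.Context) bool {
	for _, p := range l.peers {
		ok, err := p.AssertLeadership(ctx, l.token)
		if err != nil || !ok {
			// A peer no longer acknowledges us (e.g. it restarted and lost the
			// token); give up leadership so a fresh election and notification
			// can happen instead of looping with rejected commands.
			return false
		}
	}
	// ...drive the actual work here, e.g. the pending etcd version upgrade...
	return true
}

// fakePeer models a member: one that restarted starts with an empty token and
// learns it from the broadcast, which is what unsticks the upgrade.
type fakePeer struct{ known string }

func (f *fakePeer) AssertLeadership(_ context.Context, token string) (bool, error) {
	if f.known == "" {
		f.known = token
	}
	return f.known == token, nil
}

func main() {
	loop := &leaderLoop{
		token: "leader-abc",
		peers: []Peer{&fakePeer{}, &fakePeer{known: "leader-abc"}},
	}
	fmt.Println("still leader:", loop.runOnce(context.Background()))
}
```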
Yes - great debugging. I was able to reproduce with those hints - thank you so much! #391 is intended to fix it; we're essentially going to broadcast leadership as part of every leader iteration. Please take a look and see if you agree.
etcd-manager fails to upgrade etcd to v3.4.3 after a cluster upgrade. We had a few reports in chat, and @olemarkus also bumped into it.
https://kubernetes.slack.com/archives/C3QUFP0QM/p1600787480106600
> etcdctl endpoint status
> https://etcd-a.foo.com:4001, 70ef3ca17a9da5c, 3.4.3, 30 MB, false, false, 571, 216270051, 216270051,
> https://etcd-c.foo.com:4001, d5affc8170578dbf, 3.3.10, 31 MB, true, false, 571, 216270051, 0,
> https://etcd-b.foo.com:4001, ffd024cf5ead90b9, 3.3.10, 31 MB, false, false, 571, 216270051, 0,
The upgrade doesn't happen and the leader is logging:
CC @olemarkus @justinsb