This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Upgrade etcd to v3.4.3 fails after cluster upgrade #348

Open
hakman opened this issue Sep 28, 2020 · 5 comments

@hakman
Contributor

hakman commented Sep 28, 2020

etcd-manager fails to upgrade etcd to v3.4.3 after a cluster upgrade. We had a few reports in chat, and @olemarkus also ran into it.
https://kubernetes.slack.com/archives/C3QUFP0QM/p1600787480106600

> etcdctl endpoint status
https://etcd-a.foo.com:4001, 70ef3ca17a9da5c, 3.4.3, 30 MB, false, false, 571, 216270051, 216270051,
https://etcd-c.foo.com:4001, d5affc8170578dbf, 3.3.10, 31 MB, true, false, 571, 216270051, 0,
https://etcd-b.foo.com:4001, ffd024cf5ead90b9, 3.3.10, 31 MB, false, false, 571, 216270051, 0,

The upgrade doesn't happen and the leader is logging:

I0922 15:06:38.357264    3656 etcdserver.go:371] Reconfigure request: header:<leadership_token:"rdyYlWOrSo2PkYCtnORK7A" cluster_name:"etcd" > set_etcd_version:"3.4.3"
I0922 15:06:38.357363    3656 leadership.go:121] will reject leadership token "rdyYlWOrSo2PkYCtnORK7A": no leadership
2020-09-22 15:06:40.928989 W | etcdserver: the local etcd version 3.3.10 is not up-to-date
2020-09-22 15:06:40.929022 W | etcdserver: member 70ef3ca17a9da5c has a higher version 3.4.3

CC @olemarkus @justinsb

@mmerrill3

Hi @olemarkus, @justinsb and @hakman,
I ran across this as well when doing the upgrade from 3.4.3 to 3.4.13 with our k8s 1.19.4 clusters. I saw it when I used kops to rolling-update the "master instance group X" that was the leader for etcd and etcd-events.

What I saw was that the other two etcd master instance groups remained leaderless for a few minutes, then one of them became the leader.

When the "master instance group X" instance came online, it became the leader again, and then told the two non-updated master instance groups to update to 3.4.13.

They didn't have 3.4.13 yet, and I then saw the same message as in the original post on the non-upgraded master instances:

I1125 15:20:08.619698 4338 hosts.go:181] skipping update of unchanged /etc/hosts
2020-11-25 15:20:10.106552 W | etcdserver: the local etcd version 3.4.3 is not up-to-date
2020-11-25 15:20:10.106571 W | etcdserver: member 38c5d2eb0a0d1cfb has a higher version 3.4.13

@mmerrill3

I think this may be related to doing an etcd upgrade and an etcd-manager upgrade at the same time. Sometimes the running etcd-manager doesn't bundle the newer etcd version yet, depending on why you are doing the upgrade.

For instance, the kops 1.19 beta 2 release ships a new etcd-manager that embeds etcd 3.4.13, so I am rolling my cluster with both a newer kops image and a newer etcd version at the same time.

If I had just deployed the newer etcd-manager first, done a rolling update on the masters, then bumped the etcd version and rolled again, all masters would have had the newer version of etcd.

@mmerrill3

After I have all the members upgraded, I still need to go to the master and restart it. It seems the leadership notification that should be sent to the etcd cluster members (in leadership.go, function LeaderNotification) is not invoked. So the non-leader members of the cluster have an empty leadership token, and any token passed to them is rejected, since they don't know who the leader is.

Only after restarting the leader, by terminating the instance and letting the ASG create a new one, did I see the leadership notification reach the other members of the cluster. Then the upgrade to 3.4.13 proceeded.
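
To illustrate the failure mode, here is a minimal Go sketch (not etcd-manager's actual code; the peer type and method names are made up): a member that never received a leadership notification has an empty token, so it rejects every token presented with a reconfigure request, which matches the "will reject leadership token ...: no leadership" log above.

```go
// Minimal illustration of the failure mode described above.
// This is NOT etcd-manager's implementation; types and names are hypothetical.
package main

import "fmt"

type peer struct {
	// leaderToken is set when a leadership notification arrives; empty otherwise.
	leaderToken string
}

// acceptLeadership records the token broadcast by the current leader.
func (p *peer) acceptLeadership(token string) { p.leaderToken = token }

// validateToken rejects every token when no leader is known, mirroring the
// "will reject leadership token ...: no leadership" message in the logs.
func (p *peer) validateToken(token string) error {
	if p.leaderToken == "" {
		return fmt.Errorf("will reject leadership token %q: no leadership", token)
	}
	if token != p.leaderToken {
		return fmt.Errorf("leadership token %q does not match current leader", token)
	}
	return nil
}

func main() {
	p := &peer{} // a restarted peer that never saw a leadership notification
	if err := p.validateToken("rdyYlWOrSo2PkYCtnORK7A"); err != nil {
		fmt.Println(err) // the set_etcd_version reconfigure request is refused
	}
}
```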

@hakman
Contributor Author

hakman commented Nov 26, 2020

Thanks @mmerrill3. Very nice debugging.

justinsb added a commit to justinsb/etcd-manager that referenced this issue Jan 18, 2021
Previously we assumed that if our peers didn't change and we had the
leadership, that the peers would continue to acknowledge our
leadership.  But this assumes that a peer didn't restart or crash.
Instead, on every leader iteration, check with all the peers that we
are still the leader (using the same GRPC method).  For negligible
additional traffic, this should enable greater recovery from errors
and greater consistency of leader election.

Issue kopeio#348
@justinsb
Contributor

Yes - great debugging. I was able to reproduce with those hints - thank you so much!

#391 is intended to fix it; we're essentially going to broadcast leadership as part of every leader iteration; please take a look and see if you agree.
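
Roughly, the idea is the following (a hedged sketch only; the peerClient interface, fakePeer type and leaderLoop function are illustrative, not the code in #391): instead of notifying peers only when the peer set changes, the leader re-sends its leadership token on every iteration of the leader loop, so a peer that restarted and lost the token learns it again.

```go
// Hedged sketch of the approach described above, not the actual etcd-manager code.
package main

import (
	"fmt"
	"time"
)

// peerClient stands in for the GRPC client used to notify peers of leadership.
type peerClient interface {
	NotifyLeadership(token string) error
}

// fakePeer is a stand-in peer that simply acknowledges notifications.
type fakePeer struct{ name string }

func (p *fakePeer) NotifyLeadership(token string) error {
	fmt.Printf("%s acknowledged leadership token %q\n", p.name, token)
	return nil
}

// leaderLoop re-broadcasts leadership on every iteration; if a peer refuses,
// a real implementation would step down and re-run the election.
func leaderLoop(token string, peers []peerClient, interval time.Duration, iterations int) {
	for i := 0; i < iterations; i++ {
		for _, p := range peers {
			if err := p.NotifyLeadership(token); err != nil {
				fmt.Printf("peer did not acknowledge leadership: %v\n", err)
			}
		}
		time.Sleep(interval)
	}
}

func main() {
	peers := []peerClient{&fakePeer{name: "etcd-b"}, &fakePeer{name: "etcd-c"}}
	leaderLoop("rdyYlWOrSo2PkYCtnORK7A", peers, 10*time.Millisecond, 2)
}
```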
