Handle etcd compacted revisions #208

Closed
lukasertl opened this issue Feb 22, 2024 · 10 comments · Fixed by #217

@lukasertl
Contributor

lukasertl commented Feb 22, 2024

vip-manager currently doesn't seem to handle etcd compacted revisions gracefully:

Feb 22 15:17:03 host vip-manager[3872968]: 2024/02/22 15:17:03 IP address 10.x.y.z/16 state is true, desired true
Feb 22 15:17:13 host vip-manager[3872968]: 2024/02/22 15:17:13 IP address 10.x.y.z/16 state is true, desired true
Feb 22 15:17:21 host vip-manager[3872968]: 2024/02/22 15:17:21 etcd watcher returned error: etcdserver: mvcc: required revision has been compacted
Feb 22 15:17:21 host vip-manager[3872968]: 2024/02/22 15:17:21 IP address 10.x.y.z/16 state is true, desired false
Feb 22 15:17:21 host vip-manager[3872968]: 2024/02/22 15:17:21 Removing address 10.x.y.z/16 on ens192

Restarting vip-manager fixes this, but of course the database is not accessible until then.

@cfredericksen

I noticed this too when I was testing failover. The new leader requires a restart of vip-manager. I'm trying to figure out a way to automatically restart the vip-manager daemon on failover using monit.

@pashagolub pashagolub self-assigned this Mar 19, 2024
@pashagolub
Collaborator

Any chance you guys can throw me a link so I can get an idea of compacted revisions without chatting with AI or Google? :)

Thanks in advance!

@pashagolub pashagolub added the bug label Mar 19, 2024
@cfredericksen

cfredericksen commented Mar 19, 2024

I guess I am less aware of "compacted revisions" and more referring to auto-recovery.

I am using vip-manager as part of https://github.com/vitabaks/postgresql_cluster. If I hard stop (power off the VM, simulating a hardware failure) the leader of the cluster, vip-manager never recovers until I restart vip-manager on the new leader. Sorry for the confusion.

@SDV109

SDV109 commented Apr 5, 2024

Hi!
I also encountered this problem when one etcd node is unavailable for a short period of time. In my case, etcd 3.5.11 and vip-manager 2.3.0 are used. The workaround that helped was to restart vip-manager on the server that became the master in the Patroni cluster.

@pashagolub, I used the auto-compaction section of the official etcd documentation (https://etcd.io/docs/v3.5/op-guide/maintenance/) when I added a PR to the https://github.com/vitabaks/postgresql_cluster repository that enables compaction of the internal etcd database.

@vitabaks

@pashagolub please help.

etcdserver: mvcc: required revision has been compacted

Maybe it would be enough to simply retry here and re-read the latest value of the key?
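
A minimal sketch of that retry idea, under stated assumptions: `leaderStore`, `nextLeader`, `fakeStore`, and the key `/service/leader` are hypothetical names made up for illustration, and `errCompacted` is a local stand-in for the error in the log. This is not vip-manager's code or the etcd client API; it only shows the control flow of falling back to a fresh read when the watch revision has been compacted, instead of treating the error as a lost leadership.

```go
package main

import (
	"errors"
	"fmt"
)

// errCompacted is a local stand-in for the etcd error seen in the log above.
var errCompacted = errors.New("etcdserver: mvcc: required revision has been compacted")

// leaderStore is a hypothetical abstraction over the DCS client.
// It is not the real etcd or vip-manager API.
type leaderStore interface {
	// Get returns the current value of key and the revision it was read at.
	Get(key string) (value string, rev int64, err error)
	// Watch returns the next change to key, starting from fromRev.
	Watch(key string, fromRev int64) (value string, rev int64, err error)
}

// nextLeader returns the next observed leader value. When the watch fails
// because fromRev was compacted, it re-reads the key instead of giving up.
func nextLeader(s leaderStore, key string, fromRev int64) (string, int64, error) {
	value, rev, err := s.Watch(key, fromRev)
	if errors.Is(err, errCompacted) {
		// The requested history is gone, but the current value is still readable.
		return s.Get(key)
	}
	return value, rev, err
}

// fakeStore simulates a store whose history below compactRev was discarded.
type fakeStore struct {
	value      string
	rev        int64
	compactRev int64
}

func (f *fakeStore) Get(key string) (string, int64, error) {
	return f.value, f.rev, nil
}

func (f *fakeStore) Watch(key string, fromRev int64) (string, int64, error) {
	if fromRev < f.compactRev {
		return "", 0, errCompacted
	}
	return f.value, f.rev, nil
}

func main() {
	s := &fakeStore{value: "pgnode01", rev: 42, compactRev: 40}
	// Revision 10 is older than the compaction point, so Watch fails and
	// nextLeader falls back to Get.
	value, rev, err := nextLeader(s, "/service/leader", 10)
	fmt.Println(value, rev, err) // prints: pgnode01 42 <nil>
}
```

The caller can then resume watching from the returned revision, so a compaction never leaves it without a known value for the key.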

@pashagolub
Collaborator

I'd rather fix this error. But I cannot find how I can reproduce it in a simple way

@vitabaks

vitabaks commented Apr 15, 2024

I cannot find how I can reproduce it in a simple way

Add the ETCD_AUTO_COMPACTION_RETENTION="1" option to etcd.conf.

Example:

ETCD_NAME="pgnode01"
ETCD_LISTEN_CLIENT_URLS="http://192.168.150.141:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.150.141:2379"
ETCD_LISTEN_PEER_URLS="http://192.168.150.141:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.150.141:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-postgres-cluster"
ETCD_INITIAL_CLUSTER="pgnode01=http://192.168.150.141:2380,pgnode02=http://192.168.150.142:2380,pgnode03=http://192.168.150.143:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"
ETCD_INITIAL_ELECTION_TICK_ADVANCE="false"
ETCD_AUTO_COMPACTION_RETENTION="1"

Or use postgresql_cluster to deploy the Postgres cluster (with vip-manager) or to deploy an etcd cluster only.
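
For readers unfamiliar with compaction, here is a toy model of why this setting leads to the error. With auto-compaction enabled, etcd periodically discards old revisions, and a watcher that tries to resume from a discarded revision is rejected. The `history` type and its methods are hypothetical and simplified; this is not etcd's implementation, only a sketch of the behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errCompacted is a local stand-in for etcd's compaction error.
var errCompacted = errors.New("required revision has been compacted")

// history is a toy model of a revisioned key-value log. It is not etcd code.
type history struct {
	events     map[int64]string // revision -> value written at that revision
	current    int64            // latest revision
	compactRev int64            // revisions below this have been discarded
}

func newHistory() *history {
	return &history{events: make(map[int64]string)}
}

// Put records a new value at the next revision and returns that revision.
func (h *history) Put(value string) int64 {
	h.current++
	h.events[h.current] = value
	return h.current
}

// Compact discards all events with a revision lower than rev.
func (h *history) Compact(rev int64) {
	for r := range h.events {
		if r < rev {
			delete(h.events, r)
		}
	}
	if rev > h.compactRev {
		h.compactRev = rev
	}
}

// Since replays events from rev onward, or fails if rev was compacted away.
func (h *history) Since(rev int64) ([]string, error) {
	if rev < h.compactRev {
		return nil, errCompacted
	}
	var out []string
	for r := rev; r <= h.current; r++ {
		if v, ok := h.events[r]; ok {
			out = append(out, v)
		}
	}
	return out, nil
}

func main() {
	h := newHistory()
	h.Put("pgnode01") // revision 1
	h.Put("pgnode02") // revision 2
	h.Put("pgnode03") // revision 3
	h.Compact(3)

	// A watcher resuming from revision 1 is rejected.
	_, err := h.Since(1)
	fmt.Println(err)

	// The retained revision is still served.
	vals, _ := h.Since(3)
	fmt.Println(vals)
}
```

A client that remembers only an old revision, for example after a short etcd outage, ends up in the first case and has to re-read the current value before watching again.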

@SDV109

SDV109 commented Apr 16, 2024

@pashagolub, hi, there is new information.
Because of this problem, and until a new version with a fix is released, I turned off etcd maintenance on our DB clusters today, and I noticed something interesting.
On the cluster running vip-manager 2.3.0, restarting etcd made vip-manager log the error described above, and the VIP disappeared until vip-manager was restarted. On the second cluster, which runs vip-manager 2.1.0, I did not find any problems with vip-manager. This may be due to hardware: the second cluster has better hardware, so etcd restarted faster there. Still, perhaps this gives a hint toward a solution.
I will try to reproduce this scenario in my test environment with different versions as soon as possible and will report back in another comment.

@pashagolub pashagolub linked a pull request Apr 19, 2024 that will close this issue
@pashagolub
Collaborator

Hi people. Would you please try #217?

Thanks in advance!

@SDV109

SDV109 commented Apr 21, 2024

@pashagolub Hi!
I've done several iterations of testing.
In the first case, I removed the old version with apt-get --purge --autoremove remove vip-manager and then installed the version from the 208-handle-etcd-compressed-revisions branch. In this setup the error was extremely difficult to reproduce, but in one of many attempts I managed to trigger it.
Next, I deployed the cluster from scratch and installed vip-manager directly from the 208-handle-etcd-compressed-revisions branch. This time, no matter how many attempts I made, the error could not be reproduced. I tried stopping etcd on one of the 3 nodes and restarting etcd on each of the nodes, but the error did not reproduce.

Upd:
Perhaps the error in the first case is related to incomplete removal of vip-manager via apt-get --purge --autoremove remove vip-manager: after the removal, even if I restart the VM, the vip-manager service continues to run although the binaries are no longer present.
After_remove_vip-manager.txt
