PITR fails when 'pgbackrest' in patroni_create_replica_methods #588
Hello @FilipBB. Postgres cannot be started (the reason needs to be investigated); this can be seen in the ansible log:
Have you tried running Postgres manually? You would most likely get the same error as the automation. Could you provide the postgres log from the 10.60.1.156 server?
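For reference, a minimal way to do that check by hand might look like this (a sketch: the Debian pg_ctlcluster wrapper, version 16 and cluster name "main" are assumed from the command quoted later in this issue, and the log directory comes from the error output):

```bash
# Sketch: start the cluster by hand the same way the task does, then inspect the newest
# Postgres log file (the exact log filename depends on the log_filename setting).
sudo pg_ctlcluster 16 main start -o '-c hot_standby=off'; echo "rc=$?"
sudo tail -n 100 "$(ls -t /var/log/postgresql/* | head -n 1)"
```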
I ran the command that the ansible task runs, directly on the replica, namely:
The log I posted is the postgres log from the replica where this command failed. Here is the ansible output, command output and postgres log after running remove_cluster.yml and deploy_cluster.yml again (pgnode1 is 10.60.1.155):
@FilipBB, thanks. It's strange that you get a return code of 1; I'll try to reproduce it. Could you also execute these commands (for comparison):
Maybe it's because I am using a custom data dir?
Finally, it works if I put in the correct data dir:
But of course, I am using /data/db as the custom data dir on all nodes, so I don't understand why it works on the master but not the replicas...
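For illustration, the comparison presumably looked something like the following (a sketch; the binary path and PostgreSQL version are assumptions, while /data/db and hot_standby=off come from this thread). The Debian wrapper resolves the data directory from the cluster config under /etc/postgresql/16/main/, whereas pg_ctl is pointed at it explicitly:

```bash
# Debian wrapper: data directory taken from /etc/postgresql/16/main/postgresql.conf
sudo pg_ctlcluster 16 main start -o '-c hot_standby=off'

# Direct pg_ctl call with the custom data directory given explicitly
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl -D /data/db start -o '-c hot_standby=off'
```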
Thank you. Now I see that you have no problems when Postgres is started with the data directory specified directly. I can't yet say exactly why you get the error when using pg_ctlcluster, but since I had plans to remove its use from the code anyway, I will do that, and when the PR is ready I will ask you to test the playbook again, if you don't mind.
Ok, will do, thanks for looking into it so quickly.
It's strange: it gets stuck on starting the database on the master node. It seems to be stuck in a loop of restoring and trying to start up, but I don't see any specific error messages. This happens whether I set 'pgbackrest' in patroni_create_replica_methods or not. If I revert to the old main.yml then everything works as before (PITR works with 'basebackup' in patroni_create_replica_methods, but not with 'pgbackrest'). FYI, I am applying the new main.yml on top of 1.9.0.
pgbackrest logs:
The intended master seems to be starting up as a standby according to the patroni logs, which correlates with what I can find on google for the "postgres@postgres FATAL: the database system is starting up" message.
@FilipBB Everything is fine: the database restores WAL files from the backup repository to reach the specified recovery target. You just need to wait for the "Waiting for PostgreSQL Recovery to complete (WAL apply)" task to finish. Next, when the promote is executed, Postgres will be stopped. Why the Primary started restoring WAL again at startup (after the promote) is a question... As soon as the recovery is completed, the database will be available.
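For reference, one way to watch this phase from the node itself (a sketch; paths and version are assumptions): while the server still refuses connections, the postgres log prints a "restored log file ... from archive" line for each WAL segment pulled from the repository, and pg_controldata reports the cluster state without needing a connection.

```bash
# Follow WAL restore progress in the newest Postgres log file
sudo tail -f "$(ls -t /var/log/postgresql/* | head -n 1)" | grep --line-buffered 'restored log file'

# Check the cluster state directly from the control file (no connection required)
sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata -D /data/db | grep -i 'cluster state'
```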
Yeah, it seems like this node is not designated as the master for some reason, but I have not changed variables or anything else from 1.9.0.
FYI this is the inventory file:
At startup, Patroni always runs Postgres as a replica, and only after it has written the leader key to DCS (etcd) does it promote it to Primary. Why it started restoring WAL again after startup is a question. I think we could automate temporarily disabling
So it seems like etcd elected pgnode1 (10.60.1.155) to be master, instead of what I set in ansible to be the master (pgnode0 10.60.1.154):
Patroni also sees it as a replica, with no other nodes in the cluster:
When I ran remove_cluster.yml I did set 'remove_postgres=true' and 'remove_etcd=true', but this should be ok, right? I.e., if I created a brand new cluster to restore the db, this would be the case anyway.
The etcd cluster and the Postgres cluster are different clusters, and etcd has its own leader. Judging by the patronictl output, your cluster is still recovering (State: starting).
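For reference, the two can be inspected independently (a sketch; the endpoints, TLS flags, and Patroni config path are assumptions that depend on the deployment):

```bash
# etcd's own leader (run on an etcd node; add --endpoints/--cacert etc. as needed) -- see the "IS LEADER" column
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table

# Patroni's view of the Postgres cluster: member roles, state (e.g. "starting"), timelines
patronictl -c /etc/patroni/patroni.yml list
```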
Ok, I see. The postgres and ansible logs just show the same as I posted above: postgres is stuck in a recovery/startup loop and ansible is waiting for it to finish starting.
OK, let's try this idea
When I implement this change, I will write to you so that we can test it.
Ok, sure, thanks.
@FilipBB, hi! Could you please describe the exact sequence of steps you performed?
Basically, here's the procedure I used:
Then I do a restore to a point just after the second table was added (I am running this restore days after the original backups; all the WAL files have already been archived to S3):
Here are the relevant restore vars I have set:
The thing is, if it were an issue with the WAL files, then I would expect any PITR to fail. But if I go back to the 1.9.0 roles/patroni/main.yml file and set only 'basebackup' in 'patroni_create_replica_methods', then I am able to restore to the same point in time without issue.
Just for information (this is not the cause of the problem): it is not necessary to do all of this; you can just launch the playbook with a tag to restore the current cluster:
This is described here: https://github.com/vitabaks/postgresql_cluster?tab=readme-ov-file#restore-and-cloning ("Point-In-Time-Recovery" section).
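That is, roughly (a sketch; the inventory filename is an assumption):

```bash
# Restore the existing cluster in place to the recovery target configured in the variables
ansible-playbook -i inventory deploy_pgcluster.yml --tags point_in_time_recovery
```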
So, interestingly, I did the following:
This gave me an error in the pg log saying that max_connections was set too low (100) while it had been 500 on the master, and pg failed to start. I checked the postgresql.conf in the data dir and max_connections was indeed 100; the conf file looks like just the default conf file.
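For what it's worth, a quick way to confirm what the restored data directory actually contains (a sketch; the path is the custom data dir mentioned earlier in the thread):

```bash
# Show the effective max_connections line(s) in the restored config
sudo grep -nE '^\s*max_connections' /data/db/postgresql.conf
```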
Patroni started up fine and the node became master. Now I can run the "deploy_pgcluster.yml --tags point_in_time_recovery" playbook and everything is fine on the master, but the replicas have issues. Log on pgnode1 (replica):
I spoke too soon: if I run "deploy_pgcluster.yml --tags point_in_time_recovery" again after the previous steps, it will fail on the master as before (recovery/restart loop). So I have to do:
That will recover and start the db on the master; then I can run "deploy_pgcluster.yml --tags point_in_time_recovery" again, which will work, except for the timeline issues on the replicas, which I guess is a separate issue. I'm not sure if this helps; maybe I'm just confusing the issue...
@FilipBB, Hi! Today I tried the following scenario:
```yaml
patroni_cluster_bootstrap_method: "pgbackrest"  # or "wal-g", "pgbackrest", "pg_probackup"
patroni_create_replica_methods:
  - pgbackrest
```
I started the recovery scenarios:
```yaml
pgbackrest_patroni_cluster_restore_command:
  # '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore'  # restore from latest backup
  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2024-04-16 12:12:00+03" --delta restore'  # Point-in-Time Recovery (example)
```
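For reference, a sketch of how this scenario could be launched (the inventory filename is an assumption): with patroni_cluster_bootstrap_method set to "pgbackrest", a regular deploy bootstraps the new cluster from the backup repository using the restore command above.

```bash
# Deploy a new cluster bootstrapped from the pgbackrest repository (PITR target set above)
ansible-playbook -i inventory deploy_pgcluster.yml
```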
@FilipBB, could you try to repeat this scenario from scratch with the current version of postgresql_cluster?
Is the problem still relevant?
Hello, sorry I don't have a lot of time to do testing right now. I was able to restore full backups without issues so I'm not using PITR for the time being. Please close the ticket if you cannot reproduce and I will reopen it when I can do some more thorough testing. |
When recovering from a backup using PITR with pgbackrest, the following command is run in the "Start PostgreSQL for Recovery" task:
"/usr/bin/pg_ctlcluster {{ postgresql_version }} {{ postgresql_cluster_name }} start -o '-c hot_standby=off'"
When run on the master, this command returns 0 because postgres stays running after recovery. On the replicas, this command returns a non-zero exit code:
fatal: [10.60.1.156]: FAILED! => {"changed": true, "cmd": ["/usr/bin/pg_ctlcluster", "16", "main", "start", "-o", "-c hot_standby=off"], "delta": "0:00:01.486977", "end": "2024-02-28 17:44:11.565582", "msg": "non-zero return code", "rc": 1, "start": "2024-02-28 17:44:10.078605", "stderr": "The PostgreSQL server failed to start. Please check the log output:\n2024-02-28 17:44:10 UTC [95842-1] LOG: redirecting log output to logging collector process\n2024-02-28 17:44:10 UTC [95842-2] HINT: Future log output will appear in directory \"/var/log/postgresql\".", "stderr_lines": ["The PostgreSQL server failed to start. Please check the log output:", "2024-02-28 17:44:10 UTC [95842-1] LOG: redirecting log output to logging collector process", "2024-02-28 17:44:10 UTC [95842-2] HINT: Future log output will appear in directory \"/var/log/postgresql\"."], "stdout": "", "stdout_lines": []}
The postgres log from the replica:
It does work with a full restore, just not a PITR. If I remove 'pgbackrest' from patroni_create_replica_methods and leave only 'basebackup' then PITR works as well. Am I doing something wrong?